Full-stack data ingestion in AWS (overview)


Context and goal

This pipeline ingests data daily from a third-party API into AWS. The goal is to keep a reliable, traceable flow from source to curated datasets without exposing internal implementation details.

What I mean by “full-stack” here: the project covers ingestion logic, data modeling, orchestration, and infrastructure as code, so the whole path is owned end-to-end.


High-level architecture

The flow is intentionally simple:

API → Lambda → S3 (Raw / Prepare / Refine) → PySpark → curated outputs

Key design principles:

  • Reproducibility: the same job runs locally and in production.
  • Traceability: each layer has a clear contract.
  • Separation of concerns: ingest, transform, and publish are independent steps.
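The separation of concerns above can be sketched as three independent steps wired together by a thin runner. This is a minimal illustration, not the project's actual code; all function names and path conventions are placeholders:

```python
# Minimal sketch of the ingest -> transform -> publish flow.
# Function names and path layouts are illustrative placeholders.

def ingest(run_date: str) -> str:
    """Pull from the API and land payloads in the Raw zone; return the Raw path."""
    return f"raw/{run_date}/"

def transform(raw_path: str) -> str:
    """Clean and normalize Raw data into the Prepare zone."""
    return raw_path.replace("raw/", "prepare/")

def publish(prepare_path: str) -> str:
    """Build curated datasets in the Refine zone."""
    return prepare_path.replace("prepare/", "refine/")

def run_pipeline(run_date: str) -> str:
    # Each step depends only on the previous step's output location,
    # so any step can be re-run in isolation (useful for backfills).
    return publish(transform(ingest(run_date)))
```

Because each step takes a location in and returns a location out, a failed transform can be re-run without re-ingesting, which is the practical payoff of the separation.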

Local development (parity with production)

Locally, the pipeline runs inside a Docker container built from a PySpark base image. This gives two benefits:

  • The environment is consistent across machines.
  • The code behaves the same as in production, which avoids “it works on my laptop” surprises.

Running locally helps validate transforms and schema changes without touching production data.
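One way to get this local/production parity is to resolve storage locations from the environment, so the same transform code reads local paths in the container and S3 paths in AWS. The variable name and bucket below are assumptions for illustration, not the project's real configuration:

```python
import os

def storage_root(zone: str) -> str:
    """Resolve a data-zone root that works both locally and in AWS.

    PIPELINE_ENV and the bucket name are illustrative; the real
    project may wire its configuration differently.
    """
    if os.environ.get("PIPELINE_ENV") == "prod":
        return f"s3://my-data-bucket/{zone}/"  # hypothetical bucket name
    return f"/data/{zone}/"                    # path inside the Docker container
```

With this in place, a transform validated against `/data/raw/` locally runs unchanged against the S3 Raw zone in production.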


Ingestion (API → Lambda)

Ingestion is handled by a Lambda function that consumes the API and writes to the Raw zone. At a high level:

  • It pulls data from the API, following its pagination until every page has been fetched.
  • It throttles requests to stay within the API's rate limits.
  • It logs what was fetched so runs can be audited.

This step is designed to be idempotent, so a failed run can be retried safely without duplicating data in the Raw zone.
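The pagination and rate-limiting behaviour can be sketched as a small generator. `fetch_page` here is a stand-in for the real API call, and the logging and interval values are illustrative, not the project's actual settings:

```python
import time
from typing import Callable, Iterator

def fetch_all_pages(
    fetch_page: Callable[[int], list],
    min_interval: float = 0.2,
) -> Iterator[list]:
    """Iterate through all pages of a paginated API.

    `fetch_page(page_number)` is a placeholder for the real API call;
    it should return an empty list when there are no more pages.
    `min_interval` spaces out requests to respect rate limits.
    """
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        # Log what was fetched so runs can be audited.
        print(f"page={page} records={len(records)}")
        yield records
        page += 1
        time.sleep(min_interval)
```

Pulling pages lazily like this keeps memory flat and makes it easy to write each page to the Raw zone as it arrives.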


Data layers (Raw / Prepare / Refine)

The pipeline uses three data zones:

  • Raw: unmodified source payloads for full traceability.
  • Prepare: cleaned, normalized data with basic validation.
  • Refine: curated datasets ready for downstream use.

Each layer has clear input/output expectations. This separation makes debugging and backfills easier.
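As an example of what a layer contract can look like in practice, here is a sketch of a Raw-to-Prepare step that normalizes field casing and enforces required fields. The field names are hypothetical, standing in for the project's real schema:

```python
REQUIRED_FIELDS = ("id", "timestamp")  # illustrative contract for the Prepare zone

def prepare(raw_records: list[dict]) -> list[dict]:
    """Raw -> Prepare: keep payloads that satisfy the layer's contract,
    normalizing field casing along the way.

    The required fields are placeholders for the real schema.
    """
    cleaned = []
    for record in raw_records:
        normalized = {key.lower(): value for key, value in record.items()}
        # Basic validation: drop records missing any required field.
        if all(normalized.get(field) is not None for field in REQUIRED_FIELDS):
            cleaned.append(normalized)
    return cleaned
```

Keeping the contract explicit like this is what makes debugging easier: a record missing from Refine can be traced back to exactly which layer rejected it and why.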


Orchestration

The pipeline runs once per day on a schedule. If something fails, retries are handled at the job level, and failures are visible in logs/alerts.
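Job-level retries can be sketched as a small wrapper with exponential backoff. The attempt counts and delays below are illustrative defaults, not the project's real settings:

```python
import time

def run_with_retries(job, max_attempts: int = 3, base_delay: float = 1.0):
    """Run a daily job, retrying on failure with exponential backoff.

    `job` is any zero-argument callable; attempt counts and delays
    are illustrative, not the pipeline's actual configuration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            # Surface each failure so it is visible in logs/alerts.
            print(f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Because the ingestion step is idempotent, retrying a whole day's run this way is safe: a second attempt overlaps cleanly with whatever the first attempt managed to write.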


Infrastructure as code (Terraform)

Terraform defines the infrastructure: storage, IAM roles, the Lambda function, and other AWS resources. This makes:

  • Environments reproducible.
  • Changes auditable.
  • The pipeline portable to new accounts or regions.

Security and access (high level)

Access is scoped with least-privilege permissions. Secrets are not hardcoded and are managed outside the codebase.
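A minimal way to keep secrets out of the codebase is to fail fast if they are not injected at runtime. The variable name below is hypothetical; in AWS the value would typically come from a secret store (e.g. Secrets Manager) rather than from the repository:

```python
import os

def get_api_token() -> str:
    """Load the third-party API token from the environment.

    API_TOKEN is an illustrative name; the value is expected to be
    injected by the runtime (e.g. from a secret store), never
    hardcoded in the repository.
    """
    token = os.environ.get("API_TOKEN")
    if not token:
        raise RuntimeError("API_TOKEN is not set; refusing to start")
    return token
```

Failing at startup when the secret is absent is deliberate: it turns a misconfiguration into an immediate, visible error instead of a half-completed run.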


Next steps

This is the overview. In future posts, I may cover:

  • Monitoring dashboards and alerting strategy.
  • Schema evolution and data quality checks.
  • Performance tuning for larger volumes.