Full-stack data ingestion in AWS (overview)


Context and goal

This pipeline ingests data daily from a third-party API into AWS. The goal is to keep a reliable, traceable flow from source to curated datasets without exposing internal implementation details.

What I mean by “full-stack” here: the project covers ingestion logic, data modeling, orchestration, and infrastructure as code, so the whole path is owned end-to-end.


High-level architecture

The flow is intentionally simple:

API → Lambda → S3 (Raw / Prepare / Refine) → PySpark → curated outputs

Key design principles:

  • Reproducibility: the same job runs locally and in production.
  • Traceability: each layer has a clear contract.
  • Separation of concerns: ingest, transform, and publish are independent steps.
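The separation of concerns above can be sketched as three independent steps wired together by a thin runner. This is a minimal illustration, not the project's actual code; all function names and path conventions are placeholders:

```python
# Minimal sketch of the ingest -> transform -> publish flow.
# Function names and path layouts are illustrative placeholders.

def ingest(run_date: str) -> str:
    """Pull from the API and land payloads in the Raw zone; return the Raw path."""
    return f"raw/{run_date}/"

def transform(raw_path: str) -> str:
    """Clean and normalize Raw data into the Prepare zone."""
    return raw_path.replace("raw/", "prepare/")

def publish(prepare_path: str) -> str:
    """Build curated datasets in the Refine zone."""
    return prepare_path.replace("prepare/", "refine/")

def run_pipeline(run_date: str) -> str:
    # Each step depends only on the previous step's output location,
    # so any step can be re-run in isolation (useful for backfills).
    return publish(transform(ingest(run_date)))
```

Because each step takes a location in and returns a location out, a failed transform can be re-run without re-ingesting, which is the practical payoff of the separation.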

Local development (parity with production)

Locally, the pipeline runs inside a Docker container built from a PySpark base image. This gives two benefits:

  • The environment is consistent across machines.
  • The code behaves the same as in production, which avoids “it works on my laptop” surprises.

Running locally helps validate transforms and schema changes without touching production data.
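One way to get this local/production parity is to resolve storage locations from the environment, so the same transform code reads local paths in the container and S3 paths in AWS. The variable name and bucket below are assumptions for illustration, not the project's real configuration:

```python
import os

def storage_root(zone: str) -> str:
    """Resolve a data-zone root that works both locally and in AWS.

    PIPELINE_ENV and the bucket name are illustrative; the real
    project may wire its configuration differently.
    """
    if os.environ.get("PIPELINE_ENV") == "prod":
        return f"s3://my-data-bucket/{zone}/"  # hypothetical bucket name
    return f"/data/{zone}/"                    # path inside the Docker container
```

With this in place, a transform validated against `/data/raw/` locally runs unchanged against the S3 Raw zone in production.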


Ingestion (API → Lambda)

Ingestion is handled by a Lambda function that consumes the API and writes to the Raw zone. At a high level:

  • It pulls data from the API, following its pagination until every page has been fetched.
  • It throttles requests to stay within the API's rate limits.
  • It logs what was fetched so runs can be audited.

This step is designed to be idempotent, so a failed run can be retried safely without duplicating data in the Raw zone.
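The pagination and rate-limiting behaviour can be sketched as a small generator. `fetch_page` here is a stand-in for the real API call, and the logging and interval values are illustrative, not the project's actual settings:

```python
import time
from typing import Callable, Iterator

def fetch_all_pages(
    fetch_page: Callable[[int], list],
    min_interval: float = 0.2,
) -> Iterator[list]:
    """Iterate through all pages of a paginated API.

    `fetch_page(page_number)` is a placeholder for the real API call;
    it should return an empty list when there are no more pages.
    `min_interval` spaces out requests to respect rate limits.
    """
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        # Log what was fetched so runs can be audited.
        print(f"page={page} records={len(records)}")
        yield records
        page += 1
        time.sleep(min_interval)
```

Pulling pages lazily like this keeps memory flat and makes it easy to write each page to the Raw zone as it arrives.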


Data layers (Raw / Prepare / Refine)

The pipeline uses three data zones:

  • Raw: unmodified source payloads for full traceability.
  • Prepare: cleaned, normalized data with basic validation.
  • Refine: curated datasets ready for downstream use.

Each layer has clear input/output expectations. This separation makes debugging and backfills easier.
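As an example of what a layer contract can look like in practice, here is a sketch of a Raw-to-Prepare step that normalizes field casing and enforces required fields. The field names are hypothetical, standing in for the project's real schema:

```python
REQUIRED_FIELDS = ("id", "timestamp")  # illustrative contract for the Prepare zone

def prepare(raw_records: list[dict]) -> list[dict]:
    """Raw -> Prepare: keep payloads that satisfy the layer's contract,
    normalizing field casing along the way.

    The required fields are placeholders for the real schema.
    """
    cleaned = []
    for record in raw_records:
        normalized = {key.lower(): value for key, value in record.items()}
        # Basic validation: drop records missing any required field.
        if all(normalized.get(field) is not None for field in REQUIRED_FIELDS):
            cleaned.append(normalized)
    return cleaned
```

Keeping the contract explicit like this is what makes debugging easier: a record missing from Refine can be traced back to exactly which layer rejected it and why.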


Orchestration

The pipeline runs once per day on a schedule. If something fails, retries are handled at the job level, and failures are visible in logs/alerts.
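Job-level retries can be sketched as a small wrapper with exponential backoff. The attempt counts and delays below are illustrative defaults, not the project's real settings:

```python
import time

def run_with_retries(job, max_attempts: int = 3, base_delay: float = 1.0):
    """Run a daily job, retrying on failure with exponential backoff.

    `job` is any zero-argument callable; attempt counts and delays
    are illustrative, not the pipeline's actual configuration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            # Surface each failure so it is visible in logs/alerts.
            print(f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Because the ingestion step is idempotent, retrying a whole day's run this way is safe: a second attempt overlaps cleanly with whatever the first attempt managed to write.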


Infrastructure as code (Terraform)

Terraform defines the infrastructure: storage, IAM roles, the Lambda function, and other AWS resources. This makes:

  • Environments reproducible.
  • Changes auditable.
  • The pipeline portable to new accounts or regions.

Security and access (high level)

Access is scoped with least-privilege permissions. Secrets are not hardcoded and are managed outside the codebase.
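A minimal way to keep secrets out of the codebase is to fail fast if they are not injected at runtime. The variable name below is hypothetical; in AWS the value would typically come from a secret store (e.g. Secrets Manager) rather than from the repository:

```python
import os

def get_api_token() -> str:
    """Load the third-party API token from the environment.

    API_TOKEN is an illustrative name; the value is expected to be
    injected by the runtime (e.g. from a secret store), never
    hardcoded in the repository.
    """
    token = os.environ.get("API_TOKEN")
    if not token:
        raise RuntimeError("API_TOKEN is not set; refusing to start")
    return token
```

Failing at startup when the secret is absent is deliberate: it turns a misconfiguration into an immediate, visible error instead of a half-completed run.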


Next steps

This is the overview. In future posts, I may cover:

  • Monitoring dashboards and alerting strategy.
  • Schema evolution and data quality checks.
  • Performance tuning for larger volumes.