Full-stack data ingestion in AWS (overview)
Context and goal
This pipeline ingests data daily from a third-party API into AWS. The goal is to keep a reliable, traceable flow from source to curated datasets without exposing internal implementation details.
What I mean by “full-stack” here: the project covers ingestion logic, data modeling, orchestration, and infrastructure as code, so the whole path is owned end-to-end.
High-level architecture
The flow is intentionally simple:
API → Lambda → S3 (Raw / Prepare / Refine) → PySpark → curated outputs
Key design principles:
- Reproducibility: the same job runs locally and in production.
- Traceability: each layer has a clear contract.
- Separation of concerns: ingest, transform, and publish are independent steps.
Local development (parity with production)
Locally, the pipeline runs inside a Docker image that bundles PySpark. This gives two benefits:
- The environment is consistent across machines.
- The code behaves the same as in production, which avoids “it works on my laptop” surprises.
Running locally helps validate transforms and schema changes without touching production data.
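One simple way to get this parity is to resolve input/output locations from the environment, so the same job code runs against local sample files in Docker and against S3 in production. The sketch below is illustrative: the `PIPELINE_ENV` variable and the bucket/path names are assumptions, not the project's actual configuration.

```python
import os

def resolve_io_paths(env=os.environ):
    """Resolve data-zone paths from the environment so the same job
    runs unchanged on a laptop and in production.

    PIPELINE_ENV and the bucket/path names below are illustrative.
    """
    if env.get("PIPELINE_ENV", "local") == "prod":
        return {
            "raw": "s3://example-pipeline/raw/",
            "prepare": "s3://example-pipeline/prepare/",
        }
    # Locally, point at sample files mounted into the Docker container.
    return {
        "raw": "./data/raw/",
        "prepare": "./data/prepare/",
    }
```

Because only the paths differ between environments, validating a transform locally exercises exactly the code that will run in production.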
Ingestion (API → Lambda)
Ingestion is handled by a Lambda function that consumes the API and writes to the Raw zone. At a high level:
- It pulls data from the API, following pagination until every page has been fetched.
- It respects rate limits.
- It logs what was fetched so runs can be audited.
This step is designed to be idempotent and safe to retry.
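The pagination and rate-limiting loop can be sketched as a small helper. This is a minimal illustration, not the actual Lambda code: `fetch_page` is an assumed callable returning `(items, next_cursor)`, with `None` signalling the last page. Idempotency then comes from writing the result to a deterministic key (for example, one object per run date), so a retry overwrites rather than duplicates.

```python
import time

def fetch_all_pages(fetch_page, rate_limit_delay=1.0):
    """Pull every page from a paginated API.

    fetch_page(cursor) is assumed to return (items, next_cursor),
    where next_cursor is None on the last page. The delay between
    requests is a simple way to respect the API's rate limit.
    """
    items, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            return items
        time.sleep(rate_limit_delay)
```

Logging the cursor and item count per page alongside this loop is what makes individual runs auditable afterwards.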
Data layers (Raw / Prepare / Refine)
The pipeline uses three data zones:
- Raw: unmodified source payloads for full traceability.
- Prepare: cleaned, normalized data with basic validation.
- Refine: curated datasets ready for downstream use.
Each layer has clear input/output expectations. This separation makes debugging and backfills easier.
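The Raw-to-Prepare contract can be made concrete with a per-record normalization step. The real transforms run in PySpark; the pure-Python sketch below only illustrates the shape of the contract, and the field names (`id`, `amount`, `fetched_at`) are invented for the example.

```python
def to_prepare(record):
    """Normalize one Raw record into the Prepare layer's shape.

    The Prepare contract here: required fields must be present,
    and types are coerced to a known schema. Field names are
    illustrative, not the project's actual schema.
    """
    if record.get("id") is None:
        raise ValueError("record missing required 'id'")
    return {
        "id": str(record["id"]),
        "amount": float(record.get("amount", 0)),
        "fetched_at": record.get("fetched_at"),
    }
```

Keeping validation at this boundary means a bad upstream payload fails loudly in Prepare, while the untouched copy in Raw is still available for debugging and backfills.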
Orchestration
The pipeline runs once per day on a schedule. If something fails, retries are handled at the job level, and failures are visible in logs/alerts.
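Job-level retries can be as simple as a wrapper with exponential backoff. This is a sketch under assumptions: the attempt count and delay are placeholder values, and a managed scheduler's built-in retry policy could replace it entirely.

```python
import time

def run_with_retries(job, max_attempts=3, base_delay=60):
    """Run a job with simple job-level retries.

    Waits base_delay, then 2x, 4x, ... between attempts
    (exponential backoff); re-raises after the final attempt
    so the failure surfaces in logs and alerts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Because the ingestion step is idempotent, retrying the whole daily run this way is safe: a repeated run overwrites the same keys rather than duplicating data.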
Infrastructure as code (Terraform)
Terraform defines the infrastructure: storage, IAM roles, the Lambda function, and other AWS resources. This makes:
- Environments reproducible.
- Changes auditable.
- The pipeline portable to new accounts or regions.
Security and access (high level)
Access is scoped with least-privilege permissions. Secrets are not hardcoded and are managed outside the codebase.
Next steps
This is the overview. In future posts, I may add:
- Monitoring dashboards and alerting strategy.
- Schema evolution and data quality checks.
- Performance tuning for larger volumes.