ETL Pipeline Intake Questionnaire
Fill in all sections to capture your complete Bronze → Silver → Gold pipeline requirements. Fields marked * are required.
Bronze
Source
— Data source configuration
Source type(s) *
CSV / flat files
REST API
PostgreSQL
MySQL
S3 / cloud storage
Kafka / streaming
Other
Source description *
e.g. Daily CSV export of sales transactions from our ERP, ~50k rows/day
Estimated volume per load
e.g. 50,000 rows / 200 MB
Load frequency
Select one:
Real-time / streaming
Hourly
Daily
Weekly
Monthly
On-demand / manual
Key fields / schema
e.g. order_id (int), customer_id (int), amount (decimal), created_at (timestamp)
Incremental load key
e.g. updated_at, created_date, sequence_id — leave blank for a full load
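For reference when answering this field: an incremental load driven by a key such as updated_at usually reduces to a watermark filter over the source rows. A minimal plain-Python sketch (the function and column names are illustrative, not part of the form):

```python
from datetime import datetime

def incremental_filter(rows, key, watermark):
    """Hypothetical helper: keep only rows whose incremental key
    (e.g. updated_at) is newer than the last stored watermark."""
    return [r for r in rows if r[key] > watermark]

rows = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 3)},
]
# Only rows touched after the last successful load survive.
new_rows = incremental_filter(rows, "updated_at", datetime(2024, 1, 2))
```

Leaving this field blank implies the filter is skipped and every load re-reads the full source.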
Bronze
Ingestion
— Raw ingestion requirements
Bronze table name *
e.g. sales_orders
Deduplication strategy
Select one:
None — accept duplicates
Deduplicate on ingest
Post-ingest dedupe step
Natural / business key columns
e.g. order_id OR customer_id + order_date
Additional bronze requirements
e.g. Must retain original file path, need error quarantine table…
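To illustrate the "Deduplicate on ingest" option against a natural-key answer: a minimal sketch that keeps the first row seen per key. In PostgreSQL this is commonly an INSERT … ON CONFLICT DO NOTHING on the same key columns; all names below are illustrative:

```python
def dedupe(rows, key_cols):
    """Keep the first row seen for each natural/business key,
    e.g. ('order_id',) or ('customer_id', 'order_date')."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"order_id": 1, "amount": 10},
    {"order_id": 1, "amount": 10},  # duplicate on the natural key
    {"order_id": 2, "amount": 7},
]
deduped = dedupe(rows, ("order_id",))
```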
Silver
Cleaning
— Transformation & validation rules
Data quality checks required
Null checks
Type casting
Range / value validation
Regex / format
Referential integrity
Deduplication
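The checks listed above typically compile into small row-level validators at the silver layer. A minimal sketch, assuming illustrative column names and thresholds:

```python
import re

def run_checks(row):
    """Null check, type casting, range validation, and format check
    from the list above; column names and rules are illustrative."""
    errors = []
    if row.get("order_id") is None:                  # null check
        errors.append("order_id is null")
    try:
        row["amount"] = float(row["amount"])         # type casting
    except (TypeError, ValueError):
        errors.append("amount not numeric")
    else:
        if row["amount"] < 0:                        # range validation
            errors.append("amount negative")
    if not re.fullmatch(r"[A-Z]{2}", row.get("country", "")):  # format
        errors.append("country not ISO alpha-2")
    return errors

errs = run_checks({"order_id": 1, "amount": "12.50", "country": "GB"})
```

Rows with a non-empty error list then flow into whichever bad-row handling option is chosen below.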
Transformations needed
e.g. Normalise country codes to ISO 3166, parse amount from string to decimal…
Silver table name *
e.g. sales_orders_clean
Bad-row handling
Select one:
Reject & log
Quarantine table
Coerce & flag
Drop silently
SCD / history tracking needed?
Select one:
No — latest value only
SCD Type 1 — overwrite
SCD Type 2 — full history rows
SCD Type 4 — separate history table
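For the SCD Type 2 option, a minimal sketch of the close-and-insert pattern on full history rows (column names such as is_current and valid_from are illustrative assumptions):

```python
from datetime import date

def scd2_upsert(history, key, new_attrs, today):
    """SCD Type 2 sketch: close the current row for `key` when an
    attribute changes, then append a fresh current row."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return history           # unchanged: nothing to do
            row["is_current"] = False
            row["valid_to"] = today      # close out the old version
    history.append({"key": key, "attrs": new_attrs,
                    "valid_from": today, "valid_to": None,
                    "is_current": True})
    return history

hist = []
scd2_upsert(hist, 42, {"segment": "retail"}, date(2024, 1, 1))
scd2_upsert(hist, 42, {"segment": "wholesale"}, date(2024, 6, 1))
```

Type 1 would instead overwrite the attributes in place; Type 4 would move the closed rows into a separate history table.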
Gold
Serving
— Business logic & aggregations
What business question should the gold layer answer? *
e.g. Daily revenue per product category, monthly customer cohort retention…
Gold table(s) / views needed
e.g. daily_revenue_by_category, customer_ltv_monthly
Joins / lookups required
e.g. Join orders to customers on customer_id, enrich with product dimension table…
Aggregation dimensions & metrics
e.g. GROUP BY region, product_category, date — SUM(revenue), COUNT(orders)
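The aggregation example above maps onto a simple group-by. A plain-Python sketch of GROUP BY region, product_category with SUM(revenue) and COUNT(orders) (all names are illustrative):

```python
from collections import defaultdict

def aggregate(rows, dims, metric):
    """Group rows by the listed dimensions; sum one metric and
    count rows per group, mirroring the example above."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        key = tuple(r[d] for d in dims)
        totals[key] += r[metric]     # SUM(revenue)
        counts[key] += 1             # COUNT(orders)
    return {k: {"revenue": totals[k], "orders": counts[k]} for k in totals}

rows = [
    {"region": "EU", "product_category": "toys", "revenue": 10.0},
    {"region": "EU", "product_category": "toys", "revenue": 5.0},
    {"region": "US", "product_category": "toys", "revenue": 3.0},
]
gold = aggregate(rows, ("region", "product_category"), "revenue")
```

In practice this would usually be a SQL GROUP BY materialized into the gold table or view named above.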
Downstream consumers
BI tool
App / API
Data science / ML
Scheduled reports
Other
Infra
Infrastructure
— PostgreSQL & environment setup
PostgreSQL version
e.g. 15, 16
Environment
Select one:
Dev
Staging
Prod
Dev + Staging + Prod
Connection pool / concurrency expectations
e.g. Single-threaded daily job, or 5 parallel workers
Orchestration / scheduler
Select one:
Cron
Apache Airflow
Prefect
dbt Cloud
Luigi
Manual / ad-hoc
Python libraries already in use
e.g. pandas, SQLAlchemy, psycopg2, pydantic
Ops
Operations
— Observability, alerting & recovery
Required operational features
Structured logging
Row-count reconciliation
Audit table
Failure alerts
Retry with backoff
Idempotent re-runs
Backfill support
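"Retry with backoff" from the checklist above can be sketched as a small wrapper around a load step; the attempt count and delays are illustrative:

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Re-run a failing step with exponentially growing waits
    (1s, 2s, 4s, ...); re-raise once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                  # exhausted: surface the error
            sleep(base_delay * 2 ** attempt)

calls = []
def flaky():
    """Stand-in for a load step that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry_with_backoff(flaky, attempts=3, sleep=lambda s: None)
```

Pairing this with idempotent re-runs (so a retried step can safely repeat work already done) is what makes automated recovery safe.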
SLA / latency requirement
e.g. Gold data must be available by 6 AM UTC each day
Anything else to capture?
Security constraints, compliance requirements, quirks of the source data…