ETL Pipeline Intake Questionnaire
Fill in all sections to capture your complete Bronze → Silver → Gold pipeline requirements. Fields marked * are required.
Bronze
Source
— Data source configuration
Source type(s) *
CSV / flat files
REST API
PostgreSQL
MySQL
S3 / cloud storage
Kafka / streaming
Other
Source description *
e.g. Daily CSV export of sales transactions from our ERP, ~50k rows/day
Estimated volume per load
e.g. 50,000 rows / 200 MB
Load frequency
Select one:
Real-time / streaming
Hourly
Daily
Weekly
Monthly
On-demand / manual
Key fields / schema
e.g. order_id (int), customer_id (int), amount (decimal), created_at (timestamp)
Incremental load key
e.g. updated_at, created_date, sequence_id — leave blank for a full load
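For reference when answering this field: an incremental load driven by a key such as updated_at usually reduces to a watermark filter over the source rows. A minimal plain-Python sketch (the function and column names are illustrative, not part of the form):

```python
from datetime import datetime

def incremental_filter(rows, key, watermark):
    """Hypothetical helper: keep only rows whose incremental key
    (e.g. updated_at) is newer than the last stored watermark."""
    return [r for r in rows if r[key] > watermark]

rows = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 3)},
]
# Only rows touched after the last successful load survive.
new_rows = incremental_filter(rows, "updated_at", datetime(2024, 1, 2))
```

Leaving this field blank implies the filter is skipped and every load re-reads the full source.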
Bronze
Ingestion
— Raw ingestion requirements
Bronze table name *
e.g. sales_orders
Deduplication strategy
Select one:
None — accept duplicates
Deduplicate on ingest
Post-ingest dedupe step
Natural / business key columns
e.g. order_id OR customer_id + order_date
Additional bronze requirements
e.g. Must retain original file path, need error quarantine table…
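To illustrate the "Deduplicate on ingest" option against a natural-key answer: a minimal sketch that keeps the first row seen per key. In PostgreSQL this is commonly an INSERT … ON CONFLICT DO NOTHING on the same key columns; all names below are illustrative:

```python
def dedupe(rows, key_cols):
    """Keep the first row seen for each natural/business key,
    e.g. ('order_id',) or ('customer_id', 'order_date')."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"order_id": 1, "amount": 10},
    {"order_id": 1, "amount": 10},  # duplicate on the natural key
    {"order_id": 2, "amount": 7},
]
deduped = dedupe(rows, ("order_id",))
```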
Silver
Cleaning
— Transformation & validation rules
Data quality checks required
Null checks
Type casting
Range / value validation
Regex / format
Referential integrity
Deduplication
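The checks listed above typically compile into small row-level validators at the silver layer. A minimal sketch, assuming illustrative column names and thresholds:

```python
import re

def run_checks(row):
    """Null check, type casting, range validation, and format check
    from the list above; column names and rules are illustrative."""
    errors = []
    if row.get("order_id") is None:                  # null check
        errors.append("order_id is null")
    try:
        row["amount"] = float(row["amount"])         # type casting
    except (TypeError, ValueError):
        errors.append("amount not numeric")
    else:
        if row["amount"] < 0:                        # range validation
            errors.append("amount negative")
    if not re.fullmatch(r"[A-Z]{2}", row.get("country", "")):  # format
        errors.append("country not ISO alpha-2")
    return errors

errs = run_checks({"order_id": 1, "amount": "12.50", "country": "GB"})
```

Rows with a non-empty error list then flow into whichever bad-row handling option is chosen below.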
Transformations needed
e.g. Normalise country codes to ISO 3166, parse amount from string to decimal…
Silver table name *
e.g. sales_orders_clean
Bad-row handling
Select one:
Reject & log
Quarantine table
Coerce & flag
Drop silently
SCD / history tracking needed?
Select one:
No — latest value only
SCD Type 1 — overwrite
SCD Type 2 — full history rows
SCD Type 4 — separate history table
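For the SCD Type 2 option, a minimal sketch of the close-and-insert pattern on full history rows (column names such as is_current and valid_from are illustrative assumptions):

```python
from datetime import date

def scd2_upsert(history, key, new_attrs, today):
    """SCD Type 2 sketch: close the current row for `key` when an
    attribute changes, then append a fresh current row."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return history           # unchanged: nothing to do
            row["is_current"] = False
            row["valid_to"] = today      # close out the old version
    history.append({"key": key, "attrs": new_attrs,
                    "valid_from": today, "valid_to": None,
                    "is_current": True})
    return history

hist = []
scd2_upsert(hist, 42, {"segment": "retail"}, date(2024, 1, 1))
scd2_upsert(hist, 42, {"segment": "wholesale"}, date(2024, 6, 1))
```

Type 1 would instead overwrite the attributes in place; Type 4 would move the closed rows into a separate history table.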
Gold
Serving
— Business logic & aggregations
What business question should the gold layer answer? *
e.g. Daily revenue per product category, monthly customer cohort retention…
Gold table(s) / views needed
e.g. daily_revenue_by_category, customer_ltv_monthly
Joins / lookups required
e.g. Join orders to customers on customer_id, enrich with product dimension table…
Aggregation dimensions & metrics
e.g. GROUP BY region, product_category, date — SUM(revenue), COUNT(orders)
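The aggregation example above maps onto a simple group-by. A plain-Python sketch of GROUP BY region, product_category with SUM(revenue) and COUNT(orders) (all names are illustrative):

```python
from collections import defaultdict

def aggregate(rows, dims, metric):
    """Group rows by the listed dimensions; sum one metric and
    count rows per group, mirroring the example above."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        key = tuple(r[d] for d in dims)
        totals[key] += r[metric]     # SUM(revenue)
        counts[key] += 1             # COUNT(orders)
    return {k: {"revenue": totals[k], "orders": counts[k]} for k in totals}

rows = [
    {"region": "EU", "product_category": "toys", "revenue": 10.0},
    {"region": "EU", "product_category": "toys", "revenue": 5.0},
    {"region": "US", "product_category": "toys", "revenue": 3.0},
]
gold = aggregate(rows, ("region", "product_category"), "revenue")
```

In practice this would usually be a SQL GROUP BY materialized into the gold table or view named above.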
Downstream consumers
BI tool
App / API
Data science / ML
Scheduled reports
Other
Infra
Infrastructure
— PostgreSQL & environment setup
PostgreSQL version
e.g. 15, 16
Environment
Select one:
Dev
Staging
Prod
Dev + Staging + Prod
Connection pool / concurrency expectations
e.g. Single-threaded daily job, or 5 parallel workers
Orchestration / scheduler
Select one:
Cron
Apache Airflow
Prefect
dbt Cloud
Luigi
Manual / ad-hoc
Python libraries already in use
e.g. pandas, SQLAlchemy, psycopg2, pydantic
Ops
Operations
— Observability, alerting & recovery
Required operational features
Structured logging
Row-count reconciliation
Audit table
Failure alerts
Retry with backoff
Idempotent re-runs
Backfill support
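"Retry with backoff" from the checklist above can be sketched as a small wrapper around a load step; the attempt count and delays are illustrative:

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Re-run a failing step with exponentially growing waits
    (1s, 2s, 4s, ...); re-raise once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                  # exhausted: surface the error
            sleep(base_delay * 2 ** attempt)

calls = []
def flaky():
    """Stand-in for a load step that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry_with_backoff(flaky, attempts=3, sleep=lambda s: None)
```

Pairing this with idempotent re-runs (so a retried step can safely repeat work already done) is what makes automated recovery safe.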
SLA / latency requirement
e.g. Gold data must be available by 6 AM UTC each day
Anything else to capture?
Security constraints, compliance requirements, quirks of the source data…