Design batch and streaming ETL architecture
System Design: End-to-End Data Platform for Product Analytics (Batch + Near-Real-Time)
Context
Design a scalable data platform for a large consumer product with web and mobile clients. The platform must power daily product analytics (e.g., DAU/MAU, retention, funnels, cohorts, experiments) and near-real-time dashboards (<5 minutes end-to-end) while supporting backfills and rigorous data quality.
Assume tens to hundreds of millions of daily events and multiple upstream systems (client telemetry, backend logs, relational OLTP for user/account, and third-party data). You may reference common technologies (e.g., Kafka, Flink/Spark, object store + lakehouse table format, a cloud data warehouse), but focus on design choices and trade-offs.
Requirements
-
Ingestion sources and formats
-
Identify sources (client events, backend logs, CDC from OLTP, third-party feeds) and wire formats (JSON/Protobuf/Avro on the wire; Parquet/Delta/Hudi/Iceberg in storage).
-
Storage and compute architecture
-
Describe the messaging/streaming layer, raw landing, staging, and modeled layers, and the batch/streaming compute engines.
-
Schema design by layer
-
Define schemas for raw ("bronze"), deduped/cleaned ("silver"), and modeled analytics ("gold"). Include a canonical event envelope, dimensions (users/devices/products/experiments), and fact tables (events, sessions, conversions).
-
Table partitioning and clustering
-
Propose partitioning and clustering/sorting for each major table to optimize scan cost and latency.
-
Idempotency, deduplication, and late/out-of-order events
-
Specify unique keys, event-time vs ingestion-time, watermarking, allowed lateness, and how to reconcile late data into aggregates.
-
Update patterns and history
-
State which layers are append-only vs upsert/merge. Explain SCD1 vs SCD2 for dimensions, identity resolution (anonymous → logged-in), and how you will run backfills safely.
-
Orchestration, dependencies, and failure recovery
-
Describe scheduling, dependency management, retries, checkpointing, and exactly-once/at-least-once guarantees.
-
Aggregations for daily/hourly/rolling metrics
-
Define how to compute daily/hourly windows and rolling windows (e.g., 7/28-day active, retention, funnel steps), both in streaming and batch.
-
Data quality and SLAs
-
Outline schema enforcement, validation tests, anomaly detection, freshness/completeness SLAs, and alerting.
-
Trade-offs
-
Discuss latency vs cost vs complexity; lambda vs kappa patterns; when to pre-aggregate vs compute on read; and real-time store choices.
Constraints & Assumptions
-
Preserve the scope, facts, inputs, and requested outputs from the prompt above.
-
If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
-
Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
-
Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
-
State explicit assumptions before making sizing or architecture decisions.
-
Prioritize the functional path first, then address reliability, security, observability, and rollout.
What a Strong Answer Covers
-
A scoped requirements summary with concrete non-goals and success metrics.
-
API, data model, architecture, consistency, capacity, and operations.
-
Reasoned trade-offs among simple and scalable designs, including bottlenecks and failure modes.
-
A validation, monitoring, migration, and launch plan appropriate for the risk level.
Follow-up Questions
-
What breaks first at 10x traffic or data volume?
-
How would you degrade gracefully during dependency failures?
-
What metrics and alerts would prove the design is healthy after launch?