System Design: End-to-End Data Platform for Product Analytics (Batch + Near-Real-Time)
Context
Design a scalable data platform for a large consumer product with web and mobile clients. The platform must power daily product analytics (e.g., DAU/MAU, retention, funnels, cohorts, experiments) and near-real-time dashboards (<5 minutes end-to-end) while supporting backfills and rigorous data quality.
Assume tens to hundreds of millions of daily events and multiple upstream systems (client telemetry, backend logs, relational OLTP for user/account, and third-party data). You may reference common technologies (e.g., Kafka, Flink/Spark, object store + lakehouse table format, a cloud data warehouse), but focus on design choices and trade-offs.
Requirements
- Ingestion sources and formats
  - Identify sources (client events, backend logs, CDC from OLTP, third-party feeds) and wire formats (JSON/Protobuf/Avro on the wire; Parquet/Delta/Hudi/Iceberg in storage). See the ingestion sketch after this list.
- Storage and compute architecture
  - Describe the messaging/streaming layer; the raw landing, staging, and modeled storage layers; and the batch/streaming compute engines.
- Schema design by layer
  - Define schemas for raw ("bronze"), deduped/cleaned ("silver"), and modeled analytics ("gold"). Include a canonical event envelope, dimensions (users/devices/products/experiments), and fact tables (events, sessions, conversions). See the envelope sketch after this list.
- Table partitioning and clustering
  - Propose partitioning and clustering/sorting for each major table to optimize scan cost and latency (DDL sketch below).
- Idempotency, deduplication, and late/out-of-order events
  - Specify unique keys, event-time vs ingestion-time, watermarking, allowed lateness, and how to reconcile late data into aggregates (dedup sketch below).
- Update patterns and history
  - State which layers are append-only vs upsert/merge. Explain SCD1 vs SCD2 for dimensions, identity resolution (anonymous → logged-in), and how you will run backfills safely (SCD2 sketch below).
- Orchestration, dependencies, and failure recovery
  - Describe scheduling, dependency management, retries, checkpointing, and exactly-once vs at-least-once guarantees (DAG sketch below).
- Aggregations for daily/hourly/rolling metrics
  - Define how to compute daily/hourly windows and rolling windows (e.g., 7/28-day active, retention, funnel steps) in both streaming and batch (rolling-window sketch below).
- Data quality and SLAs
  - Outline schema enforcement, validation tests, anomaly detection, freshness/completeness SLAs, and alerting (quality-check sketch below).
- Trade-offs
  - Discuss latency vs cost vs complexity; lambda vs kappa patterns; when to pre-aggregate vs compute on read; and real-time store choices.
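Illustrative sketches
The sketches below are non-normative examples that ground several of the requirements above. All broker addresses, topic names, table names, paths, and thresholds are hypothetical placeholders, not prescribed choices.

A minimal ingestion sketch, assuming Kafka as the transport and a Delta-format bronze table on object storage. The payload is kept opaque in bronze so that schema changes upstream do not break landing.

```python
# Land raw events from Kafka into a "bronze" Delta table with ingestion metadata.
# Broker, topic, and storage paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "product.events.v1")           # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

bronze = raw.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload_json"),   # keep the payload opaque in bronze
    F.col("topic"), F.col("partition"), F.col("offset"),   # provenance for replay/debugging
    F.col("timestamp").alias("kafka_timestamp"),
    F.current_timestamp().alias("ingested_at"),
    F.to_date(F.col("timestamp")).alias("ingest_date"),    # ingestion-time partition column
)

query = (
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "s3://lake/_checkpoints/bronze_events")  # placeholder path
    .partitionBy("ingest_date")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("s3://lake/bronze/events")                        # placeholder path
)
```

Design note: partitioning bronze by ingestion date (not event time) keeps landing idempotent and cheap; event-time organization is deferred to silver.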
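One possible canonical event envelope, expressed as a PySpark schema. Field names and types are illustrative assumptions to be negotiated with client and backend teams; the important properties are a client-generated `event_id` for deduplication and separate anonymous/user identifiers for later identity resolution.

```python
# Canonical event envelope as a PySpark StructType; field names are illustrative.
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, MapType, LongType
)

EVENT_ENVELOPE = StructType([
    StructField("event_id", StringType(), False),       # client-generated UUID; dedup/idempotency key
    StructField("event_name", StringType(), False),     # e.g. "page_view", "purchase"
    StructField("event_time", TimestampType(), False),  # event time stamped by the producer
    StructField("ingested_at", TimestampType(), True),  # stamped at ingestion
    StructField("schema_version", StringType(), True),  # envelope/contract version
    StructField("user_id", StringType(), True),         # null until the user logs in
    StructField("anonymous_id", StringType(), True),    # device/cookie-scoped identifier
    StructField("session_id", StringType(), True),
    StructField("device", MapType(StringType(), StringType()), True),   # os, model, app_version
    StructField("context", MapType(StringType(), StringType()), True),  # locale, geo, experiment assignments
    StructField("properties", StringType(), True),      # event-specific payload as a JSON string
    StructField("sequence_number", LongType(), True),   # per-session ordering hint
])
```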
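A hypothetical DDL sketch for the silver event fact table: partition by event date for pruning, and periodically co-locate data on the columns most used in filters and joins. `OPTIMIZE ... ZORDER BY` assumes an engine that supports it (e.g., Databricks or recent open-source Delta); other formats have analogous clustering/sorting mechanisms.

```python
# Partitioning plus periodic clustering for the silver fact table; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.fact_events (
        event_id    STRING,
        event_name  STRING,
        event_time  TIMESTAMP,
        event_date  DATE,
        user_id     STRING,
        session_id  STRING,
        properties  STRING
    )
    USING DELTA
    PARTITIONED BY (event_date)     -- daily partitions keep most queries to a few partitions
""")

# Maintenance job: compact small files and co-locate rows for common predicates.
spark.sql("OPTIMIZE silver.fact_events ZORDER BY (event_name, user_id)")
```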
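A dedup sketch in Structured Streaming, assuming the bronze table above: parse the payload, watermark on event time, and drop duplicate `event_id`s within a bounded lateness window. The 2-hour allowance is an assumption to tune against observed client delays; events later than the watermark would be reconciled by a periodic batch repair of the affected partitions.

```python
# Event-time dedup with bounded lateness; paths and the lateness bound are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

payload_schema = "event_id STRING, event_name STRING, event_time TIMESTAMP, user_id STRING, session_id STRING"

events = (
    spark.readStream.format("delta").load("s3://lake/bronze/events")       # placeholder path
    .withColumn("event", F.from_json("payload_json", payload_schema))
    .select("event.*", "ingested_at")
    .withColumn("event_date", F.to_date("event_time"))
    .withWatermark("event_time", "2 hours")              # allowed lateness; bounds dedup state
    .dropDuplicates(["event_id", "event_time"])          # include the watermark column so state can expire
)

(
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3://lake/_checkpoints/silver_events")  # placeholder path
    .partitionBy("event_date")
    .outputMode("append")
    .start("s3://lake/silver/fact_events")
)
```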
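A hedged SCD2 sketch for the user dimension using Delta MERGE: expire the current row when tracked attributes change, then append the new versions. Table, column, and attribute names (`plan_tier`, `country`, `effective_at`) are illustrative; identity resolution (anonymous → logged-in) is assumed to have been applied upstream when producing the `updates` set.

```python
# Two-step SCD2 update for a user dimension; all names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.table("staging.dim_user_updates")        # hypothetical staging table
dim = DeltaTable.forName(spark, "gold.dim_user")

# Step 1: expire current rows whose tracked attributes changed.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.user_id = s.user_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.plan_tier <> s.plan_tier OR t.country <> s.country",
        set={"is_current": "false", "valid_to": "s.effective_at"},
    )
    .execute()
)

# Step 2: append the new current versions (in practice, filter `updates`
# to changed or brand-new keys before this append).
new_rows = (
    updates
    .withColumn("valid_from", F.col("effective_at"))
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
)
new_rows.write.format("delta").mode("append").saveAsTable("gold.dim_user")
```

Because the merge is keyed on `user_id` and the current-row flag, re-running the job for the same `updates` batch is effectively idempotent, which is what makes backfills and retries safe at this layer.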
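An orchestration sketch using Airflow's TaskFlow API (assumes Airflow 2.x): a daily run with retries and an explicit bronze → silver → gold → quality-check chain. Task bodies are placeholders; in a real deployment they would submit Spark jobs or dbt runs, and `catchup=True` is what allows scheduler-driven backfills over historical dates.

```python
# Daily analytics pipeline DAG; schedule, retries, and task bodies are placeholders.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="0 2 * * *",                       # daily at 02:00 UTC, after upstream loads settle
    start_date=datetime(2024, 1, 1),
    catchup=True,                               # enables backfills for missed/historical dates
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
)
def product_analytics_daily():
    @task
    def load_silver(ds=None):
        # Placeholder: safe to retry because the silver load is an idempotent MERGE keyed on event_id.
        print(f"building silver for {ds}")

    @task
    def build_gold(ds=None):
        print(f"building gold aggregates for {ds}")

    @task
    def run_quality_checks(ds=None):
        print(f"validating freshness/completeness for {ds}")

    load_silver() >> build_gold() >> run_quality_checks()


product_analytics_daily()
```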
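A batch sketch for rolling 7/28-day active counts, assuming a hypothetical `gold.daily_active_users(user_id, activity_date)` table with one row per user per active day. The earliest dates will have incomplete windows, and the streaming path would typically maintain only the current day's partial counts while batch owns the history.

```python
# Rolling 7-day and 28-day active users from a daily activity table; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rolling = spark.sql("""
    WITH calendar AS (
        SELECT DISTINCT activity_date AS metric_date
        FROM gold.daily_active_users
    )
    SELECT
        c.metric_date,
        COUNT(DISTINCT CASE WHEN a.activity_date >= date_sub(c.metric_date, 6)
                            THEN a.user_id END)  AS active_7d,
        COUNT(DISTINCT a.user_id)                AS active_28d
    FROM calendar c
    JOIN gold.daily_active_users a
      ON a.activity_date BETWEEN date_sub(c.metric_date, 27) AND c.metric_date
    GROUP BY c.metric_date
""")

rolling.write.format("delta").mode("overwrite").saveAsTable("gold.rolling_active_users")
```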
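Finally, a lightweight freshness and completeness check run as a scheduled PySpark job; thresholds, table names, and the UTC session-timezone assumption are all placeholders, and a real deployment might use a framework such as Great Expectations or dbt tests instead.

```python
# Freshness and completeness checks over the silver fact table; thresholds are assumptions.
from datetime import datetime, timezone

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.table("silver.fact_events")

# Freshness: the newest event should be recent enough to meet the near-real-time SLA.
# Assumes the Spark session timezone is UTC.
max_event_time = events.agg(F.max("event_time").alias("m")).collect()[0]["m"]
lag_minutes = (datetime.now(timezone.utc)
               - max_event_time.replace(tzinfo=timezone.utc)).total_seconds() / 60
assert lag_minutes < 15, f"Freshness SLA breached: latest event is {lag_minutes:.0f} min old"

# Completeness: yesterday's volume should sit within a tolerance band of the trailing
# 6-day average (assumes at least 8 days of history; row 0 is today's partial day).
daily = (
    events.groupBy("event_date")
    .count()
    .orderBy(F.col("event_date").desc())
    .limit(8)
    .collect()
)
yesterday = daily[1]["count"]
baseline = sum(r["count"] for r in daily[2:]) / len(daily[2:])
assert 0.7 * baseline <= yesterday <= 1.3 * baseline, "Completeness check failed: daily volume anomaly"
```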