System Design: End-to-End Data Platform for Product Analytics (Batch + Near-Real-Time)
Context
Design a scalable data platform for a large consumer product with web and mobile clients. The platform must power daily product analytics (e.g., DAU/MAU, retention, funnels, cohorts, experiments) and near-real-time dashboards (<5 minutes end-to-end) while supporting backfills and rigorous data quality.
Assume tens to hundreds of millions of daily events and multiple upstream systems (client telemetry, backend logs, relational OLTP for user/account, and third-party data). You may reference common technologies (e.g., Kafka, Flink/Spark, object store + lakehouse table format, a cloud data warehouse), but focus on design choices and trade-offs.
Requirements
- Ingestion sources and formats
  - Identify sources (client events, backend logs, CDC from OLTP, third-party feeds) and wire formats (JSON/Protobuf/Avro on the wire; Parquet/Delta/Hudi/Iceberg in storage). See the ingestion sketch after this list.
- Storage and compute architecture
  - Describe the messaging/streaming layer; the raw landing, staging, and modeled storage layers; and the batch/streaming compute engines.
- Schema design by layer
  - Define schemas for raw ("bronze"), deduped/cleaned ("silver"), and modeled analytics ("gold"). Include a canonical event envelope, dimensions (users/devices/products/experiments), and fact tables (events, sessions, conversions). See the envelope sketch after this list.
- Table partitioning and clustering
  - Propose partitioning and clustering/sorting for each major table to optimize scan cost and latency (DDL sketch below).
- Idempotency, deduplication, and late/out-of-order events
  - Specify unique keys, event-time vs ingestion-time, watermarking, allowed lateness, and how to reconcile late data into aggregates (dedup sketch below).
- Update patterns and history
  - State which layers are append-only vs upsert/merge. Explain SCD1 vs SCD2 for dimensions, identity resolution (anonymous → logged-in), and how you will run backfills safely (SCD2 sketch below).
- Orchestration, dependencies, and failure recovery
  - Describe scheduling, dependency management, retries, checkpointing, and exactly-once vs at-least-once guarantees (DAG sketch below).
- Aggregations for daily/hourly/rolling metrics
  - Define how to compute daily/hourly windows and rolling windows (e.g., 7/28-day active, retention, funnel steps) in both streaming and batch (rolling-window sketch below).
- Data quality and SLAs
  - Outline schema enforcement, validation tests, anomaly detection, freshness/completeness SLAs, and alerting (quality-check sketch below).
- Trade-offs
  - Discuss latency vs cost vs complexity; lambda vs kappa patterns; when to pre-aggregate vs compute on read; and real-time store choices.
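Illustrative sketches
The sketches below are non-normative examples that ground several of the requirements above. All broker addresses, topic names, table names, paths, and thresholds are hypothetical placeholders, not prescribed choices.

A minimal ingestion sketch, assuming Kafka as the transport and a Delta-format bronze table on object storage. The payload is kept opaque in bronze so that schema changes upstream do not break landing.

```python
# Land raw events from Kafka into a "bronze" Delta table with ingestion metadata.
# Broker, topic, and storage paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "product.events.v1")           # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

bronze = raw.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload_json"),   # keep the payload opaque in bronze
    F.col("topic"), F.col("partition"), F.col("offset"),   # provenance for replay/debugging
    F.col("timestamp").alias("kafka_timestamp"),
    F.current_timestamp().alias("ingested_at"),
    F.to_date(F.col("timestamp")).alias("ingest_date"),    # ingestion-time partition column
)

query = (
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "s3://lake/_checkpoints/bronze_events")  # placeholder path
    .partitionBy("ingest_date")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("s3://lake/bronze/events")                        # placeholder path
)
```

Design note: partitioning bronze by ingestion date (not event time) keeps landing idempotent and cheap; event-time organization is deferred to silver.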
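One possible canonical event envelope, expressed as a PySpark schema. Field names and types are illustrative assumptions to be negotiated with client and backend teams; the important properties are a client-generated `event_id` for deduplication and separate anonymous/user identifiers for later identity resolution.

```python
# Canonical event envelope as a PySpark StructType; field names are illustrative.
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, MapType, LongType
)

EVENT_ENVELOPE = StructType([
    StructField("event_id", StringType(), False),       # client-generated UUID; dedup/idempotency key
    StructField("event_name", StringType(), False),     # e.g. "page_view", "purchase"
    StructField("event_time", TimestampType(), False),  # event time stamped by the producer
    StructField("ingested_at", TimestampType(), True),  # stamped at ingestion
    StructField("schema_version", StringType(), True),  # envelope/contract version
    StructField("user_id", StringType(), True),         # null until the user logs in
    StructField("anonymous_id", StringType(), True),    # device/cookie-scoped identifier
    StructField("session_id", StringType(), True),
    StructField("device", MapType(StringType(), StringType()), True),   # os, model, app_version
    StructField("context", MapType(StringType(), StringType()), True),  # locale, geo, experiment assignments
    StructField("properties", StringType(), True),      # event-specific payload as a JSON string
    StructField("sequence_number", LongType(), True),   # per-session ordering hint
])
```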
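A hypothetical DDL sketch for the silver event fact table: partition by event date for pruning, and periodically co-locate data on the columns most used in filters and joins. `OPTIMIZE ... ZORDER BY` assumes an engine that supports it (e.g., Databricks or recent open-source Delta); other formats have analogous clustering/sorting mechanisms.

```python
# Partitioning plus periodic clustering for the silver fact table; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.fact_events (
        event_id    STRING,
        event_name  STRING,
        event_time  TIMESTAMP,
        event_date  DATE,
        user_id     STRING,
        session_id  STRING,
        properties  STRING
    )
    USING DELTA
    PARTITIONED BY (event_date)     -- daily partitions keep most queries to a few partitions
""")

# Maintenance job: compact small files and co-locate rows for common predicates.
spark.sql("OPTIMIZE silver.fact_events ZORDER BY (event_name, user_id)")
```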
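A dedup sketch in Structured Streaming, assuming the bronze table above: parse the payload, watermark on event time, and drop duplicate `event_id`s within a bounded lateness window. The 2-hour allowance is an assumption to tune against observed client delays; events later than the watermark would be reconciled by a periodic batch repair of the affected partitions.

```python
# Event-time dedup with bounded lateness; paths and the lateness bound are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

payload_schema = "event_id STRING, event_name STRING, event_time TIMESTAMP, user_id STRING, session_id STRING"

events = (
    spark.readStream.format("delta").load("s3://lake/bronze/events")       # placeholder path
    .withColumn("event", F.from_json("payload_json", payload_schema))
    .select("event.*", "ingested_at")
    .withColumn("event_date", F.to_date("event_time"))
    .withWatermark("event_time", "2 hours")              # allowed lateness; bounds dedup state
    .dropDuplicates(["event_id", "event_time"])          # include the watermark column so state can expire
)

(
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3://lake/_checkpoints/silver_events")  # placeholder path
    .partitionBy("event_date")
    .outputMode("append")
    .start("s3://lake/silver/fact_events")
)
```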
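A hedged SCD2 sketch for the user dimension using Delta MERGE: expire the current row when tracked attributes change, then append the new versions. Table, column, and attribute names (`plan_tier`, `country`, `effective_at`) are illustrative; identity resolution (anonymous → logged-in) is assumed to have been applied upstream when producing the `updates` set.

```python
# Two-step SCD2 update for a user dimension; all names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.table("staging.dim_user_updates")        # hypothetical staging table
dim = DeltaTable.forName(spark, "gold.dim_user")

# Step 1: expire current rows whose tracked attributes changed.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.user_id = s.user_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.plan_tier <> s.plan_tier OR t.country <> s.country",
        set={"is_current": "false", "valid_to": "s.effective_at"},
    )
    .execute()
)

# Step 2: append the new current versions (in practice, filter `updates`
# to changed or brand-new keys before this append).
new_rows = (
    updates
    .withColumn("valid_from", F.col("effective_at"))
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
)
new_rows.write.format("delta").mode("append").saveAsTable("gold.dim_user")
```

Because the merge is keyed on `user_id` and the current-row flag, re-running the job for the same `updates` batch is effectively idempotent, which is what makes backfills and retries safe at this layer.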
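An orchestration sketch using Airflow's TaskFlow API (assumes Airflow 2.x): a daily run with retries and an explicit bronze → silver → gold → quality-check chain. Task bodies are placeholders; in a real deployment they would submit Spark jobs or dbt runs, and `catchup=True` is what allows scheduler-driven backfills over historical dates.

```python
# Daily analytics pipeline DAG; schedule, retries, and task bodies are placeholders.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="0 2 * * *",                       # daily at 02:00 UTC, after upstream loads settle
    start_date=datetime(2024, 1, 1),
    catchup=True,                               # enables backfills for missed/historical dates
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
)
def product_analytics_daily():
    @task
    def load_silver(ds=None):
        # Placeholder: safe to retry because the silver load is an idempotent MERGE keyed on event_id.
        print(f"building silver for {ds}")

    @task
    def build_gold(ds=None):
        print(f"building gold aggregates for {ds}")

    @task
    def run_quality_checks(ds=None):
        print(f"validating freshness/completeness for {ds}")

    load_silver() >> build_gold() >> run_quality_checks()


product_analytics_daily()
```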
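A batch sketch for rolling 7/28-day active counts, assuming a hypothetical `gold.daily_active_users(user_id, activity_date)` table with one row per user per active day. The earliest dates will have incomplete windows, and the streaming path would typically maintain only the current day's partial counts while batch owns the history.

```python
# Rolling 7-day and 28-day active users from a daily activity table; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rolling = spark.sql("""
    WITH calendar AS (
        SELECT DISTINCT activity_date AS metric_date
        FROM gold.daily_active_users
    )
    SELECT
        c.metric_date,
        COUNT(DISTINCT CASE WHEN a.activity_date >= date_sub(c.metric_date, 6)
                            THEN a.user_id END)  AS active_7d,
        COUNT(DISTINCT a.user_id)                AS active_28d
    FROM calendar c
    JOIN gold.daily_active_users a
      ON a.activity_date BETWEEN date_sub(c.metric_date, 27) AND c.metric_date
    GROUP BY c.metric_date
""")

rolling.write.format("delta").mode("overwrite").saveAsTable("gold.rolling_active_users")
```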
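Finally, a lightweight freshness and completeness check run as a scheduled PySpark job; thresholds, table names, and the UTC session-timezone assumption are all placeholders, and a real deployment might use a framework such as Great Expectations or dbt tests instead.

```python
# Freshness and completeness checks over the silver fact table; thresholds are assumptions.
from datetime import datetime, timezone

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.table("silver.fact_events")

# Freshness: the newest event should be recent enough to meet the near-real-time SLA.
# Assumes the Spark session timezone is UTC.
max_event_time = events.agg(F.max("event_time").alias("m")).collect()[0]["m"]
lag_minutes = (datetime.now(timezone.utc)
               - max_event_time.replace(tzinfo=timezone.utc)).total_seconds() / 60
assert lag_minutes < 15, f"Freshness SLA breached: latest event is {lag_minutes:.0f} min old"

# Completeness: yesterday's volume should sit within a tolerance band of the trailing
# 6-day average (assumes at least 8 days of history; row 0 is today's partial day).
daily = (
    events.groupBy("event_date")
    .count()
    .orderBy(F.col("event_date").desc())
    .limit(8)
    .collect()
)
yesterday = daily[1]["count"]
baseline = sum(r["count"] for r in daily[2:]) / len(daily[2:])
assert 0.7 * baseline <= yesterday <= 1.3 * baseline, "Completeness check failed: daily volume anomaly"
```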