Design an end-to-end data platform that supports both daily batch processing and near-real-time streaming for product analytics. Specify the ingestion sources and formats; schema design for raw, staging, and modeled layers; table partitioning and clustering; strategies for idempotency, deduplication, and handling late/out-of-order events; update patterns (append-only vs upsert/merge), slowly changing dimensions (SCD1/SCD2), and backfills; orchestration, dependency management, and failure recovery; aggregation strategies for daily/hourly/rolling-window metrics; data quality checks and SLAs; and trade-offs between latency, cost, and complexity.

This question evaluates a data engineer's ability to design scalable batch and streaming ETL architectures, including competencies in ingestion patterns, messaging and storage layers, schema modeling, partitioning and clustering, deduplication and late-event handling, aggregation strategies, and data quality and SLA enforcement.

What difficulty level is this interview question?

This is a hard-difficulty System Design question, commonly asked during onsite rounds at Meta.

What role is this question designed for?

This question is commonly asked for Data Engineer candidates at Meta during technical interviews.

Design batch and streaming ETL architecture

System Design: End-to-End Data Platform for Product Analytics (Batch + Near-Real-Time)

Context

Design a scalable data platform for a large consumer product with web and mobile clients. The platform must power daily product analytics (e.g., DAU/MAU, retention, funnels, cohorts, experiments) and near-real-time dashboards (<5 minutes end-to-end) while supporting backfills and rigorous data quality.

Assume tens to hundreds of millions of daily events and multiple upstream systems (client telemetry, backend logs, relational OLTP for user/account, and third-party data). You may reference common technologies (e.g., Kafka, Flink/Spark, object store + lakehouse table format, a cloud data warehouse), but focus on design choices and trade-offs.

Requirements

Ingestion sources and formats

Identify sources (client events, backend logs, CDC from OLTP, third-party feeds) and data formats (JSON/Protobuf/Avro on the wire; Parquet files under a Delta/Hudi/Iceberg table format in storage).

Storage and compute architecture

Describe the messaging/streaming layer, raw landing, staging, and modeled layers, and the batch/streaming compute engines.

Schema design by layer

Define schemas for raw ("bronze"), deduped/cleaned ("silver"), and modeled analytics ("gold"). Include a canonical event envelope, dimensions (users/devices/products/experiments), and fact tables (events, sessions, conversions).
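As a starting point for the canonical event envelope, the sketch below shows one plausible shape. Every field name here (event_id, anonymous_id, schema_version, and so on) is an assumption made for illustration, not something the question prescribes; a real answer would pick fields to match the product's identity model and schema registry.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope shared by every event type (all names are illustrative)."""

    event_id: str            # client-generated unique id, reused later as the dedup key
    event_name: str          # e.g. "page_view", "purchase"
    event_time: datetime     # when the event happened on the client (timezone-aware, UTC)
    ingest_time: datetime    # when the collector received it (timezone-aware, UTC)
    anonymous_id: str        # device- or browser-scoped id, always present
    user_id: Optional[str]   # populated only once the user is logged in
    platform: str            # "web", "ios", "android"
    schema_version: int = 1
    properties: Dict[str, Any] = field(default_factory=dict)  # event-specific payload

    def to_json(self) -> str:
        """Serialize to JSON with ISO-8601 timestamps, as it might land in the raw layer."""
        record = asdict(self)
        record["event_time"] = self.event_time.isoformat()
        record["ingest_time"] = self.ingest_time.isoformat()
        return json.dumps(record, sort_keys=True)


example = EventEnvelope(
    event_id="e1",
    event_name="page_view",
    event_time=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    ingest_time=datetime(2024, 5, 1, 12, 0, 3, tzinfo=timezone.utc),
    anonymous_id="anon-1",
    user_id=None,
    platform="web",
    properties={"path": "/home"},
)
```

Carrying both event_time and ingest_time in the envelope is what later lets the pipeline distinguish late arrivals from on-time ones.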

Table partitioning and clustering

Propose partitioning and clustering/sorting for each major table to optimize scan cost and latency.
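One common option is Hive-style date (and optionally hour) partitions derived from a UTC timestamp. The sketch below assumes that layout; the dt= and hr= key names and the table paths are hypothetical, and clustering or sorting within a partition (for example by user id) would be configured in the table format rather than in code like this.

```python
from datetime import datetime, timedelta, timezone


def partition_path(table: str, ts: datetime, hourly: bool = False) -> str:
    """Build a Hive-style partition path from a timezone-aware timestamp.

    The timestamp is normalized to UTC first so that events from different
    client time zones land in the same daily partition.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts_utc = ts.astimezone(timezone.utc)
    path = f"{table}/dt={ts_utc:%Y-%m-%d}"
    if hourly:
        path += f"/hr={ts_utc:%H}"
    return path


# An event stamped 23:30 at UTC-2 belongs to the next UTC day.
client_ts = datetime(2024, 5, 1, 23, 30, tzinfo=timezone(timedelta(hours=-2)))
example_path = partition_path("silver/events", client_ts, hourly=True)
```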

Idempotency, deduplication, and late/out-of-order events

Specify unique keys, event-time vs ingestion-time, watermarking, allowed lateness, and how to reconcile late data into aggregates.
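To make the interaction between dedup keys, watermarks, and allowed lateness concrete, here is a toy in-memory sketch. It is not how Flink or Spark implement these features; the class name, the hourly window, and the "too late goes to a side list for batch reconciliation" policy are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import DefaultDict, List, Optional, Set


class HourlyCounter:
    """Toy event-time hourly counter with dedup, a watermark, and allowed lateness."""

    def __init__(self, out_of_orderness: timedelta, allowed_lateness: timedelta):
        self.out_of_orderness = out_of_orderness
        self.allowed_lateness = allowed_lateness
        self.max_event_time: Optional[datetime] = None
        self.seen_ids: Set[str] = set()  # a real system would bound this with a TTL
        self.counts: DefaultDict[datetime, int] = defaultdict(int)
        self.too_late: List[str] = []    # left for a batch job to reconcile

    @property
    def watermark(self) -> Optional[datetime]:
        """Event time before which no further events are expected."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.out_of_orderness

    def process(self, event_id: str, event_time: datetime) -> str:
        if event_id in self.seen_ids:
            return "duplicate"  # redelivered event: counting it again would break idempotency
        window_start = event_time.replace(minute=0, second=0, microsecond=0)
        window_end = window_start + timedelta(hours=1)
        wm = self.watermark
        if wm is not None and window_end + self.allowed_lateness <= wm:
            self.too_late.append(event_id)  # window is finalized; do not mutate it
            return "too_late"
        self.seen_ids.add(event_id)
        self.counts[window_start] += 1
        # "late" means the window had already fired once and must be re-emitted.
        status = "late" if wm is not None and window_end <= wm else "on_time"
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return status
```

In this sketch a "late" result corresponds to an updated (upserted) aggregate row downstream, while "too_late" events are only corrected when the batch layer recomputes the affected partition.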

Update patterns and history

State which layers are append-only vs upsert/merge. Explain SCD1 vs SCD2 for dimensions, identity resolution (anonymous → logged-in), and how you will run backfills safely.
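For the SCD2 part, the following sketch shows the close-and-append pattern on an in-memory list, which a warehouse would typically express as a MERGE statement. The UserDimRow fields (country, plan) and the choice to reject out-of-order changes are assumptions for illustration only.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class UserDimRow:
    """One version of a user in a hypothetical SCD2 dimension."""

    user_id: str
    country: str
    plan: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version

    @property
    def is_current(self) -> bool:
        return self.valid_to is None


def apply_scd2(history: List[UserDimRow], user_id: str, country: str,
               plan: str, changed_at: datetime) -> List[UserDimRow]:
    """Return a new history with the change applied as an SCD2 version."""
    current = next((r for r in history if r.user_id == user_id and r.is_current), None)
    if current is not None:
        if (current.country, current.plan) == (country, plan):
            return list(history)  # replaying the same change is a no-op (idempotent)
        if changed_at <= current.valid_from:
            # Assumption: out-of-order changes are fixed by a batch rebuild instead.
            raise ValueError("change is older than the current version")
    # Close the current version (if any) and append the new one.
    updated = [replace(r, valid_to=changed_at) if r is current else r for r in history]
    updated.append(UserDimRow(user_id, country, plan, valid_from=changed_at))
    return updated
```

An SCD1 variant would simply overwrite country and plan on the current row and keep no history, which is cheaper but loses the ability to join facts to the attributes that were valid at event time.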

Orchestration, dependencies, and failure recovery

Describe scheduling, dependency management, retries, checkpointing, and exactly-once/at-least-once guarantees.
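Retries are only safe when the retried task is idempotent. The sketch below pairs a generic exponential-backoff retry helper with a partition overwrite keyed by logical date, so a task that fails after writing can be rerun without duplicating rows. Both function names and the dict standing in for a table are hypothetical; an orchestrator such as Airflow provides its own retry mechanism.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 3,
                     base_delay_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run task, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the scheduler
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def overwrite_partition(table: Dict[str, List[dict]], logical_date: str,
                        rows: List[dict]) -> None:
    """Replace the whole partition for logical_date instead of appending to it.

    Rerunning the same logical date yields the same final state, which is what
    makes retries and backfills safe.
    """
    table[logical_date] = list(rows)
```

Passing sleep as a parameter keeps the helper testable without real delays.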

Aggregations for daily/hourly/rolling metrics

Define how to compute daily/hourly windows and rolling windows (e.g., 7/28-day active, retention, funnel steps), both in streaming and batch.
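Rolling active-user counts are not additive: summing seven daily DAU values double-counts anyone active on more than one day. A minimal batch sketch, assuming per-day sets of user ids are available (at scale these would usually be replaced by mergeable sketches such as HyperLogLog), looks like this. The function and variable names are illustrative.

```python
from datetime import date, timedelta
from typing import Dict, Set


def rolling_active_users(daily_users: Dict[date, Set[str]], as_of: date,
                         window_days: int) -> int:
    """Count distinct users active in the window_days ending on as_of (inclusive)."""
    active: Set[str] = set()
    for offset in range(window_days):
        # Union, not sum: a user active on several days counts once.
        active |= daily_users.get(as_of - timedelta(days=offset), set())
    return len(active)


daily = {
    date(2024, 5, 1): {"u1", "u2"},
    date(2024, 5, 2): {"u2", "u3"},
    date(2024, 5, 8): {"u1"},
}
wau = rolling_active_users(daily, as_of=date(2024, 5, 8), window_days=7)
```

With window_days=7 and as_of May 8, the window covers May 2 through May 8, so May 1 activity falls outside it.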

Data quality and SLAs

Outline schema enforcement, validation tests, anomaly detection, freshness/completeness SLAs, and alerting.
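The checks named above can be expressed as small pure functions that an orchestrator runs after each load. The sketch below covers freshness, null rate, and a naive volume check against a trailing average; the thresholds, the CheckResult type, and the function names are assumptions, and production systems often rely on dedicated data quality tooling instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Mapping, Sequence


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_freshness(latest_ingest_time: datetime, now: datetime,
                    max_lag: timedelta) -> CheckResult:
    """Freshness SLA: the newest row must be no older than max_lag."""
    lag = now - latest_ingest_time
    return CheckResult("freshness", lag <= max_lag, f"lag={lag}")


def check_null_rate(rows: Sequence[Mapping[str, object]], column: str,
                    max_null_rate: float) -> CheckResult:
    """Completeness: the share of rows with a missing value must stay under a threshold."""
    name = f"null_rate:{column}"
    if not rows:
        return CheckResult(name, False, "no rows")
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / len(rows)
    return CheckResult(name, rate <= max_null_rate, f"null_rate={rate:.3f}")


def check_volume(today_count: int, trailing_counts: Sequence[int],
                 tolerance: float) -> CheckResult:
    """Naive anomaly check: today's row count must be within tolerance of the trailing mean."""
    if not trailing_counts:
        return CheckResult("volume", False, "no baseline")
    baseline = sum(trailing_counts) / len(trailing_counts)
    low, high = baseline * (1 - tolerance), baseline * (1 + tolerance)
    return CheckResult("volume", low <= today_count <= high,
                       f"count={today_count}, expected {low:.0f}..{high:.0f}")
```

A failed CheckResult would typically block promotion of the partition to the modeled layer and trigger an alert, rather than silently publishing bad data.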

Trade-offs

Discuss latency vs cost vs complexity; lambda vs kappa patterns; when to pre-aggregate vs compute on read; and real-time store choices.
