How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Take-home Project rounds at Disney.

What role is this question designed for?

This question is commonly asked for Data Engineer candidates at Disney during technical interviews.

Design a Pipeline to Process a Billion Clickstream Events Per Day

Q: Design a Pipeline to Process a Billion Clickstream Events Per Day

This system design question tests a data engineer's ability to architect high-throughput streaming pipelines handling massive event volumes with strict latency and correctness requirements. It evaluates practical knowledge of distributed stream processing concepts including event-time semantics, watermarking, exactly-once delivery guarantees, and Lambda/Kappa architecture trade-offs.

You are designing the data infrastructure for a large consumer web/mobile product. The product emits a clickstream: every page view, tap, scroll, and interaction generates a telemetry event. At peak the product produces on the order of one billion clickstream events per day, and the business wants near-real-time dashboards (daily and hourly active users, funnel conversion, feature adoption) plus accurate daily aggregates for finance and analytics.

Design an end-to-end pipeline that ingests, processes, and serves these billion events per day without lagging — i.e., the processing layer must keep up with the incoming event rate so dashboards reflect reality within minutes, not hours. Cover ingestion, the streaming/batch processing layer, storage, and how aggregates are computed and served.

You should specifically address three operational realities of clickstream telemetry:

Lag and load: how you detect that the pipeline is falling behind and how you keep it caught up as traffic spikes.
Late-arriving data: a user's device goes offline and syncs hours (or days) later, so events arrive long after the time they actually occurred. How do you keep daily aggregates accurate despite this?
Duplicates / exactly-once: network retries and at-least-once delivery mean the same event can appear multiple times. How do you guarantee each event is counted exactly once in your metrics?

Constraints & Assumptions

~1 billion events/day average; assume peaks of roughly 3–5x the average rate (diurnal + launch spikes), so design for a sustained tens-of-thousands of events/sec and bursts above that.
Each event is a small JSON-ish record (user/device id, event type, event timestamp, session id, properties) — order of a few hundred bytes to ~1 KB.
Metrics consumers want: hourly freshness for real-time dashboards (a few minutes of end-to-end lag is acceptable), and accurate, finalized daily aggregates that downstream finance/reporting can trust.
Events carry an event-time (when it happened on the device) that can differ substantially from ingest-time (when the server received it).
Delivery from clients/collectors is at-least-once ; you cannot assume the network delivers each event exactly once.
Cloud-managed building blocks (a distributed log, a stream processor, object storage, a columnar warehouse/OLAP store) are available; you do not need to run bare-metal.

Clarifying Questions to Ask

How fresh must the real-time dashboards be (single-digit minutes vs. tens of minutes), and how "final" must the daily numbers be — can a day's total be revised after midnight, or must it be frozen?
How late can late-arriving data realistically be — minutes, hours, or up to several days — and is there a cutoff after which late events are simply dropped or routed to a correction job?
What is the deduplication key? Does every event carry a stable client-generated unique id, or do we have to synthesize one from (user, event-type, timestamp, properties)?
Which metrics are required (unique counts like DAU vs. additive counts like total events vs. funnels), since unique-count metrics constrain the storage/aggregation design more than additive ones?
What are the retention and compliance requirements (raw event retention, PII handling, deletion requests) that the storage tier must honor?
What is the acceptable cost envelope, and is the team optimizing for lowest latency, lowest cost, or operational simplicity?

What a Strong Answer Covers Premium

Follow-up Questions

Your streaming path produced a DAU number at 11:59pm; at 3am the next day a batch of device events from yesterday arrive. Walk through exactly how the day's DAU gets corrected and how downstream consumers learn the number changed.
Compare doing exactly-once in Flink (checkpoint barriers + two-phase-commit sinks) vs. Spark Structured Streaming (checkpointed offsets + idempotent/foreachBatch upserts). When would you pick each, and where can each still produce duplicates?
A single very-hot key (one viral piece of content, or a bot) is skewing one partition and causing lag on just that partition. How do you detect and fix this without re-partitioning the whole topic?
The business now wants accurate unique counts (DAU/MAU) across very high cardinality at low cost and low latency. How does that change your aggregation strategy (e.g., exact sets vs. HyperLogLog), and what accuracy trade-off do you accept?

You should specifically address three operational realities of clickstream telemetry:

Lag and load: how you detect that the pipeline is falling behind and how you keep it caught up as traffic spikes.
Late-arriving data: a user's device goes offline and syncs hours (or days) later, so events arrive long after the time they actually occurred. How do you keep daily aggregates accurate despite this?
Duplicates / exactly-once: network retries and at-least-once delivery mean the same event can appear multiple times. How do you guarantee each event is counted exactly once in your metrics?

Constraints & Assumptions

~1 billion events/day average; assume peaks of roughly 3–5x the average rate (diurnal + launch spikes), so design for a sustained tens-of-thousands of events/sec and bursts above that.
Each event is a small JSON-ish record (user/device id, event type, event timestamp, session id, properties) — order of a few hundred bytes to ~1 KB.
Metrics consumers want: hourly freshness for real-time dashboards (a few minutes of end-to-end lag is acceptable), and accurate, finalized daily aggregates that downstream finance/reporting can trust.
Events carry an event-time (when it happened on the device) that can differ substantially from ingest-time (when the server received it).
Delivery from clients/collectors is at-least-once ; you cannot assume the network delivers each event exactly once.
Cloud-managed building blocks (a distributed log, a stream processor, object storage, a columnar warehouse/OLAP store) are available; you do not need to run bare-metal.

Clarifying Questions to Ask

How fresh must the real-time dashboards be (single-digit minutes vs. tens of minutes), and how "final" must the daily numbers be — can a day's total be revised after midnight, or must it be frozen?
How late can late-arriving data realistically be — minutes, hours, or up to several days — and is there a cutoff after which late events are simply dropped or routed to a correction job?
What is the deduplication key? Does every event carry a stable client-generated unique id, or do we have to synthesize one from (user, event-type, timestamp, properties)?
Which metrics are required (unique counts like DAU vs. additive counts like total events vs. funnels), since unique-count metrics constrain the storage/aggregation design more than additive ones?
What are the retention and compliance requirements (raw event retention, PII handling, deletion requests) that the storage tier must honor?
What is the acceptable cost envelope, and is the team optimizing for lowest latency, lowest cost, or operational simplicity?

What a Strong Answer Covers Premium

Follow-up Questions

Your streaming path produced a DAU number at 11:59pm; at 3am the next day a batch of device events from yesterday arrive. Walk through exactly how the day's DAU gets corrected and how downstream consumers learn the number changed.
Compare doing exactly-once in Flink (checkpoint barriers + two-phase-commit sinks) vs. Spark Structured Streaming (checkpointed offsets + idempotent/foreachBatch upserts). When would you pick each, and where can each still produce duplicates?
A single very-hot key (one viral piece of content, or a bot) is skewing one partition and causing lag on just that partition. How do you detect and fix this without re-partitioning the whole topic?
The business now wants accurate unique counts (DAU/MAU) across very high cardinality at low cost and low latency. How does that change your aggregation strategy (e.g., exact sets vs. HyperLogLog), and what accuracy trade-off do you accept?

Design a Pipeline to Process a Billion Clickstream Events Per Day

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers Premium

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design a Pipeline to Process a Billion Clickstream Events Per Day

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers Premium

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP