Design a Pipeline to Process a Billion Clickstream Events Per Day
Company: Disney
Role: Data Engineer
Category: System Design
Difficulty: medium
Interview Round: Take-home Project
You are designing the data infrastructure for a large consumer web/mobile product. The product emits a **clickstream**: every page view, tap, scroll, and interaction generates a telemetry event. At peak the product produces on the order of **one billion clickstream events per day**, and the business wants near-real-time dashboards (daily and hourly active users, funnel conversion, feature adoption) plus accurate daily aggregates for finance and analytics.
Design an end-to-end pipeline that **ingests, processes, and serves** these billion events per day **without lagging** — i.e., the processing layer must keep up with the incoming event rate so dashboards reflect reality within minutes, not hours. Cover ingestion, the streaming/batch processing layer, storage, and how aggregates are computed and served.
You should specifically address three operational realities of clickstream telemetry:
1. **Lag and load:** how you detect that the pipeline is falling behind and how you keep it caught up as traffic spikes.
2. **Late-arriving data:** a user's device goes offline and syncs hours (or days) later, so events arrive long after the time they actually occurred. How do you keep daily aggregates accurate despite this?
3. **Duplicates / exactly-once:** network retries and at-least-once delivery mean the same event can appear multiple times. How do you guarantee each event is counted exactly once in your metrics?
### Constraints & Assumptions
- ~1 billion events/day average; assume peaks of roughly 3–5x the average rate (diurnal + launch spikes), so design for a sustained tens-of-thousands of events/sec and bursts above that.
- Each event is a small JSON-ish record (user/device id, event type, event timestamp, session id, properties) — order of a few hundred bytes to ~1 KB.
- Metrics consumers want: hourly freshness for real-time dashboards (a few minutes of end-to-end lag is acceptable), and **accurate, finalized daily aggregates** that downstream finance/reporting can trust.
- Events carry an **event-time** (when it happened on the device) that can differ substantially from **ingest-time** (when the server received it).
- Delivery from clients/collectors is **at-least-once**; you cannot assume the network delivers each event exactly once.
- Cloud-managed building blocks (a distributed log, a stream processor, object storage, a columnar warehouse/OLAP store) are available; you do not need to run bare-metal.
### Clarifying Questions to Ask
- How fresh must the real-time dashboards be (single-digit minutes vs. tens of minutes), and how "final" must the daily numbers be — can a day's total be revised after midnight, or must it be frozen?
- How late can late-arriving data realistically be — minutes, hours, or up to several days — and is there a cutoff after which late events are simply dropped or routed to a correction job?
- What is the deduplication key? Does every event carry a stable client-generated unique id, or do we have to synthesize one from (user, event-type, timestamp, properties)?
- Which metrics are required (unique counts like DAU vs. additive counts like total events vs. funnels), since unique-count metrics constrain the storage/aggregation design more than additive ones?
- What are the retention and compliance requirements (raw event retention, PII handling, deletion requests) that the storage tier must honor?
- What is the acceptable cost envelope, and is the team optimizing for lowest latency, lowest cost, or operational simplicity?
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- Your streaming path produced a DAU number at 11:59pm; at 3am the next day a batch of device events from yesterday arrive. Walk through exactly how the day's DAU gets corrected and how downstream consumers learn the number changed.
- Compare doing exactly-once in **Flink** (checkpoint barriers + two-phase-commit sinks) vs. **Spark Structured Streaming** (checkpointed offsets + idempotent/foreachBatch upserts). When would you pick each, and where can each still produce duplicates?
- A single very-hot key (one viral piece of content, or a bot) is skewing one partition and causing lag on just that partition. How do you detect and fix this without re-partitioning the whole topic?
- The business now wants accurate **unique** counts (DAU/MAU) across very high cardinality at low cost and low latency. How does that change your aggregation strategy (e.g., exact sets vs. HyperLogLog), and what accuracy trade-off do you accept?
Quick Answer: This system design question tests a data engineer's ability to architect high-throughput streaming pipelines handling massive event volumes with strict latency and correctness requirements. It evaluates practical knowledge of distributed stream processing concepts including event-time semantics, watermarking, exactly-once delivery guarantees, and Lambda/Kappa architecture trade-offs.