Design a clickstream ingestion and aggregation pipeline for billions of events per day
Company: Disney
Role: Data Engineer
Category: System Design
Difficulty: medium
Interview Round: Take-home Project
## Design a clickstream ingestion and aggregation pipeline for billions of events per day
You are designing the data platform for a large consumer streaming/media company. Client applications (web, mobile, smart TVs, set-top boxes) emit **clickstream / telemetry events** — page views, play/pause/seek actions, ad impressions, navigation clicks, and device heartbeats.
Design an end-to-end pipeline that **ingests on the order of a billion events per day** and produces **daily aggregates** (for example: daily active users, plays per title, ad impressions per campaign) that downstream analysts and dashboards consume — **without the pipeline falling behind (lagging) as traffic spikes**.
Your design should cover the path from the client SDK, through ingestion and stream/batch processing, to the serving/warehouse layer where aggregates are queried. Be explicit about how the system stays caught up under load and how it keeps daily aggregates correct.
```hint Decompose the path
Split the problem into clear stages — **collection** (client SDK + edge collector), **buffering** (a durable, partitioned log), **processing** (stream and/or batch), and **serving** (warehouse + a fast query layer). Reason about each stage's throughput and back-pressure independently.
```
```hint Back-of-envelope first
1 billion events/day is roughly **11.6k events/sec average**, but real traffic is spiky — size for a **peak of 5-10x average** (50-100k events/sec). Estimate event size (~1 KB) to get ingest bandwidth and daily storage, which justifies partitioning and tiered storage choices.
```
```hint The hard parts are correctness, not just throughput
The follow-ups (late data, duplicates, lag) are where strong answers separate. Think **event-time vs processing-time**, **watermarks + allowed lateness**, **idempotent/exactly-once writes**, and **autoscaling driven by consumer lag** — not just "add more machines."
```
### Constraints & Assumptions
- ~1 billion events/day average; design for a **peak of 50-100k events/sec** (5-10x average).
- Average event size ~0.5-1 KB JSON/Avro; ~0.5-1 TB/day of raw data.
- Daily aggregates should be available within a bounded **freshness SLA** (e.g., hourly rollups within minutes, the finalized daily aggregate within a small number of hours after midnight).
- Aggregates must be **accurate**: late-arriving and duplicate events must not corrupt counts.
- The system should degrade gracefully under spikes — buffer rather than drop events, and catch back up.
- Assume a cloud environment with managed object storage, a managed log/queue, and a columnar warehouse available.
### Clarifying Questions to Ask
- What is the freshness SLA — do consumers need near-real-time (seconds/minutes) rollups, or are hourly/daily batch aggregates sufficient?
- What is the accuracy requirement — are approximate counts acceptable (e.g., HyperLogLog for distinct users), or must daily aggregates be exact and reconcilable?
- What is the expected late-data window — minutes, hours, or can a device sync days later? And once finalized, can an aggregate be revised?
- What is the event schema and is it stable, or do we need schema evolution and a registry?
- Which aggregations and dimensions matter most (DAU, plays per title, ad impressions per campaign), and what query patterns / cardinality do downstream dashboards have?
- What are the cost and retention requirements for raw vs aggregated data?
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- **Late-arriving data**: A user's device goes offline and syncs telemetry hours (or a day) later. How do you keep the daily aggregate accurate? Walk through watermarks, allowed lateness, and what happens to events that arrive *after* the daily aggregate was finalized.
- **Deduplication / exactly-once**: Network retries produce duplicate events. What is your strategy for exactly-once processing in Flink or Spark? Contrast idempotent sinks/upserts with transactional (two-phase-commit) sinks, and explain where dedup state lives and how it's bounded.
- **Lag under spikes**: Traffic surges 10x during a live event. How do you detect the pipeline is falling behind, and what is your autoscaling and back-pressure strategy so it catches up without dropping data?
- **Reprocessing**: A bug in the aggregation logic shipped yesterday. How do you recompute corrected aggregates from the raw event store without disrupting live ingestion?
Quick Answer: This system design question tests a data engineer's ability to architect high-throughput, fault-tolerant data pipelines handling extreme event volumes. It evaluates practical knowledge of stream processing concepts such as event-time windowing, watermarks, backpressure, and exactly-once delivery semantics — core competencies for senior data engineering roles.