How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Take-home Project rounds at Disney.

What role is this question designed for?

This question is commonly asked for Data Engineer candidates at Disney during technical interviews.

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Q: Design a clickstream ingestion and aggregation pipeline for billions of events per day

This system design question tests a data engineer's ability to architect high-throughput, fault-tolerant data pipelines handling extreme event volumes. It evaluates practical knowledge of stream processing concepts such as event-time windowing, watermarks, backpressure, and exactly-once delivery semantics — core competencies for senior data engineering roles.

Design a clickstream ingestion and aggregation pipeline for billions of events per day

You are designing the data platform for a large consumer streaming/media company. Client applications (web, mobile, smart TVs, set-top boxes) emit clickstream / telemetry events — page views, play/pause/seek actions, ad impressions, navigation clicks, and device heartbeats.

Design an end-to-end pipeline that ingests on the order of a billion events per day and produces daily aggregates (for example: daily active users, plays per title, ad impressions per campaign) that downstream analysts and dashboards consume — without the pipeline falling behind (lagging) as traffic spikes.

Your design should cover the path from the client SDK, through ingestion and stream/batch processing, to the serving/warehouse layer where aggregates are queried. Be explicit about how the system stays caught up under load and how it keeps daily aggregates correct.

Constraints & Assumptions

~1 billion events/day average; design for a peak of 50-100k events/sec (5-10x average).
Average event size ~0.5-1 KB JSON/Avro; ~0.5-1 TB/day of raw data.
Daily aggregates should be available within a bounded freshness SLA (e.g., hourly rollups within minutes, the finalized daily aggregate within a small number of hours after midnight).
Aggregates must be accurate : late-arriving and duplicate events must not corrupt counts.
The system should degrade gracefully under spikes — buffer rather than drop events, and catch back up.
Assume a cloud environment with managed object storage, a managed log/queue, and a columnar warehouse available.

Clarifying Questions to Ask

What is the freshness SLA — do consumers need near-real-time (seconds/minutes) rollups, or are hourly/daily batch aggregates sufficient?
What is the accuracy requirement — are approximate counts acceptable (e.g., HyperLogLog for distinct users), or must daily aggregates be exact and reconcilable?
What is the expected late-data window — minutes, hours, or can a device sync days later? And once finalized, can an aggregate be revised?
What is the event schema and is it stable, or do we need schema evolution and a registry?
Which aggregations and dimensions matter most (DAU, plays per title, ad impressions per campaign), and what query patterns / cardinality do downstream dashboards have?
What are the cost and retention requirements for raw vs aggregated data?

What a Strong Answer Covers Premium

Follow-up Questions

Late-arriving data : A user's device goes offline and syncs telemetry hours (or a day) later. How do you keep the daily aggregate accurate? Walk through watermarks, allowed lateness, and what happens to events that arrive after the daily aggregate was finalized.
Deduplication / exactly-once : Network retries produce duplicate events. What is your strategy for exactly-once processing in Flink or Spark? Contrast idempotent sinks/upserts with transactional (two-phase-commit) sinks, and explain where dedup state lives and how it's bounded.
Lag under spikes : Traffic surges 10x during a live event. How do you detect the pipeline is falling behind, and what is your autoscaling and back-pressure strategy so it catches up without dropping data?
Reprocessing : A bug in the aggregation logic shipped yesterday. How do you recompute corrected aggregates from the raw event store without disrupting live ingestion?

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Constraints & Assumptions

~1 billion events/day average; design for a peak of 50-100k events/sec (5-10x average).
Average event size ~0.5-1 KB JSON/Avro; ~0.5-1 TB/day of raw data.
Daily aggregates should be available within a bounded freshness SLA (e.g., hourly rollups within minutes, the finalized daily aggregate within a small number of hours after midnight).
Aggregates must be accurate : late-arriving and duplicate events must not corrupt counts.
The system should degrade gracefully under spikes — buffer rather than drop events, and catch back up.
Assume a cloud environment with managed object storage, a managed log/queue, and a columnar warehouse available.

Clarifying Questions to Ask

What is the freshness SLA — do consumers need near-real-time (seconds/minutes) rollups, or are hourly/daily batch aggregates sufficient?
What is the accuracy requirement — are approximate counts acceptable (e.g., HyperLogLog for distinct users), or must daily aggregates be exact and reconcilable?
What is the expected late-data window — minutes, hours, or can a device sync days later? And once finalized, can an aggregate be revised?
What is the event schema and is it stable, or do we need schema evolution and a registry?
Which aggregations and dimensions matter most (DAU, plays per title, ad impressions per campaign), and what query patterns / cardinality do downstream dashboards have?
What are the cost and retention requirements for raw vs aggregated data?

What a Strong Answer Covers Premium

Follow-up Questions

Late-arriving data : A user's device goes offline and syncs telemetry hours (or a day) later. How do you keep the daily aggregate accurate? Walk through watermarks, allowed lateness, and what happens to events that arrive after the daily aggregate was finalized.
Deduplication / exactly-once : Network retries produce duplicate events. What is your strategy for exactly-once processing in Flink or Spark? Contrast idempotent sinks/upserts with transactional (two-phase-commit) sinks, and explain where dedup state lives and how it's bounded.
Lag under spikes : Traffic surges 10x during a live event. How do you detect the pipeline is falling behind, and what is your autoscaling and back-pressure strategy so it catches up without dropping data?
Reprocessing : A bug in the aggregation logic shipped yesterday. How do you recompute corrected aggregates from the raw event store without disrupting live ingestion?

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Quick Overview

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers Premium

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Quick Overview

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers Premium

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP