PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Disney

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Last updated: Jun 24, 2026

Quick Overview

This system design question tests a data engineer's ability to architect high-throughput, fault-tolerant data pipelines handling extreme event volumes. It evaluates practical knowledge of stream processing concepts such as event-time windowing, watermarks, backpressure, and exactly-once delivery semantics — core competencies for senior data engineering roles.

  • medium
  • Disney
  • System Design
  • Data Engineer

Design a clickstream ingestion and aggregation pipeline for billions of events per day

Company: Disney

Role: Data Engineer

Category: System Design

Difficulty: medium

Interview Round: Take-home Project

## Design a clickstream ingestion and aggregation pipeline for billions of events per day You are designing the data platform for a large consumer streaming/media company. Client applications (web, mobile, smart TVs, set-top boxes) emit **clickstream / telemetry events** — page views, play/pause/seek actions, ad impressions, navigation clicks, and device heartbeats. Design an end-to-end pipeline that **ingests on the order of a billion events per day** and produces **daily aggregates** (for example: daily active users, plays per title, ad impressions per campaign) that downstream analysts and dashboards consume — **without the pipeline falling behind (lagging) as traffic spikes**. Your design should cover the path from the client SDK, through ingestion and stream/batch processing, to the serving/warehouse layer where aggregates are queried. Be explicit about how the system stays caught up under load and how it keeps daily aggregates correct. ```hint Decompose the path Split the problem into clear stages — **collection** (client SDK + edge collector), **buffering** (a durable, partitioned log), **processing** (stream and/or batch), and **serving** (warehouse + a fast query layer). Reason about each stage's throughput and back-pressure independently. ``` ```hint Back-of-envelope first 1 billion events/day is roughly **11.6k events/sec average**, but real traffic is spiky — size for a **peak of 5-10x average** (50-100k events/sec). Estimate event size (~1 KB) to get ingest bandwidth and daily storage, which justifies partitioning and tiered storage choices. ``` ```hint The hard parts are correctness, not just throughput The follow-ups (late data, duplicates, lag) are where strong answers separate. Think **event-time vs processing-time**, **watermarks + allowed lateness**, **idempotent/exactly-once writes**, and **autoscaling driven by consumer lag** — not just "add more machines." ``` ### Constraints & Assumptions - ~1 billion events/day average; design for a **peak of 50-100k events/sec** (5-10x average). - Average event size ~0.5-1 KB JSON/Avro; ~0.5-1 TB/day of raw data. - Daily aggregates should be available within a bounded **freshness SLA** (e.g., hourly rollups within minutes, the finalized daily aggregate within a small number of hours after midnight). - Aggregates must be **accurate**: late-arriving and duplicate events must not corrupt counts. - The system should degrade gracefully under spikes — buffer rather than drop events, and catch back up. - Assume a cloud environment with managed object storage, a managed log/queue, and a columnar warehouse available. ### Clarifying Questions to Ask - What is the freshness SLA — do consumers need near-real-time (seconds/minutes) rollups, or are hourly/daily batch aggregates sufficient? - What is the accuracy requirement — are approximate counts acceptable (e.g., HyperLogLog for distinct users), or must daily aggregates be exact and reconcilable? - What is the expected late-data window — minutes, hours, or can a device sync days later? And once finalized, can an aggregate be revised? - What is the event schema and is it stable, or do we need schema evolution and a registry? - Which aggregations and dimensions matter most (DAU, plays per title, ad impressions per campaign), and what query patterns / cardinality do downstream dashboards have? - What are the cost and retention requirements for raw vs aggregated data? ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - **Late-arriving data**: A user's device goes offline and syncs telemetry hours (or a day) later. How do you keep the daily aggregate accurate? Walk through watermarks, allowed lateness, and what happens to events that arrive *after* the daily aggregate was finalized. - **Deduplication / exactly-once**: Network retries produce duplicate events. What is your strategy for exactly-once processing in Flink or Spark? Contrast idempotent sinks/upserts with transactional (two-phase-commit) sinks, and explain where dedup state lives and how it's bounded. - **Lag under spikes**: Traffic surges 10x during a live event. How do you detect the pipeline is falling behind, and what is your autoscaling and back-pressure strategy so it catches up without dropping data? - **Reprocessing**: A bug in the aggregation logic shipped yesterday. How do you recompute corrected aggregates from the raw event store without disrupting live ingestion?

Quick Answer: This system design question tests a data engineer's ability to architect high-throughput, fault-tolerant data pipelines handling extreme event volumes. It evaluates practical knowledge of stream processing concepts such as event-time windowing, watermarks, backpressure, and exactly-once delivery semantics — core competencies for senior data engineering roles.

Related Interview Questions

  • Design a Pipeline to Process a Billion Clickstream Events Per Day - Disney (medium)
  • Design a Twitter-like microblogging platform - Disney (medium)
  • Design a frontend shopping cart and feature flags - Disney (easy)
  • Design a global multi-game leaderboard - Disney (hard)
  • Design e-commerce browse/search - Disney (hard)
Disney logo
Disney
Mar 27, 2026, 12:00 AM
Data Engineer
Take-home Project
System Design
0
0

Design a clickstream ingestion and aggregation pipeline for billions of events per day

You are designing the data platform for a large consumer streaming/media company. Client applications (web, mobile, smart TVs, set-top boxes) emit clickstream / telemetry events — page views, play/pause/seek actions, ad impressions, navigation clicks, and device heartbeats.

Design an end-to-end pipeline that ingests on the order of a billion events per day and produces daily aggregates (for example: daily active users, plays per title, ad impressions per campaign) that downstream analysts and dashboards consume — without the pipeline falling behind (lagging) as traffic spikes.

Your design should cover the path from the client SDK, through ingestion and stream/batch processing, to the serving/warehouse layer where aggregates are queried. Be explicit about how the system stays caught up under load and how it keeps daily aggregates correct.

Constraints & Assumptions

  • ~1 billion events/day average; design for a peak of 50-100k events/sec (5-10x average).
  • Average event size ~0.5-1 KB JSON/Avro; ~0.5-1 TB/day of raw data.
  • Daily aggregates should be available within a bounded freshness SLA (e.g., hourly rollups within minutes, the finalized daily aggregate within a small number of hours after midnight).
  • Aggregates must be accurate : late-arriving and duplicate events must not corrupt counts.
  • The system should degrade gracefully under spikes — buffer rather than drop events, and catch back up.
  • Assume a cloud environment with managed object storage, a managed log/queue, and a columnar warehouse available.

Clarifying Questions to Ask

  • What is the freshness SLA — do consumers need near-real-time (seconds/minutes) rollups, or are hourly/daily batch aggregates sufficient?
  • What is the accuracy requirement — are approximate counts acceptable (e.g., HyperLogLog for distinct users), or must daily aggregates be exact and reconcilable?
  • What is the expected late-data window — minutes, hours, or can a device sync days later? And once finalized, can an aggregate be revised?
  • What is the event schema and is it stable, or do we need schema evolution and a registry?
  • Which aggregations and dimensions matter most (DAU, plays per title, ad impressions per campaign), and what query patterns / cardinality do downstream dashboards have?
  • What are the cost and retention requirements for raw vs aggregated data?

What a Strong Answer Covers Premium

Follow-up Questions

  • Late-arriving data : A user's device goes offline and syncs telemetry hours (or a day) later. How do you keep the daily aggregate accurate? Walk through watermarks, allowed lateness, and what happens to events that arrive after the daily aggregate was finalized.
  • Deduplication / exactly-once : Network retries produce duplicate events. What is your strategy for exactly-once processing in Flink or Spark? Contrast idempotent sinks/upserts with transactional (two-phase-commit) sinks, and explain where dedup state lives and how it's bounded.
  • Lag under spikes : Traffic surges 10x during a live event. How do you detect the pipeline is falling behind, and what is your autoscaling and back-pressure strategy so it catches up without dropping data?
  • Reprocessing : A bug in the aggregation logic shipped yesterday. How do you recompute corrected aggregates from the raw event store without disrupting live ingestion?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Disney•More Data Engineer•Disney Data Engineer•Disney System Design•Data Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.