PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Disney

Design a Pipeline to Process a Billion Clickstream Events Per Day

Last updated: Jun 24, 2026

Quick Overview

This system design question tests a data engineer's ability to architect high-throughput streaming pipelines handling massive event volumes with strict latency and correctness requirements. It evaluates practical knowledge of distributed stream processing concepts including event-time semantics, watermarking, exactly-once delivery guarantees, and Lambda/Kappa architecture trade-offs.

  • medium
  • Disney
  • System Design
  • Data Engineer

Design a Pipeline to Process a Billion Clickstream Events Per Day

Company: Disney

Role: Data Engineer

Category: System Design

Difficulty: medium

Interview Round: Take-home Project

You are designing the data infrastructure for a large consumer web/mobile product. The product emits a **clickstream**: every page view, tap, scroll, and interaction generates a telemetry event. At peak the product produces on the order of **one billion clickstream events per day**, and the business wants near-real-time dashboards (daily and hourly active users, funnel conversion, feature adoption) plus accurate daily aggregates for finance and analytics. Design an end-to-end pipeline that **ingests, processes, and serves** these billion events per day **without lagging** — i.e., the processing layer must keep up with the incoming event rate so dashboards reflect reality within minutes, not hours. Cover ingestion, the streaming/batch processing layer, storage, and how aggregates are computed and served. You should specifically address three operational realities of clickstream telemetry: 1. **Lag and load:** how you detect that the pipeline is falling behind and how you keep it caught up as traffic spikes. 2. **Late-arriving data:** a user's device goes offline and syncs hours (or days) later, so events arrive long after the time they actually occurred. How do you keep daily aggregates accurate despite this? 3. **Duplicates / exactly-once:** network retries and at-least-once delivery mean the same event can appear multiple times. How do you guarantee each event is counted exactly once in your metrics? ### Constraints & Assumptions - ~1 billion events/day average; assume peaks of roughly 3–5x the average rate (diurnal + launch spikes), so design for a sustained tens-of-thousands of events/sec and bursts above that. - Each event is a small JSON-ish record (user/device id, event type, event timestamp, session id, properties) — order of a few hundred bytes to ~1 KB. - Metrics consumers want: hourly freshness for real-time dashboards (a few minutes of end-to-end lag is acceptable), and **accurate, finalized daily aggregates** that downstream finance/reporting can trust. - Events carry an **event-time** (when it happened on the device) that can differ substantially from **ingest-time** (when the server received it). - Delivery from clients/collectors is **at-least-once**; you cannot assume the network delivers each event exactly once. - Cloud-managed building blocks (a distributed log, a stream processor, object storage, a columnar warehouse/OLAP store) are available; you do not need to run bare-metal. ### Clarifying Questions to Ask - How fresh must the real-time dashboards be (single-digit minutes vs. tens of minutes), and how "final" must the daily numbers be — can a day's total be revised after midnight, or must it be frozen? - How late can late-arriving data realistically be — minutes, hours, or up to several days — and is there a cutoff after which late events are simply dropped or routed to a correction job? - What is the deduplication key? Does every event carry a stable client-generated unique id, or do we have to synthesize one from (user, event-type, timestamp, properties)? - Which metrics are required (unique counts like DAU vs. additive counts like total events vs. funnels), since unique-count metrics constrain the storage/aggregation design more than additive ones? - What are the retention and compliance requirements (raw event retention, PII handling, deletion requests) that the storage tier must honor? - What is the acceptable cost envelope, and is the team optimizing for lowest latency, lowest cost, or operational simplicity? ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - Your streaming path produced a DAU number at 11:59pm; at 3am the next day a batch of device events from yesterday arrive. Walk through exactly how the day's DAU gets corrected and how downstream consumers learn the number changed. - Compare doing exactly-once in **Flink** (checkpoint barriers + two-phase-commit sinks) vs. **Spark Structured Streaming** (checkpointed offsets + idempotent/foreachBatch upserts). When would you pick each, and where can each still produce duplicates? - A single very-hot key (one viral piece of content, or a bot) is skewing one partition and causing lag on just that partition. How do you detect and fix this without re-partitioning the whole topic? - The business now wants accurate **unique** counts (DAU/MAU) across very high cardinality at low cost and low latency. How does that change your aggregation strategy (e.g., exact sets vs. HyperLogLog), and what accuracy trade-off do you accept?

Quick Answer: This system design question tests a data engineer's ability to architect high-throughput streaming pipelines handling massive event volumes with strict latency and correctness requirements. It evaluates practical knowledge of distributed stream processing concepts including event-time semantics, watermarking, exactly-once delivery guarantees, and Lambda/Kappa architecture trade-offs.

Related Interview Questions

  • Design a clickstream ingestion and aggregation pipeline for billions of events per day - Disney (medium)
  • Design a Twitter-like microblogging platform - Disney (medium)
  • Design a frontend shopping cart and feature flags - Disney (easy)
  • Design a global multi-game leaderboard - Disney (hard)
  • Design e-commerce browse/search - Disney (hard)
Disney logo
Disney
Feb 20, 2026, 12:00 AM
Data Engineer
Take-home Project
System Design
0
0

You are designing the data infrastructure for a large consumer web/mobile product. The product emits a clickstream: every page view, tap, scroll, and interaction generates a telemetry event. At peak the product produces on the order of one billion clickstream events per day, and the business wants near-real-time dashboards (daily and hourly active users, funnel conversion, feature adoption) plus accurate daily aggregates for finance and analytics.

Design an end-to-end pipeline that ingests, processes, and serves these billion events per day without lagging — i.e., the processing layer must keep up with the incoming event rate so dashboards reflect reality within minutes, not hours. Cover ingestion, the streaming/batch processing layer, storage, and how aggregates are computed and served.

You should specifically address three operational realities of clickstream telemetry:

  1. Lag and load: how you detect that the pipeline is falling behind and how you keep it caught up as traffic spikes.
  2. Late-arriving data: a user's device goes offline and syncs hours (or days) later, so events arrive long after the time they actually occurred. How do you keep daily aggregates accurate despite this?
  3. Duplicates / exactly-once: network retries and at-least-once delivery mean the same event can appear multiple times. How do you guarantee each event is counted exactly once in your metrics?

Constraints & Assumptions

  • ~1 billion events/day average; assume peaks of roughly 3–5x the average rate (diurnal + launch spikes), so design for a sustained tens-of-thousands of events/sec and bursts above that.
  • Each event is a small JSON-ish record (user/device id, event type, event timestamp, session id, properties) — order of a few hundred bytes to ~1 KB.
  • Metrics consumers want: hourly freshness for real-time dashboards (a few minutes of end-to-end lag is acceptable), and accurate, finalized daily aggregates that downstream finance/reporting can trust.
  • Events carry an event-time (when it happened on the device) that can differ substantially from ingest-time (when the server received it).
  • Delivery from clients/collectors is at-least-once ; you cannot assume the network delivers each event exactly once.
  • Cloud-managed building blocks (a distributed log, a stream processor, object storage, a columnar warehouse/OLAP store) are available; you do not need to run bare-metal.

Clarifying Questions to Ask

  • How fresh must the real-time dashboards be (single-digit minutes vs. tens of minutes), and how "final" must the daily numbers be — can a day's total be revised after midnight, or must it be frozen?
  • How late can late-arriving data realistically be — minutes, hours, or up to several days — and is there a cutoff after which late events are simply dropped or routed to a correction job?
  • What is the deduplication key? Does every event carry a stable client-generated unique id, or do we have to synthesize one from (user, event-type, timestamp, properties)?
  • Which metrics are required (unique counts like DAU vs. additive counts like total events vs. funnels), since unique-count metrics constrain the storage/aggregation design more than additive ones?
  • What are the retention and compliance requirements (raw event retention, PII handling, deletion requests) that the storage tier must honor?
  • What is the acceptable cost envelope, and is the team optimizing for lowest latency, lowest cost, or operational simplicity?

What a Strong Answer Covers Premium

Follow-up Questions

  • Your streaming path produced a DAU number at 11:59pm; at 3am the next day a batch of device events from yesterday arrive. Walk through exactly how the day's DAU gets corrected and how downstream consumers learn the number changed.
  • Compare doing exactly-once in Flink (checkpoint barriers + two-phase-commit sinks) vs. Spark Structured Streaming (checkpointed offsets + idempotent/foreachBatch upserts). When would you pick each, and where can each still produce duplicates?
  • A single very-hot key (one viral piece of content, or a bot) is skewing one partition and causing lag on just that partition. How do you detect and fix this without re-partitioning the whole topic?
  • The business now wants accurate unique counts (DAU/MAU) across very high cardinality at low cost and low latency. How does that change your aggregation strategy (e.g., exact sets vs. HyperLogLog), and what accuracy trade-off do you accept?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Disney•More Data Engineer•Disney Data Engineer•Disney System Design•Data Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.