PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Rippling

Design a User Behavior Tracking (Clickstream Analytics) System

Last updated: Jul 1, 2026

Quick Overview

This question asks for the design of an end-to-end user behavior tracking (clickstream analytics) system, covering event collection, transport, processing, storage, and querying. It evaluates system design skills around high-throughput data pipelines, including trade-offs between streaming and batch processing, storage models for aggregate versus per-user access patterns, and delivery guarantees under unreliable clients.

  • medium
  • Rippling
  • System Design
  • Software Engineer

Design a User Behavior Tracking (Clickstream Analytics) System

Company: Rippling

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design a **user behavior tracking system** for a web and mobile product. The system captures user interaction events, transports them reliably, processes them, stores them, and lets product/analytics teams answer questions such as: - "How many users completed checkout yesterday?" - "What is the funnel drop-off between the *add to cart* step and the *payment* step?" - "Show me the ordered sequence of actions a specific user took in the last hour." Client applications (a JavaScript web SDK, plus iOS and Android SDKs) emit events such as `page_view`, `button_click`, `add_to_cart`, and `purchase`. Each event carries at least: an event name, a user id (or an anonymous device id for logged-out users), a timestamp, a session id, and a free-form JSON property bag (e.g. `{ "page": "/cart", "item_id": "abc", "price": 19.99 }`). Design the **end-to-end** system: how events are collected from clients, transported, validated/enriched, stored, and queried, plus how it scales, stays reliable, and is observed. Walk through both the **write path** (event ingestion) and the **read path** (powering dashboards and ad-hoc queries). ```hint Where to start Treat this as a high-throughput, append-heavy **data pipeline**, not a CRUD app. Separate the four stages — **collect → transport → process → store/serve** — and design each independently. A durable log (e.g. Kafka/Kinesis) in the middle decouples bursty client ingestion from slower downstream consumers. ``` ```hint Two read patterns, two stores Notice the queries split into two shapes: (1) **aggregate analytics** over huge time ranges (counts, funnels) and (2) **per-user event lookups**. These have different access patterns — favor a **columnar/OLAP store** (e.g. ClickHouse, Druid, BigQuery) for aggregates and a **wide-row key-value store** keyed by user id for per-user timelines. Don't try to serve both from one schema. ``` ```hint Delivery semantics Clients are unreliable (flaky mobile networks, app kills). Decide explicitly between at-most-once vs at-least-once delivery. At-least-once + a client-generated **idempotency/event id** for downstream dedup is the usual sweet spot — exactly-once end-to-end is expensive and rarely required for analytics. ``` ### Constraints & Assumptions - **Scale (assumed):** ~50M daily active users, ~100 events/user/day ⇒ ~5B events/day ≈ **~58K events/sec average**, with diurnal/peak bursts up to **~300K events/sec**. - **Event size:** ~1 KB average ⇒ ~5 TB/day of raw event data. - **Retention:** raw events kept ~30–90 days; rolled-up aggregates kept for years. - **Latency targets:** event ingestion accepts in <100 ms p99 at the client edge; dashboards/aggregates can be **near-real-time** (fresh within seconds to a couple of minutes); ad-hoc analytical queries may take seconds to minutes. - **Reliability:** ingestion availability ~99.9%; **at-least-once** delivery is acceptable as long as duplicates can be removed for accurate counts; small, bounded event loss under extreme failure is tolerable but should be measured. - **Consistency:** analytics is **eventually consistent** — exact real-time accuracy is not required, but counts must converge to correct values. - Out of scope unless raised: experimentation/AB assignment logic, billing, and GDPR data-subject deletion mechanics (acknowledge them, don't fully design them). ### Clarifying Questions to Ask - **Freshness vs. cost:** Do dashboards need sub-minute freshness (streaming aggregation) or is hourly/daily batch roll-up acceptable? This drives the entire processing layer. - **Query surface:** Are queries restricted to a fixed set of pre-defined metrics/funnels, or must we support arbitrary ad-hoc SQL-like exploration over raw events? - **Identity:** How are anonymous device ids stitched to user ids after login (identity resolution), and do we need to retroactively re-attribute pre-login events? - **Delivery guarantees:** Is occasional event loss acceptable, or must every event be durably captured? How accurate must counts be (exact vs. approximate, e.g. HyperLogLog)? - **Schema governance:** Is the event property schema free-form forever, or is there a registry/contract so analysts can trust field types? - **Privacy/compliance:** What PII may events contain, what's the retention/deletion policy, and do we need regional data residency? ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - How do you compute a **funnel conversion rate** (e.g. `view → add_to_cart → purchase` within 24 hours per user) efficiently over billions of events? Compare a streaming approach against a daily batch job. - A mobile client is offline for two hours, then flushes a backlog of events with old timestamps. How does your processing layer produce **correct time-bucketed aggregates** despite this late, out-of-order data? - A buggy app release suddenly doubles event volume and starts sending malformed payloads. Walk through how the system **absorbs the load and isolates the bad data** without corrupting downstream metrics or dropping good events. - How would you support **arbitrary ad-hoc queries** from analysts without letting an expensive query take down the cluster serving production dashboards?

Quick Answer: This question asks for the design of an end-to-end user behavior tracking (clickstream analytics) system, covering event collection, transport, processing, storage, and querying. It evaluates system design skills around high-throughput data pipelines, including trade-offs between streaming and batch processing, storage models for aggregate versus per-user access patterns, and delivery guarantees under unreliable clients.

Related Interview Questions

  • Design a News Aggregation System (Google News-style) - Rippling (medium)
  • Prevent Duplicate Payments Under High Load - Rippling
  • Design a personalized news aggregator - Rippling (medium)
  • Design a Scalable News Feed - Rippling (medium)
  • Design Scalable Expense Violation Processing - Rippling (hard)
|Home/System Design/Rippling

Design a User Behavior Tracking (Clickstream Analytics) System

Rippling logo
Rippling
Jun 23, 2026, 12:00 AM
mediumSoftware EngineerTechnical ScreenSystem Design
0
0

Design a user behavior tracking system for a web and mobile product. The system captures user interaction events, transports them reliably, processes them, stores them, and lets product/analytics teams answer questions such as:

  • "How many users completed checkout yesterday?"
  • "What is the funnel drop-off between the add to cart step and the payment step?"
  • "Show me the ordered sequence of actions a specific user took in the last hour."

Client applications (a JavaScript web SDK, plus iOS and Android SDKs) emit events such as page_view, button_click, add_to_cart, and purchase. Each event carries at least: an event name, a user id (or an anonymous device id for logged-out users), a timestamp, a session id, and a free-form JSON property bag (e.g. { "page": "/cart", "item_id": "abc", "price": 19.99 }).

Design the end-to-end system: how events are collected from clients, transported, validated/enriched, stored, and queried, plus how it scales, stays reliable, and is observed. Walk through both the write path (event ingestion) and the read path (powering dashboards and ad-hoc queries).

Constraints & Assumptions

  • Scale (assumed): ~50M daily active users, ~100 events/user/day ⇒ ~5B events/day ≈ ~58K events/sec average , with diurnal/peak bursts up to ~300K events/sec .
  • Event size: ~1 KB average ⇒ ~5 TB/day of raw event data.
  • Retention: raw events kept ~30–90 days; rolled-up aggregates kept for years.
  • Latency targets: event ingestion accepts in <100 ms p99 at the client edge; dashboards/aggregates can be near-real-time (fresh within seconds to a couple of minutes); ad-hoc analytical queries may take seconds to minutes.
  • Reliability: ingestion availability ~99.9%; at-least-once delivery is acceptable as long as duplicates can be removed for accurate counts; small, bounded event loss under extreme failure is tolerable but should be measured.
  • Consistency: analytics is eventually consistent — exact real-time accuracy is not required, but counts must converge to correct values.
  • Out of scope unless raised: experimentation/AB assignment logic, billing, and GDPR data-subject deletion mechanics (acknowledge them, don't fully design them).

Clarifying Questions to Ask

  • Freshness vs. cost: Do dashboards need sub-minute freshness (streaming aggregation) or is hourly/daily batch roll-up acceptable? This drives the entire processing layer.
  • Query surface: Are queries restricted to a fixed set of pre-defined metrics/funnels, or must we support arbitrary ad-hoc SQL-like exploration over raw events?
  • Identity: How are anonymous device ids stitched to user ids after login (identity resolution), and do we need to retroactively re-attribute pre-login events?
  • Delivery guarantees: Is occasional event loss acceptable, or must every event be durably captured? How accurate must counts be (exact vs. approximate, e.g. HyperLogLog)?
  • Schema governance: Is the event property schema free-form forever, or is there a registry/contract so analysts can trust field types?
  • Privacy/compliance: What PII may events contain, what's the retention/deletion policy, and do we need regional data residency?

What a Strong Answer Covers Premium

Follow-up Questions

  • How do you compute a funnel conversion rate (e.g. view → add_to_cart → purchase within 24 hours per user) efficiently over billions of events? Compare a streaming approach against a daily batch job.
  • A mobile client is offline for two hours, then flushes a backlog of events with old timestamps. How does your processing layer produce correct time-bucketed aggregates despite this late, out-of-order data?
  • A buggy app release suddenly doubles event volume and starts sending malformed payloads. Walk through how the system absorbs the load and isolates the bad data without corrupting downstream metrics or dropping good events.
  • How would you support arbitrary ad-hoc queries from analysts without letting an expensive query take down the cluster serving production dashboards?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Rippling•More Software Engineer•Rippling Software Engineer•Rippling System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.