Design a User Behavior Tracking (Clickstream Analytics) System
Company: Rippling
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
Design a **user behavior tracking system** for a web and mobile product. The system captures user interaction events, transports them reliably, processes them, stores them, and lets product/analytics teams answer questions such as:
- "How many users completed checkout yesterday?"
- "What is the funnel drop-off between the *add to cart* step and the *payment* step?"
- "Show me the ordered sequence of actions a specific user took in the last hour."
Client applications (a JavaScript web SDK, plus iOS and Android SDKs) emit events such as `page_view`, `button_click`, `add_to_cart`, and `purchase`. Each event carries at least: an event name, a user id (or an anonymous device id for logged-out users), a timestamp, a session id, and a free-form JSON property bag (e.g. `{ "page": "/cart", "item_id": "abc", "price": 19.99 }`).
Design the **end-to-end** system: how events are collected from clients, transported, validated/enriched, stored, and queried, plus how it scales, stays reliable, and is observed. Walk through both the **write path** (event ingestion) and the **read path** (powering dashboards and ad-hoc queries).
```hint Where to start
Treat this as a high-throughput, append-heavy **data pipeline**, not a CRUD app. Separate the four stages — **collect → transport → process → store/serve** — and design each independently. A durable log (e.g. Kafka/Kinesis) in the middle decouples bursty client ingestion from slower downstream consumers.
```
```hint Two read patterns, two stores
Notice the queries split into two shapes: (1) **aggregate analytics** over huge time ranges (counts, funnels) and (2) **per-user event lookups**. These have different access patterns — favor a **columnar/OLAP store** (e.g. ClickHouse, Druid, BigQuery) for aggregates and a **wide-row key-value store** keyed by user id for per-user timelines. Don't try to serve both from one schema.
```
```hint Delivery semantics
Clients are unreliable (flaky mobile networks, app kills). Decide explicitly between at-most-once vs at-least-once delivery. At-least-once + a client-generated **idempotency/event id** for downstream dedup is the usual sweet spot — exactly-once end-to-end is expensive and rarely required for analytics.
```
### Constraints & Assumptions
- **Scale (assumed):** ~50M daily active users, ~100 events/user/day ⇒ ~5B events/day ≈ **~58K events/sec average**, with diurnal/peak bursts up to **~300K events/sec**.
- **Event size:** ~1 KB average ⇒ ~5 TB/day of raw event data.
- **Retention:** raw events kept ~30–90 days; rolled-up aggregates kept for years.
- **Latency targets:** event ingestion accepts in <100 ms p99 at the client edge; dashboards/aggregates can be **near-real-time** (fresh within seconds to a couple of minutes); ad-hoc analytical queries may take seconds to minutes.
- **Reliability:** ingestion availability ~99.9%; **at-least-once** delivery is acceptable as long as duplicates can be removed for accurate counts; small, bounded event loss under extreme failure is tolerable but should be measured.
- **Consistency:** analytics is **eventually consistent** — exact real-time accuracy is not required, but counts must converge to correct values.
- Out of scope unless raised: experimentation/AB assignment logic, billing, and GDPR data-subject deletion mechanics (acknowledge them, don't fully design them).
### Clarifying Questions to Ask
- **Freshness vs. cost:** Do dashboards need sub-minute freshness (streaming aggregation) or is hourly/daily batch roll-up acceptable? This drives the entire processing layer.
- **Query surface:** Are queries restricted to a fixed set of pre-defined metrics/funnels, or must we support arbitrary ad-hoc SQL-like exploration over raw events?
- **Identity:** How are anonymous device ids stitched to user ids after login (identity resolution), and do we need to retroactively re-attribute pre-login events?
- **Delivery guarantees:** Is occasional event loss acceptable, or must every event be durably captured? How accurate must counts be (exact vs. approximate, e.g. HyperLogLog)?
- **Schema governance:** Is the event property schema free-form forever, or is there a registry/contract so analysts can trust field types?
- **Privacy/compliance:** What PII may events contain, what's the retention/deletion policy, and do we need regional data residency?
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How do you compute a **funnel conversion rate** (e.g. `view → add_to_cart → purchase` within 24 hours per user) efficiently over billions of events? Compare a streaming approach against a daily batch job.
- A mobile client is offline for two hours, then flushes a backlog of events with old timestamps. How does your processing layer produce **correct time-bucketed aggregates** despite this late, out-of-order data?
- A buggy app release suddenly doubles event volume and starts sending malformed payloads. Walk through how the system **absorbs the load and isolates the bad data** without corrupting downstream metrics or dropping good events.
- How would you support **arbitrary ad-hoc queries** from analysts without letting an expensive query take down the cluster serving production dashboards?
Quick Answer: This question asks for the design of an end-to-end user behavior tracking (clickstream analytics) system, covering event collection, transport, processing, storage, and querying. It evaluates system design skills around high-throughput data pipelines, including trade-offs between streaming and batch processing, storage models for aggregate versus per-user access patterns, and delivery guarantees under unreliable clients.