Design KPI tables, retention, rollups, timezone
Company: Nextdoor
Role: Data Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
You are building a fast KPI dashboard for **Nextagram**. The raw tables are:
- **users**(user_id PK, username, created_at TIMESTAMP)
- **photos**(photo_id PK, user_id, created_at TIMESTAMP)
- **follows**(follower_user_id, followed_user_id, created_at TIMESTAMP)
Assume raw timestamps are stored in **PST** initially.
The dashboard is reported **daily** (one row per report date). Define the daily KPIs as:
- **new_users**: users created on that report date
- **new_users_change_vs_7d_ago**: new_users(today) − new_users(today − 7 days)
- **DAU**: distinct users who performed at least one “active” action that day (define “active” as: posted a photo OR followed someone)
- **DAU_change_vs_7d_ago**
- **new_photos**: photos created on that report date
- **new_photos_per_DAU**: new_photos / DAU
- **new_photos_change_vs_7d_ago**
- **followers**: total number of follow relationships as-of end of that report date (i.e., cumulative count of rows in follows with created_at <= end_of_day)
- **%_users_with_0_followers**: percent of users whose follower count (inbound follows) is 0 as-of end of that date
- **%_users_with_1_2_followers**
- **%_users_with_3plus_followers**
Tasks:
**A) Data model / table design**
- Design the derived table(s) that power this dashboard for a huge dataset (fast reads).
- Specify table schemas (columns, grain, and keys), and what is precomputed.
- List key considerations (partitioning/clustering, incremental loads, idempotency, late-arriving data, correctness vs cost, etc.).
**B) Daily population strategy**
- Describe how you would populate/refresh these tables every day from the raw tables (high-level ETL steps; SQL/pseudocode is acceptable).
**C) Add 1-week retention**
- Add a **1-week retention** metric to the dashboard: for each cohort day D (users who joined on D), what fraction are active on D+7 (using the same “active” definition as DAU).
- Describe required table changes and how you would compute it efficiently.
**D) Switch to weekly rollups**
- The dashboard should support **weekly** rollups instead of daily. Describe what changes in the table design and computation (date grain, definitions of “change vs 7d ago”, handling partial weeks, etc.).
**E) Report in a different timezone**
- A New York office wants metrics reported in **EST** instead of PST. Describe what must change (timestamp storage, date boundaries, ETL windows, backfills, and how to avoid double-counting at day boundaries).
Quick Answer: This question evaluates data engineering and analytical-system design competencies, including derived table schema design, ETL and incremental load strategies, time-series rollups, cohort/retention analysis, aggregation correctness, and timezone-aware event handling, and it sits in the System Design / Data Engineering domain.