You are building a fast KPI dashboard for Nextagram. The raw tables are:
-
users
(user_id PK, username, created_at TIMESTAMP)
-
photos
(photo_id PK, user_id, created_at TIMESTAMP)
-
follows
(follower_user_id, followed_user_id, created_at TIMESTAMP)
Assume raw timestamps are stored in PST initially.
The dashboard is reported daily (one row per report date). Define the daily KPIs as:
-
new_users
: users created on that report date
-
new_users_change_vs_7d_ago
: new_users(today) − new_users(today − 7 days)
-
DAU
: distinct users who performed at least one “active” action that day (define “active” as: posted a photo OR followed someone)
-
DAU_change_vs_7d_ago
-
new_photos
: photos created on that report date
-
new_photos_per_DAU
: new_photos / DAU
-
new_photos_change_vs_7d_ago
-
followers
: total number of follow relationships as-of end of that report date (i.e., cumulative count of rows in follows with created_at <= end_of_day)
-
%_users_with_0_followers
: percent of users whose follower count (inbound follows) is 0 as-of end of that date
-
%_users_with_1_2_followers
-
%_users_with_3plus_followers
Tasks:
A) Data model / table design
-
Design the derived table(s) that power this dashboard for a huge dataset (fast reads).
-
Specify table schemas (columns, grain, and keys), and what is precomputed.
-
List key considerations (partitioning/clustering, incremental loads, idempotency, late-arriving data, correctness vs cost, etc.).
B) Daily population strategy
-
Describe how you would populate/refresh these tables every day from the raw tables (high-level ETL steps; SQL/pseudocode is acceptable).
C) Add 1-week retention
-
Add a
1-week retention
metric to the dashboard: for each cohort day D (users who joined on D), what fraction are active on D+7 (using the same “active” definition as DAU).
-
Describe required table changes and how you would compute it efficiently.
D) Switch to weekly rollups
-
The dashboard should support
weekly
rollups instead of daily. Describe what changes in the table design and computation (date grain, definitions of “change vs 7d ago”, handling partial weeks, etc.).
E) Report in a different timezone
-
A New York office wants metrics reported in
EST
instead of PST. Describe what must change (timestamp storage, date boundaries, ETL windows, backfills, and how to avoid double-counting at day boundaries).