Write SQL for app metrics
Company: Nextdoor
Role: Data Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates proficiency in SQL-based data engineering, focusing on time-zone-aware timestamp handling, aggregation of event and relationship data, cohort retention analysis, and computation of product metrics like DAU, new users, photo activity, and follower distributions.
Part 1: Count the Total Number of Users
Constraints
- 0 <= len(users) <= 200000
- Each row has exactly 3 fields
- user_id is a non-empty string
- joined_at_utc is a valid ISO 8601 UTC timestamp ending in 'Z'
Examples
Input: ([('u1', 'Ana', '2024-01-01T00:00:00Z'), ('u2', 'Bob', '2024-01-02T00:00:00Z'), ('u3', 'Cam', '2024-01-03T00:00:00Z')],)
Expected Output: 3
Explanation: There are 3 different user IDs.
Input: ([('u1', 'Ana', '2024-01-01T00:00:00Z'), ('u1', 'Ana', '2024-01-01T00:00:00Z'), ('u2', 'Bob', '2024-01-02T00:00:00Z')],)
Expected Output: 2
Explanation: The duplicate row for u1 should only be counted once.
Input: ([],)
Expected Output: 0
Explanation: No rows means no users.
Input: ([('solo', 'OnlyUser', '2024-05-10T12:00:00Z')],)
Expected Output: 1
Explanation: A single row represents one distinct user.
Hints
- You only need the first field of each row.
- A set is useful when you need to count unique identifiers.
Part 2: Calendar Month with the Most New User Signups
Constraints
- 0 <= len(users) <= 200000
- Each timestamp is a valid ISO 8601 UTC timestamp ending in 'Z'
- Use the America/Los_Angeles timezone before extracting the calendar month
Examples
Input: ([('u1', 'Ana', '2024-01-01T10:00:00Z'), ('u2', 'Bob', '2024-01-15T12:00:00Z'), ('u3', 'Cam', '2024-02-01T09:00:00Z'), ('u4', 'Dan', '2024-02-20T18:00:00Z'), ('u5', 'Eli', '2024-02-28T23:00:00Z')],)
Expected Output: '2024-02'
Explanation: January has 2 signups and February has 3.
Input: ([('u1', 'A', '2024-01-01T07:30:00Z'), ('u2', 'B', '2024-01-31T23:00:00Z'), ('u3', 'C', '2024-02-01T07:30:00Z'), ('u4', 'D', '2024-02-15T12:00:00Z'), ('u5', 'E', '2024-03-01T07:30:00Z')],)
Expected Output: '2024-01'
Explanation: After timezone conversion, counts are 2023-12: 1, 2024-01: 2, 2024-02: 2. The tie is broken by earliest month.
Input: ([],)
Expected Output: ''
Explanation: No users means no winning month.
Input: ([('u1', 'A', '2024-01-01T00:30:00Z')],)
Expected Output: '2023-12'
Explanation: 2024-01-01 00:30 UTC is still 2023-12-31 in Los Angeles.
Hints
- Convert each timestamp to local time before truncating to month.
- Count signups per month, then apply the tie-break rule by earliest month.
Part 3: Find the User with the Largest Follower Count
Constraints
- 0 <= len(users) <= 200000
- 0 <= len(follows) <= 200000
- Count distinct follower_user_id values per followed_user_id
- Timestamps are provided but are not needed for this task
Examples
Input: ([('u1', 'A', '2024-01-01T00:00:00Z'), ('u2', 'B', '2024-01-01T00:00:00Z'), ('u3', 'C', '2024-01-01T00:00:00Z')], [('u2', 'u1', '2024-01-02T00:00:00Z'), ('u3', 'u1', '2024-01-02T00:00:00Z'), ('u1', 'u2', '2024-01-03T00:00:00Z')])
Expected Output: 'u1'
Explanation: u1 has 2 distinct followers, more than any other user.
Input: ([('u1', 'A', '2024-01-01T00:00:00Z'), ('u2', 'B', '2024-01-01T00:00:00Z'), ('u3', 'C', '2024-01-01T00:00:00Z')], [('u2', 'u1', '2024-01-02T00:00:00Z'), ('u3', 'u1', '2024-01-02T00:00:00Z'), ('u1', 'u2', '2024-01-03T00:00:00Z'), ('u3', 'u2', '2024-01-03T00:00:00Z')])
Expected Output: 'u1'
Explanation: u1 and u2 both have 2 followers, so the lexicographically smaller ID wins.
Input: ([('u2', 'B', '2024-01-01T00:00:00Z'), ('u1', 'A', '2024-01-01T00:00:00Z')], [])
Expected Output: 'u1'
Explanation: Both users have 0 followers, so return the smaller user_id.
Input: ([('u1', 'A', '2024-01-01T00:00:00Z'), ('u2', 'B', '2024-01-01T00:00:00Z'), ('u3', 'C', '2024-01-01T00:00:00Z')], [('u2', 'u1', '2024-01-02T00:00:00Z'), ('u2', 'u1', '2024-01-03T00:00:00Z'), ('u3', 'u1', '2024-01-04T00:00:00Z')])
Expected Output: 'u1'
Explanation: Duplicate follows from the same follower count only once because followers are distinct.
Input: ([], [('u1', 'u2', '2024-01-01T00:00:00Z')])
Expected Output: ''
Explanation: No users means no valid winner.
Hints
- Build a set of followers for each followed user.
- Do not forget users with zero followers; one of them could win if there are no follow rows.
Part 4: Populate Daily Aggregate Tables for Dashboard Metrics
Constraints
- 0 <= len(users), len(photos), len(follows), len(user_events) <= 200000
- start_date <= end_date
- All timestamps are valid ISO 8601 UTC timestamps ending in 'Z'
- The number of days in the requested date range is at most 366
- Assume user IDs in users are unique and follow rows represent active edges that remain active forever
Examples
Input: ([('u1', 'A', '2024-01-01T10:00:00Z'), ('u2', 'B', '2024-01-02T07:30:00Z'), ('u3', 'C', '2024-01-03T08:30:00Z')], [('p1', 'u1', '2024-01-02T06:00:00Z'), ('p2', 'u2', '2024-01-03T09:00:00Z')], [('u2', 'u1', '2024-01-02T08:00:00Z'), ('u3', 'u1', '2024-01-03T18:00:00Z')], [('e1', 'u1', 'open', '2024-01-02T01:00:00Z'), ('e2', 'u1', 'like', '2024-01-02T23:00:00Z'), ('e3', 'u2', 'open', '2024-01-03T07:20:00Z'), ('e4', 'u2', 'open', '2024-01-03T09:00:00Z')], '2024-01-01', '2024-01-03')
Expected Output: {'kpi_daily_base': [['2024-01-01', 2, 2, 1, 1, 0], ['2024-01-02', 0, 2, 2, 0, 1], ['2024-01-03', 1, 3, 1, 1, 2]], 'follower_bucket_daily': [['2024-01-01', '0', 2], ['2024-01-01', '1_2', 0], ['2024-01-01', '3_plus', 0], ['2024-01-02', '0', 1], ['2024-01-02', '1_2', 1], ['2024-01-02', '3_plus', 0], ['2024-01-03', '0', 2], ['2024-01-03', '1_2', 1], ['2024-01-03', '3_plus', 0]]}
Explanation: This covers timezone conversion, DAU deduplication, cumulative users, cumulative follows, and bucket updates.
Input: ([], [], [], [], '2024-05-01', '2024-05-01')
Expected Output: {'kpi_daily_base': [['2024-05-01', 0, 0, 0, 0, 0]], 'follower_bucket_daily': [['2024-05-01', '0', 0], ['2024-05-01', '1_2', 0], ['2024-05-01', '3_plus', 0]]}
Explanation: The date still appears even when all raw tables are empty.
Input: ([('u1', 'A', '2024-02-01T09:00:00Z'), ('u2', 'B', '2024-02-01T10:00:00Z'), ('u3', 'C', '2024-02-01T11:00:00Z'), ('u4', 'D', '2024-02-01T12:00:00Z')], [], [('u2', 'u1', '2024-02-01T20:00:00Z'), ('u3', 'u1', '2024-02-01T21:00:00Z'), ('u4', 'u1', '2024-02-02T18:00:00Z')], [], '2024-02-01', '2024-02-02')
Expected Output: {'kpi_daily_base': [['2024-02-01', 4, 4, 0, 0, 2], ['2024-02-02', 0, 4, 0, 0, 3]], 'follower_bucket_daily': [['2024-02-01', '0', 3], ['2024-02-01', '1_2', 1], ['2024-02-01', '3_plus', 0], ['2024-02-02', '0', 3], ['2024-02-02', '1_2', 0], ['2024-02-02', '3_plus', 1]]}
Explanation: u1 moves from bucket '1_2' to '3_plus' on the second day.
Input: ([('u1', 'A', '2024-01-01T12:00:00Z'), ('u2', 'B', '2024-01-03T12:00:00Z')], [], [('u2', 'u1', '2024-01-02T12:00:00Z')], [], '2024-01-03', '2024-01-04')
Expected Output: {'kpi_daily_base': [['2024-01-03', 1, 2, 0, 0, 1], ['2024-01-04', 0, 2, 0, 0, 1]], 'follower_bucket_daily': [['2024-01-03', '0', 1], ['2024-01-03', '1_2', 1], ['2024-01-03', '3_plus', 0], ['2024-01-04', '0', 1], ['2024-01-04', '1_2', 1], ['2024-01-04', '3_plus', 0]]}
Explanation: Data before start_date must still affect cumulative end-of-day counts.
Hints
- Precompute the local date for every raw row first.
- A daily sweep with cumulative totals is simpler than recomputing each day from scratch.
Part 5: Compute Week-1 Retention Cohorts
Constraints
- 0 <= len(users), len(user_events) <= 200000
- All timestamps are valid ISO 8601 UTC timestamps ending in 'Z'
- Use the America/Los_Angeles timezone before determining dates
- A user should be counted at most once in retained_w1_users for their cohort
Examples
Input: ([('u1', 'A', '2024-01-01T18:00:00Z'), ('u2', 'B', '2024-01-02T07:30:00Z'), ('u3', 'C', '2024-01-03T08:30:00Z')], [('e1', 'u1', 'open', '2024-01-08T20:00:00Z'), ('e2', 'u1', 'like', '2024-01-15T20:00:00Z'), ('e3', 'u2', 'open', '2024-01-14T07:30:00Z'), ('e4', 'u3', 'open', '2024-01-10T10:00:00Z')])
Expected Output: [['2024-01-01', 2, 2], ['2024-01-03', 1, 1]]
Explanation: u1 and u2 are both retained for the 2024-01-01 cohort, and u3 is retained for the 2024-01-03 cohort.
Input: ([('u1', 'A', '2024-01-01T12:00:00Z'), ('u2', 'B', '2024-01-01T13:00:00Z')], [('e1', 'u1', 'open', '2024-01-07T20:00:00Z'), ('e2', 'u2', 'open', '2024-01-15T08:30:00Z')])
Expected Output: [['2024-01-01', 2, 0]]
Explanation: One event is too early and the other is too late, so neither user is retained.
Input: ([('u1', 'A', '2024-02-01T18:00:00Z')], [('e1', 'u1', 'open', '2024-02-08T18:00:00Z'), ('e2', 'u1', 'like', '2024-02-09T18:00:00Z')])
Expected Output: [['2024-02-01', 1, 1]]
Explanation: Multiple qualifying events still count as one retained user.
Input: ([], [('e1', 'u1', 'open', '2024-01-08T18:00:00Z')])
Expected Output: []
Explanation: No users means no cohorts.
Hints
- Group users by local signup date first.
- For each user, storing the set of local event dates is enough; multiple events on the same day do not change retention.