Write SQL for session analytics
Company: Shopify
Role: Data Engineer
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Technical Screen
Quick Answer: This question evaluates SQL proficiency and data engineering competencies, including deduplication, joins, time-series aggregation, rolling-window calculations, calendar generation, and URL query-parameter extraction.
Part 1: Daily Total Sessions
Constraints
- 0 <= len(sessions) <= 200000
- `started_at` is always a valid timestamp string in `YYYY-MM-DD HH:MM:SS` format
- Multiple sessions may occur on the same day
Examples
Input: ([('s1', 1, '2024-01-01 09:00:00', '/a'), ('s2', 1, '2024-01-01 10:15:00', '/b'), ('s3', 2, '2024-01-02 08:00:00', '/c')],)
Expected Output: [('2024-01-01', 2), ('2024-01-02', 1)]
Explanation: Two sessions fall on 2024-01-01 and one on 2024-01-02.
Input: ([('s1', 2, '2024-03-10 23:59:59', '/'), ('s2', 2, '2024-03-09 00:00:00', '/'), ('s3', 3, '2024-03-10 00:00:00', '/')],)
Expected Output: [('2024-03-09', 1), ('2024-03-10', 2)]
Explanation: The input is unsorted, but the result must be sorted by day.
Input: ([('only', 7, '2024-05-05 12:30:00', '/x')],)
Expected Output: [('2024-05-05', 1)]
Explanation: Single-session edge case.
Input: ([],)
Expected Output: []
Explanation: No sessions means no output rows.
Hints
- You only need the date part of `started_at`, not the time.
- A hash map keyed by day is enough to build the counts efficiently.
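Following those hints, here is a minimal Python reference sketch of the SQL logic (the function name `daily_total_sessions` is ours; a real answer would be a `GROUP BY` on the date part of `started_at`):

```python
from collections import defaultdict

def daily_total_sessions(sessions):
    """Count sessions per calendar day, sorted by day.

    Each session is (session_id, shop_id, started_at, path); because
    started_at is always 'YYYY-MM-DD HH:MM:SS', the day is simply the
    first 10 characters of the string.
    """
    counts = defaultdict(int)
    for _sid, _shop_id, started_at, _path in sessions:
        counts[started_at[:10]] += 1
    # Sorting the (day, count) pairs lexicographically sorts by day,
    # since the dates share a fixed zero-padded format.
    return sorted(counts.items())
```

The fixed timestamp format is what makes the string slice and the lexicographic sort safe; with free-form dates you would parse to `datetime.date` first.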
Part 2: Daily Sessions by Shop
Constraints
- 0 <= len(shops), len(sessions) <= 200000
- `created_at` and `started_at` are valid ISO-like timestamps in `YYYY-MM-DD HH:MM:SS` format
- Deduplicate shops before counting or joining: for each `shop_id`, keep the row with the earliest `created_at`, breaking ties by the lexicographically smaller name
Examples
Input: ([(1, 'Alpha', '2023-01-02 00:00:00'), (1, 'Alpha Old', '2023-01-01 00:00:00'), (2, 'Beta', '2023-01-01 00:00:00')], [('s1', 1, '2024-02-01 09:00:00', '/'), ('s2', 1, '2024-02-01 11:00:00', '/'), ('s3', 2, '2024-02-01 12:00:00', '/'), ('s4', 1, '2024-02-02 10:00:00', '/')])
Expected Output: [('2024-02-01', 1, 'Alpha Old', 2), ('2024-02-01', 2, 'Beta', 1), ('2024-02-02', 1, 'Alpha Old', 1)]
Explanation: For `shop_id = 1`, the earliest `created_at` row is kept, so the attached name is `Alpha Old`.
Input: ([(1, 'Zoo', '2023-01-01 00:00:00'), (1, 'Ace', '2023-01-01 00:00:00')], [('s1', 1, '2024-01-01 00:00:00', '/')])
Expected Output: [('2024-01-01', 1, 'Ace', 1)]
Explanation: The timestamps tie, so the lexicographically smaller name `Ace` wins.
Input: ([(1, 'Only Shop', '2023-01-01 00:00:00')], [('s1', 2, '2024-01-01 10:00:00', '/')])
Expected Output: [('2024-01-01', 2, None, 1)]
Explanation: A session can reference a missing shop; keep the row and use `None` as the name.
Input: ([(1, 'A', '2023-01-01 00:00:00')], [])
Expected Output: []
Explanation: No sessions means no daily shop counts.
Hints
- Solve this in two phases: first build one canonical row per `shop_id`, then aggregate sessions by `(day, shop_id)`.
- Because the timestamp format is fixed, comparing `created_at` strings lexicographically gives the same order as comparing actual timestamps.
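The two-phase approach from the hints can be sketched in Python as follows (the function name `daily_sessions_by_shop` is ours; in SQL this would be a window-function or `GROUP BY` dedup CTE followed by a `LEFT JOIN`):

```python
from collections import defaultdict

def daily_sessions_by_shop(shops, sessions):
    # Phase 1: one canonical row per shop_id. Comparing
    # (created_at, name) tuples keeps the earliest created_at and
    # breaks ties by the lexicographically smaller name.
    canonical = {}
    for shop_id, name, created_at in shops:
        key = (created_at, name)
        if shop_id not in canonical or key < canonical[shop_id]:
            canonical[shop_id] = key
    names = {shop_id: name for shop_id, (_, name) in canonical.items()}

    # Phase 2: count sessions per (day, shop_id), then attach the
    # canonical name -- None when the shop_id has no shops row
    # (the LEFT JOIN semantics from the examples).
    counts = defaultdict(int)
    for _sid, shop_id, started_at, _path in sessions:
        counts[(started_at[:10], shop_id)] += 1
    return [(day, shop_id, names.get(shop_id), n)
            for (day, shop_id), n in sorted(counts.items())]
```

Note that phase 1 never materializes a sorted copy of `shops`; a single pass with tuple comparison is enough.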
Part 3: Daily Sessions by Shop with 7-Day Rolling Average Excluding Today
Constraints
- 0 <= len(sessions) <= 100000
- `started_at` is always a valid timestamp string in `YYYY-MM-DD HH:MM:SS` format
- The product of distinct shop count and number of days in the reporting window does not exceed 200000
- Round each rolling average to 6 decimal places
Examples
Input: ([('a', 1, '2024-01-01 09:00:00', '/'), ('b', 1, '2024-01-03 10:00:00', '/'), ('c', 1, '2024-01-03 12:00:00', '/'), ('d', 2, '2024-01-02 08:00:00', '/')],)
Expected Output: [('2024-01-01', 1, 1, 0.0), ('2024-01-01', 2, 0, 0.0), ('2024-01-02', 1, 0, 0.142857), ('2024-01-02', 2, 1, 0.0), ('2024-01-03', 1, 2, 0.142857), ('2024-01-03', 2, 0, 0.142857)]
Explanation: The calendar spans 2024-01-01 through 2024-01-03, and missing shop/day combinations must be filled with 0.
Input: ([('x', 5, '2024-05-01 00:00:00', '/')],)
Expected Output: [('2024-05-01', 5, 1, 0.0)]
Explanation: On the first day there are no previous days inside the 7-day lookback, so the average is 0.
Input: ([('s1', 1, '2024-01-01 00:00:00', '/'), ('s2', 1, '2024-01-02 00:00:00', '/'), ('s3', 1, '2024-01-03 00:00:00', '/'), ('s4', 1, '2024-01-04 00:00:00', '/'), ('s5', 1, '2024-01-05 00:00:00', '/'), ('s6', 1, '2024-01-06 00:00:00', '/'), ('s7', 1, '2024-01-07 00:00:00', '/'), ('s8', 1, '2024-01-08 00:00:00', '/')],)
Expected Output: [('2024-01-01', 1, 1, 0.0), ('2024-01-02', 1, 1, 0.142857), ('2024-01-03', 1, 1, 0.285714), ('2024-01-04', 1, 1, 0.428571), ('2024-01-05', 1, 1, 0.571429), ('2024-01-06', 1, 1, 0.714286), ('2024-01-07', 1, 1, 0.857143), ('2024-01-08', 1, 1, 1.0)]
Explanation: By 2024-01-08, all seven previous days each had one session, so the rolling average is 1.0.
Input: ([],)
Expected Output: []
Explanation: Empty input produces no rows.
Hints
- First count sessions by `(shop_id, day)`, then generate every calendar day between the minimum and maximum session date.
- For each shop, a sliding window over daily counts lets you compute the rolling average in linear time over the date range.
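A Python reference sketch of the calendar fill plus sliding window (the function name `rolling_daily_sessions` is ours; in SQL this would be a date spine cross-joined with shops, then a `ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING` window). Note the examples show the sum is always divided by 7, even when fewer than 7 prior days exist in the window:

```python
from collections import defaultdict
from datetime import date, timedelta

def rolling_daily_sessions(sessions):
    if not sessions:
        return []
    # Daily session counts keyed by (shop_id, date).
    counts = defaultdict(int)
    for _sid, shop_id, started_at, _path in sessions:
        counts[(shop_id, date.fromisoformat(started_at[:10]))] += 1

    # Calendar spine: every day from the min to the max session date.
    days = [d for _, d in counts]
    start, end = min(days), max(days)
    calendar = [start + timedelta(days=i)
                for i in range((end - start).days + 1)]
    shops = sorted({shop_id for shop_id, _ in counts})

    out = []
    window = {s: [] for s in shops}  # up to 7 most recent daily counts
    running = {s: 0 for s in shops}  # sum of the counts in window[s]
    for d in calendar:               # outer loop on day keeps output
        for s in shops:              # sorted by (day, shop_id)
            # Average over the previous 7 days, today excluded;
            # the denominator is a fixed 7 per the examples.
            avg = round(running[s] / 7, 6)
            today = counts.get((s, d), 0)
            out.append((d.isoformat(), s, today, avg))
            window[s].append(today)
            running[s] += today
            if len(window[s]) > 7:
                running[s] -= window[s].pop(0)
    return out
```

The sliding window keeps the whole pass linear in `shops × days`, which the 200000 bound on that product makes tractable.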
Part 4: Extract utm_campaign from a URL
Constraints
- 0 <= len(sessions) <= 200000
- Each URL is a valid string
- The parameter name match is case-sensitive and must be exactly `utm_campaign`
Examples
Input: ([('s1', 'https://a.com/?utm_source=google&utm_campaign=summer_sale'), ('s2', 'https://a.com/path?x=1')],)
Expected Output: [('s1', 'summer_sale'), ('s2', None)]
Explanation: The first URL contains `utm_campaign`; the second does not.
Input: ([('s1', 'https://a.com/p?x=1&utm_campaign=winter&y=2#frag')],)
Expected Output: [('s1', 'winter')]
Explanation: The target parameter is present in the query string even though it is not first.
Input: ([('s1', 'https://a.com/?utm_campaign=&utm_source=x&utm_campaign=backup'), ('s2', 'https://a.com/?utm_campaign=')],)
Expected Output: [('s1', 'backup'), ('s2', None)]
Explanation: Use the first non-empty occurrence if the key repeats; an empty-only parameter returns `None`.
Input: ([],)
Expected Output: []
Explanation: Empty input yields empty output.
Hints
- Isolate the query string (between `?` and `#`) before looking for parameters.
- Be careful with repeated keys and blank values.
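A minimal Python sketch of the extraction (the function name `extract_utm_campaign` is ours). It parses the query string by hand to make the repeated-key and empty-value rules explicit; under different dedup rules `urllib.parse.parse_qs` would be the idiomatic choice:

```python
def extract_utm_campaign(sessions):
    """Map (session_id, url) pairs to (session_id, campaign-or-None)."""
    out = []
    for sid, url in sessions:
        campaign = None
        if '?' in url:
            # Query string sits between the first '?' and any '#fragment'.
            query = url.split('?', 1)[1].split('#', 1)[0]
            for pair in query.split('&'):
                key, _, value = pair.partition('=')
                # Exact, case-sensitive key match; the first non-empty
                # value wins, so 'utm_campaign=' alone yields None.
                if key == 'utm_campaign' and value:
                    campaign = value
                    break
        out.append((sid, campaign))
    return out
```

In SQL this typically becomes a dialect-specific expression (e.g. a regex or string-split function), so interviewers usually accept any approach that handles the fragment, repeated keys, and blank values correctly.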