Implement top-K over a stream
Company: Coinbase
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Take-home Project
Quick Answer: This question evaluates a candidate's proficiency in stream-processing algorithms and scalable data structures for frequency estimation and top-K queries, emphasizing analysis of time/space trade-offs, sliding-window semantics, and distributed aggregation.
Constraints
- 1 <= number of operations <= 10^5
- Each insert item is a non-empty string.
- 0 <= k for a topk query; if k exceeds the number of distinct items, return all of them.
- Frequency ties are broken by item value in ascending (lexicographic) order.
Examples
Input: ([['insert', 'a'], ['insert', 'b'], ['insert', 'a'], ['topk', 1]],)
Expected Output: [['a']]
Explanation: a has count 2, b has count 1; the single most frequent item is a.
Input: ([['insert', 'x'], ['insert', 'y'], ['insert', 'x'], ['insert', 'z'], ['insert', 'y'], ['insert', 'x'], ['topk', 2]],)
Expected Output: [['x', 'y']]
Explanation: Counts: x=3, y=2, z=1. Top 2 are x then y.
Input: ([['insert', 'a'], ['insert', 'b'], ['topk', 5]],)
Expected Output: [['a', 'b']]
Explanation: Only 2 distinct items exist, fewer than k=5, so both are returned; a and b tie at count 1 and are ordered ascending.
Input: ([['topk', 3]],)
Expected Output: [[]]
Explanation: No items have been inserted, so the top-K of an empty stream is an empty list.
Input: ([['insert', 'a'], ['insert', 'a'], ['topk', 1], ['insert', 'b'], ['insert', 'b'], ['insert', 'b'], ['topk', 1]],)
Expected Output: [['a'], ['b']]
Explanation: After the first two inserts a leads (count 2); after three b inserts, b (count 3) overtakes a, so the second query returns b.
Input: ([['insert', 'c'], ['insert', 'a'], ['insert', 'b'], ['topk', 2]],)
Expected Output: [['a', 'b']]
Explanation: All three items tie at count 1; the deterministic ascending tie-break orders them a, b, c, so the top 2 are a and b.
Hints
- Maintain a hash map from item to its running count; an insert is a single increment.
- For a topk query you do not need a full global ordering — a min-heap of size k over (count, item) pairs gives the answer in O(D log k).
- Decide a deterministic tie-break (here: lower item value wins) so equal-frequency items have a stable order.
- For the sliding-window follow-up, bucket counts by time slice and subtract expired buckets; for the distributed follow-up, partition by item hash and merge per-partition top-K results.