Implement list overlap and dense-ranked word frequencies
Company: Amazon
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Part A — Multiset overlap count
Implement count_overlap(a: List[int], b: List[int]) -> int that returns the size of the multiset intersection of a and b (order doesn’t matter; duplicates count). Constraints: 0 ≤ len(a), len(b) ≤ 1e7; values fit 32-bit signed int. Time O(n + m), space O(min(n, m)). Disallow sorting if it would exceed O((n+m) log(n+m)). Edge cases: empty lists, all duplicates, heavy skew.
Examples:
• a=[1,1,2,3], b=[1,2,2,4] → overlap elements {1×1,2×1} so return 2
• a=[5,5,5], b=[5,5] → 2
Also implement overlap_elements(a,b) returning the actual multiset intersection as a generator without materializing more than O(min(n,m)) extra memory.
Part B — Top-K words with dense ranks
Implement top_k_words(words: Iterable[str], k: int=5) -> List[Tuple[str,int,int]] that returns (word, dense_rank, freq) sorted by descending freq then lexicographically for ties. Dense ranks compress gaps: if freqs are [10,10,8,7,7,7,5], ranks are [1,1,2,3,3,3,4]. Complexity: single pass over input (streaming-friendly), O(U) memory for U unique words. Handle Unicode and case-folding option via a flag.
Example input: ["a","b","b","c","c","c","d","d","d","e"]
Expected output (k=5): [("c",1,3),("d",1,3),("b",2,2),("a",3,1),("e",3,1)]
Follow-ups: analyze time/space; discuss alternatives for 10^9 tokens (approximate top-k via count-min sketch + heap) and stability considerations.
Quick Answer: This question evaluates a data scientist's ability to implement efficient multiset intersection and streaming top-k word-frequency computations, covering duplicate-aware counting, dense-ranking, Unicode handling, and strict time/space constraints.
Multiset Overlap Count
Return the size of the multiset intersection, or the sorted overlap elements when requested.
Constraints
- Values fit in 32-bit signed integers
Examples
Input: ([1, 1, 2, 3], [1, 2, 2, 4], False)
Expected Output: 2
Explanation: Counts duplicate-limited overlap.
Input: ([5, 5, 5], [5, 5], False)
Expected Output: 2
Explanation: All duplicates.
Input: ([], [1], False)
Expected Output: 0
Explanation: Empty input.
Input: ([3, 1, 1], [1, 1, 1, 3], True)
Expected Output: [1, 1, 3]
Explanation: Return actual overlap elements.
Hints
- Count the smaller list, then consume counts while scanning the larger list.
Top-K Words with Dense Ranks
Return (word, dense_rank, freq) sorted by descending frequency then lexicographically.
Constraints
- Dense ranks compress frequency gaps
Examples
Input: (['a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e'], 5, False)
Expected Output: [('c', 1, 3), ('d', 1, 3), ('b', 2, 2), ('a', 3, 1), ('e', 3, 1)]
Explanation: Prompt example.
Input: (['Apple', 'apple', 'banana'], 2, True)
Expected Output: [('apple', 1, 2), ('banana', 2, 1)]
Explanation: Case-folding option.
Input: ([], 5, False)
Expected Output: []
Explanation: No words.
Hints
- Sort unique words by (-freq, word), then increment rank only when frequency changes.