Implement list overlap and dense-ranked word frequencies
Company: Amazon
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Part A — Multiset overlap count
Implement count_overlap(a: List[int], b: List[int]) -> int that returns the size of the multiset intersection of a and b (order doesn’t matter; duplicates count). Constraints: 0 ≤ len(a), len(b) ≤ 1e7; values fit 32-bit signed int. Time O(n + m), space O(min(n, m)). Disallow sorting if it would exceed O((n+m) log(n+m)). Edge cases: empty lists, all duplicates, heavy skew.
Examples:
• a=[1,1,2,3], b=[1,2,2,4] → overlap elements {1×1,2×1} so return 2
• a=[5,5,5], b=[5,5] → 2
Also implement overlap_elements(a,b) returning the actual multiset intersection as a generator without materializing more than O(min(n,m)) extra memory.
Part B — Top-K words with dense ranks
Implement top_k_words(words: Iterable[str], k: int=5) -> List[Tuple[str,int,int]] that returns (word, dense_rank, freq) sorted by descending freq then lexicographically for ties. Dense ranks compress gaps: if freqs are [10,10,8,7,7,7,5], ranks are [1,1,2,3,3,3,4]. Complexity: single pass over input (streaming-friendly), O(U) memory for U unique words. Handle Unicode and case-folding option via a flag.
Example input: ["a","b","b","c","c","c","d","d","d","e"]
Expected output (k=5): [("c",1,3),("d",1,3),("b",2,2),("a",3,1),("e",3,1)]
Follow-ups: analyze time/space; discuss alternatives for 10^9 tokens (approximate top-k via count-min sketch + heap) and stability considerations.
Quick Answer: This question evaluates a data scientist's ability to implement efficient multiset intersection and streaming top-k word-frequency computations, covering duplicate-aware counting, dense-ranking, Unicode handling, and strict time/space constraints.