PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a data scientist's ability to implement efficient multiset intersection and streaming top-k word-frequency computations, covering duplicate-aware counting, dense-ranking, Unicode handling, and strict time/space constraints.

  • Medium
  • Amazon
  • Coding & Algorithms
  • Data Scientist

Implement list overlap and dense-ranked word frequencies

Company: Amazon

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Part A — Multiset overlap count Implement count_overlap(a: List[int], b: List[int]) -> int that returns the size of the multiset intersection of a and b (order doesn’t matter; duplicates count). Constraints: 0 ≤ len(a), len(b) ≤ 1e7; values fit 32-bit signed int. Time O(n + m), space O(min(n, m)). Disallow sorting if it would exceed O((n+m) log(n+m)). Edge cases: empty lists, all duplicates, heavy skew. Examples: • a=[1,1,2,3], b=[1,2,2,4] → overlap elements {1×1,2×1} so return 2 • a=[5,5,5], b=[5,5] → 2 Also implement overlap_elements(a,b) returning the actual multiset intersection as a generator without materializing more than O(min(n,m)) extra memory. Part B — Top-K words with dense ranks Implement top_k_words(words: Iterable[str], k: int=5) -> List[Tuple[str,int,int]] that returns (word, dense_rank, freq) sorted by descending freq then lexicographically for ties. Dense ranks compress gaps: if freqs are [10,10,8,7,7,7,5], ranks are [1,1,2,3,3,3,4]. Complexity: single pass over input (streaming-friendly), O(U) memory for U unique words. Handle Unicode and case-folding option via a flag. Example input: ["a","b","b","c","c","c","d","d","d","e"] Expected output (k=5): [("c",1,3),("d",1,3),("b",2,2),("a",3,1),("e",3,1)] Follow-ups: analyze time/space; discuss alternatives for 10^9 tokens (approximate top-k via count-min sketch + heap) and stability considerations.

Quick Answer: This question evaluates a data scientist's ability to implement efficient multiset intersection and streaming top-k word-frequency computations, covering duplicate-aware counting, dense-ranking, Unicode handling, and strict time/space constraints.

Multiset Overlap Count

Return the size of the multiset intersection, or the sorted overlap elements when requested.

Constraints

  • Values fit in 32-bit signed integers

Examples

Input: ([1, 1, 2, 3], [1, 2, 2, 4], False)

Expected Output: 2

Explanation: Counts duplicate-limited overlap.

Input: ([5, 5, 5], [5, 5], False)

Expected Output: 2

Explanation: All duplicates.

Input: ([], [1], False)

Expected Output: 0

Explanation: Empty input.

Input: ([3, 1, 1], [1, 1, 1, 3], True)

Expected Output: [1, 1, 3]

Explanation: Return actual overlap elements.

Hints

  1. Count the smaller list, then consume counts while scanning the larger list.

Top-K Words with Dense Ranks

Return (word, dense_rank, freq) sorted by descending frequency then lexicographically.

Constraints

  • Dense ranks compress frequency gaps

Examples

Input: (['a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e'], 5, False)

Expected Output: [('c', 1, 3), ('d', 1, 3), ('b', 2, 2), ('a', 3, 1), ('e', 3, 1)]

Explanation: Prompt example.

Input: (['Apple', 'apple', 'banana'], 2, True)

Expected Output: [('apple', 1, 2), ('banana', 2, 1)]

Explanation: Case-folding option.

Input: ([], 5, False)

Expected Output: []

Explanation: No words.

Hints

  1. Sort unique words by (-freq, word), then increment rank only when frequency changes.
Last updated: Jun 27, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Top-p (Nucleus) Sampling in NumPy - Amazon (medium)
  • Implement Multi-Head Attention from Scratch in NumPy - Amazon (medium)
  • Detect and Break a Cycle in a Singly Linked List - Amazon (medium)
  • Caesar Cipher with Translation-Table Optimization - Amazon (medium)
  • Minimum Drone Delivery Time on a Ring of Hubs - Amazon (medium)