Build next-word predictor with O(1) lookup
Company: Google
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Take-home Project
Quick Answer: This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling, within the Coding & Algorithms domain for a Data Scientist role, and primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.
Part 1: First Observed Next-Word Predictor
Constraints
- 0 <= len(sentences) <= 100000
- The total number of tokens across all sentences is at most 200000
- 0 <= len(queries) <= 100000
- A sentence may be empty or contain one token
- Each token is a non-empty string
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'like', 'ham', 'missing'])
Expected Output: ['am', 'I', 'green', None, None]
Explanation: The first observed followers are I -> am, Sam -> I, and like -> green. The word ham has no follower, and missing never appears.
Input: ([[], ['solo']], ['solo', 'anything'])
Expected Output: [None, None]
Explanation: There are no adjacent pairs at all, so every query returns None.
Input: ([['a', 'b', 'c'], ['a', 'd'], ['d']], ['a', 'b', 'd'])
Expected Output: ['b', 'c', None]
Explanation: The first observed follower of a is b. The follower of b is c. The word d never has a following token.
Input: ([['one', 'two']], [])
Expected Output: []
Explanation: No queries means the returned list is empty.
Hints
- Only adjacent words inside the same sentence matter.
- To keep the result deterministic, store a next word only the first time you see a word followed by something.
Part 2: O(1) Lookup for Most Frequent Next Word
Constraints
- 0 <= len(sentences) <= 100000
- The total number of tokens across all sentences is at most 200000
- 0 <= len(queries) <= 100000
- A sentence may be empty or contain one token
- Each token is a non-empty string
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'am', 'ham', 'missing'])
Expected Output: ['am', 'I', 'Sam', None, None]
Explanation: I is followed by am twice and like once, so predict am. Sam is followed by I once. The word ham has no follower.
Input: ([['a', 'x'], ['a', 'y'], ['a', 'x'], ['a', 'y']], ['a', 'x', 'y'])
Expected Output: ['x', None, None]
Explanation: For a, both x and y occur twice, so the tie is broken by first appearance. x was seen first.
Input: ([[], ['solo']], ['solo', 'anything'])
Expected Output: [None, None]
Explanation: No word is ever followed by another token.
Input: ([['go', 'left', 'go', 'right', 'go', 'left']], ['go', 'left', 'right'])
Expected Output: ['left', 'go', 'go']
Explanation: go is followed by left twice and right once. left is followed by go once, and right is followed by go once.
Hints
- If every query must be O(1), do not wait until query time to choose the best follower.
- While training, track both the count of each follower and the tie-break information needed to know which follower is currently best.
Part 3: Weighted Random Next-Word Prediction
Constraints
- 0 <= len(sentences) <= 100000
- The total number of tokens across all sentences is at most 200000
- 0 <= len(queries) <= 100000
- A sentence may be empty or contain one token
- Each token is a non-empty string
- Each query uses a float r such that 0 <= r < 1
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], [('I', 0.0), ('I', 0.8), ('Sam', 0.2), ('ham', 0.3), ('missing', 0.1)])
Expected Output: ['am', 'like', 'I', None, None]
Explanation: For I, the follower counts are am: 2 and like: 1, in that order. So [0, 2/3) maps to am and [2/3, 1) maps to like.
Input: ([['a', 'x'], ['a', 'y'], ['a', 'y'], ['a', 'x']], [('a', 0.49), ('a', 0.5), ('x', 0.1)])
Expected Output: ['x', 'y', None]
Explanation: After a, x and y both have count 2. Because x appeared first, the intervals are x on [0, 0.5) and y on [0.5, 1).
Input: ([[], ['solo']], [('solo', 0.2), ('anything', 0.9)])
Expected Output: [None, None]
Explanation: No adjacent pairs exist, so every query returns None.
Input: ([['go', 'left', 'go', 'right', 'go', 'left']], [('go', 0.2), ('go', 0.9), ('left', 0.0), ('right', 0.0)])
Expected Output: ['left', 'right', 'go', 'go']
Explanation: For go, left has count 2 and right has count 1, so left covers the first two thirds of the interval and right covers the last third.
Hints
- Weighted random selection can be modeled as choosing an interval on [0, 1).
- Build cumulative counts for each word, then locate where r lands among those cumulative totals.