Build a Next-Word Predictor
Company: Google
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Take-home Project
Quick Answer: This question evaluates understanding of n-gram language modeling, probabilistic reasoning, frequency counting, and data structure design for efficient lookup, focusing on coding and algorithms applied to a natural language processing task.
Part 1: Basic Next-Word Predictor
Constraints
- 0 <= len(sentences) <= 100000
- 0 <= total number of tokens across all sentences <= 200000
- 0 <= len(queries) <= 200000
- Each token is a non-empty string
- Only adjacent pairs inside the same sentence count
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'ham', 'green', 'missing'])
Expected Output: ['am', 'I', None, 'eggs', None]
Explanation: 'I' is followed by 'am' twice and 'like' once. 'Sam' is followed by 'I' once. 'ham' has no next word.
Input: ([['a', 'b'], ['a', 'c']], ['a'])
Expected Output: ['b']
Explanation: 'b' and 'c' both occur once after 'a', so the lexicographically smaller word is chosen.
Input: ([['solo'], []], ['solo', 'missing'])
Expected Output: [None, None]
Explanation: 'solo' appears only as a complete one-word sentence, so it has no outgoing pair.
Input: ([['x', 'y', 'x']], [])
Expected Output: []
Explanation: No queries means no predictions to return.
Hints
- Use a hash map of hash maps to count how many times each word is followed by another word.
- For each query, scan that word's follower counts and keep the highest count; on ties, choose the smaller word lexicographically.
Part 2: O(1) Average-Time Next-Word Queries
Constraints
- 0 <= len(sentences) <= 100000
- 0 <= total number of tokens across all sentences <= 200000
- 0 <= len(queries) <= 200000
- Each token is a non-empty string
- Average O(1) query time should come from hash-map lookups after preprocessing
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'ham'])
Expected Output: ['am', 'I', None]
Explanation: Same correctness rules as the base version.
Input: ([['a', 'c'], ['a', 'b'], ['a', 'b'], ['a', 'c'], ['c', 'd']], ['a', 'c'])
Expected Output: ['b', 'd']
Explanation: After training, 'a' has 'b':2 and 'c':2, so the tie is broken in favor of 'b'.
Input: ([['solo'], ['x', 'y', 'x']], ['solo', 'y', 'missing', 'x'])
Expected Output: [None, 'x', None, 'y']
Explanation: 'solo' has no outgoing pair. In ['x','y','x'], the pairs are x->y and y->x.
Input: ([], ['anything'])
Expected Output: [None]
Explanation: With no training data, every query has no prediction.
Hints
- While counting pairs, also store the current best next word for each source word.
- When a pair count increases, compare it against the stored best count for that source word; on equal counts, apply the lexicographic tie-breaker.
Part 3: Randomized Next-Word Predictor by Empirical Frequency
Constraints
- 0 <= len(sentences) <= 100000
- 0 <= total number of tokens across all sentences <= 200000
- 0 <= len(queries) <= 200000
- 0 <= seed < 2147483648
- Each token is a non-empty string
- Candidate next words must be ordered lexicographically before sampling
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'I', 'Sam', 'ham'], 1)
Expected Output: ['am', 'am', 'I', None]
Explanation: For 'I', the distribution is {'am': 2, 'like': 1}. Using the required generator with seed 1 produces 'am', then 'am', then 'I' for 'Sam'.
Input: ([['a', 'c'], ['a', 'b']], ['a', 'a', 'a'], 0)
Expected Output: ['c', 'b', 'c']
Explanation: 'a' has one count each for 'b' and 'c'. With lexicographic order ['b', 'c'] and the specified RNG, the sampled sequence is c, b, c.
Input: ([['solo']], ['solo', 'x'], 5)
Expected Output: [None, None]
Explanation: Neither query has any outgoing pair, so both return None and the RNG state never advances.
Input: ([['a', 'b'], ['a', 'c']], ['x', 'a', 'a'], 0)
Expected Output: [None, 'c', 'b']
Explanation: The first query has no prediction, so the seed is unchanged before the next two weighted samples.
Hints
- For each source word, store follower counts and convert them into cumulative ranges.
- Given r in [0, total_count - 1], find the first cumulative count that is greater than r.