How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a easy difficulty Coding & Algorithms question, commonly asked during Take-home Project rounds at Google.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Google during technical interviews.

Build a Next-Word Predictor | Google Coding Question

Quick Overview

This question evaluates understanding of n-gram language modeling, probabilistic reasoning, frequency counting, and data structure design for efficient lookup, focusing on coding and algorithms applied to a natural language processing task.

Build a Next-Word Predictor

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

Implement a simple next-word model over tokenized training sentences. You need to write two functions: 1. `train(sentences)`: receives a list of tokenized sentences, for example: - `["I", "am", "Sam"]` - `["Sam", "I", "am"]` - `["I", "like", "green", "eggs", "and", "ham"]` 2. `predict(word)`: given a word, return the most likely word that immediately follows it in the training corpus. Assume the model is based only on adjacent word pairs observed in training. For the example above, `predict("I")` should return `"am"` in the base version because `I -> am` appears twice while `I -> like` appears once. Clarify and handle reasonable edge cases such as words never seen in training, words that only appear at the end of a sentence, and ties. Follow-up 1: In a naive implementation, `predict(word)` may take `O(k)` time where `k` is the number of distinct next words seen after that word. How would you redesign training so that prediction becomes `O(1)` average time? Follow-up 2: Modify `predict(word)` so it returns a random next word according to the empirical distribution of observed next words. For example, if `I -> am` occurs 2 times and `I -> like` occurs 1 time, then `predict("I")` should return `"am"` with probability `2/3` and `"like"` with probability `1/3`.

Quick Answer: This question evaluates understanding of n-gram language modeling, probabilistic reasoning, frequency counting, and data structure design for efficient lookup, focusing on coding and algorithms applied to a natural language processing task.

Part 1: Basic Next-Word Predictor

You are given tokenized training sentences and a list of query words. Build a simple next-word model using only adjacent word pairs within the same sentence. For each query word, return the most frequent word that immediately follows it in the training data. If a queried word never appears with a following word, return None. This includes words never seen in training and words that only appear at the end of sentences. If multiple next words are tied for highest frequency, return the lexicographically smallest one. Implement everything inside one function: train on the given sentences, then answer all queries.

Constraints

0 <= len(sentences) <= 100000
0 <= total number of tokens across all sentences <= 200000
0 <= len(queries) <= 200000
Each token is a non-empty string
Only adjacent pairs inside the same sentence count

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'ham', 'green', 'missing'])

Loading coding console...

Part 2: O(1) Average-Time Next-Word Queries

You are given tokenized training sentences and many query words. Build a next-word predictor using adjacent word pairs, but redesign preprocessing so each prediction can be answered in O(1) average time after training. Return the same answers as in the basic version: - For each query word, return the most frequent word that immediately follows it. - If the word has no recorded next word, return None. - If multiple next words have the same highest frequency, return the lexicographically smallest one. Your function should preprocess the training data and then answer all queries efficiently without scanning every possible next word during each query.

Constraints

0 <= len(sentences) <= 100000
0 <= total number of tokens across all sentences <= 200000
0 <= len(queries) <= 200000
Each token is a non-empty string
Average O(1) query time should come from hash-map lookups after preprocessing

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'ham'])

Expected Output: ['am', 'I', None]

Explanation: Same correctness rules as the base version.

Input: ([['a', 'c'], ['a', 'b'], ['a', 'b'], ['a', 'c'], ['c', 'd']], ['a', 'c'])

Expected Output: ['b', 'd']

Explanation: After training, 'a' has 'b':2 and 'c':2, so the tie is broken in favor of 'b'.

Input: ([['solo'], ['x', 'y', 'x']], ['solo', 'y', 'missing', 'x'])

Expected Output: [None, 'x', None, 'y']

Explanation: 'solo' has no outgoing pair. In ['x','y','x'], the pairs are x->y and y->x.

Input: ([], ['anything'])

Expected Output: [None]

Explanation: With no training data, every query has no prediction.

Hints

While counting pairs, also store the current best next word for each source word.
When a pair count increases, compare it against the stored best count for that source word; on equal counts, apply the lexicographic tie-breaker.

Part 3: Randomized Next-Word Predictor by Empirical Frequency

You are given tokenized training sentences and query words. Build a next-word model using adjacent word pairs within the same sentence. For each query, return a random next word according to the empirical distribution observed in training. Example: if 'I' is followed by 'am' 2 times and 'like' 1 time, then predictions from 'I' should choose 'am' with probability 2/3 and 'like' with probability 1/3. To make the output deterministic for testing, use this exact seeded pseudo-random process: 1. Maintain an integer state, initially equal to seed. 2. For each query word that has at least one possible next word, update: state = (1103515245 * state + 12345) % 2147483648 3. Let r = state % total_count(word), where total_count(word) is the total number of observed followers of that word. 4. Sort the candidate next words for that word lexicographically, build cumulative counts in that order, and return the first word whose cumulative count is greater than r. 5. If a query word has no possible next word, return None and do not advance the state. Implement everything inside one function.

Constraints

0 <= len(sentences) <= 100000
0 <= total number of tokens across all sentences <= 200000
0 <= len(queries) <= 200000
0 <= seed < 2147483648
Each token is a non-empty string
Candidate next words must be ordered lexicographically before sampling

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'I', 'Sam', 'ham'], 1)

Expected Output: ['am', 'am', 'I', None]

Explanation: For 'I', the distribution is {'am': 2, 'like': 1}. Using the required generator with seed 1 produces 'am', then 'am', then 'I' for 'Sam'.

Input: ([['a', 'c'], ['a', 'b']], ['a', 'a', 'a'], 0)

Expected Output: ['c', 'b', 'c']

Explanation: 'a' has one count each for 'b' and 'c'. With lexicographic order ['b', 'c'] and the specified RNG, the sampled sequence is c, b, c.

Input: ([['solo']], ['solo', 'x'], 5)

Expected Output: [None, None]

Explanation: Neither query has any outgoing pair, so both return None and the RNG state never advances.

Input: ([['a', 'b'], ['a', 'c']], ['x', 'a', 'a'], 0)

Expected Output: [None, 'c', 'b']

Explanation: The first query has no prediction, so the seed is unchanged before the next two weighted samples.

Hints

For each source word, store follower counts and convert them into cumulative ranges.
Given r in [0, total_count - 1], find the first cumulative count that is greater than r.