PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of n-gram language modeling, probabilistic reasoning, frequency counting, and data structure design for efficient lookup, focusing on coding and algorithms applied to a natural language processing task.

  • easy
  • Google
  • Coding & Algorithms
  • Data Scientist

Build a Next-Word Predictor

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

Implement a simple next-word model over tokenized training sentences. You need to write two functions: 1. `train(sentences)`: receives a list of tokenized sentences, for example: - `["I", "am", "Sam"]` - `["Sam", "I", "am"]` - `["I", "like", "green", "eggs", "and", "ham"]` 2. `predict(word)`: given a word, return the most likely word that immediately follows it in the training corpus. Assume the model is based only on adjacent word pairs observed in training. For the example above, `predict("I")` should return `"am"` in the base version because `I -> am` appears twice while `I -> like` appears once. Clarify and handle reasonable edge cases such as words never seen in training, words that only appear at the end of a sentence, and ties. Follow-up 1: In a naive implementation, `predict(word)` may take `O(k)` time where `k` is the number of distinct next words seen after that word. How would you redesign training so that prediction becomes `O(1)` average time? Follow-up 2: Modify `predict(word)` so it returns a random next word according to the empirical distribution of observed next words. For example, if `I -> am` occurs 2 times and `I -> like` occurs 1 time, then `predict("I")` should return `"am"` with probability `2/3` and `"like"` with probability `1/3`.

Quick Answer: This question evaluates understanding of n-gram language modeling, probabilistic reasoning, frequency counting, and data structure design for efficient lookup, focusing on coding and algorithms applied to a natural language processing task.

Part 1: Basic Next-Word Predictor

You are given tokenized training sentences and a list of query words. Build a simple next-word model using only adjacent word pairs within the same sentence. For each query word, return the most frequent word that immediately follows it in the training data. If a queried word never appears with a following word, return None. This includes words never seen in training and words that only appear at the end of sentences. If multiple next words are tied for highest frequency, return the lexicographically smallest one. Implement everything inside one function: train on the given sentences, then answer all queries.

Constraints

  • 0 <= len(sentences) <= 100000
  • 0 <= total number of tokens across all sentences <= 200000
  • 0 <= len(queries) <= 200000
  • Each token is a non-empty string
  • Only adjacent pairs inside the same sentence count

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'ham', 'green', 'missing'])

Expected Output: ['am', 'I', None, 'eggs', None]

Explanation: 'I' is followed by 'am' twice and 'like' once. 'Sam' is followed by 'I' once. 'ham' has no next word.

Input: ([['a', 'b'], ['a', 'c']], ['a'])

Expected Output: ['b']

Explanation: 'b' and 'c' both occur once after 'a', so the lexicographically smaller word is chosen.

Input: ([['solo'], []], ['solo', 'missing'])

Expected Output: [None, None]

Explanation: 'solo' appears only as a complete one-word sentence, so it has no outgoing pair.

Input: ([['x', 'y', 'x']], [])

Expected Output: []

Explanation: No queries means no predictions to return.

Hints

  1. Use a hash map of hash maps to count how many times each word is followed by another word.
  2. For each query, scan that word's follower counts and keep the highest count; on ties, choose the smaller word lexicographically.

Part 2: O(1) Average-Time Next-Word Queries

You are given tokenized training sentences and many query words. Build a next-word predictor using adjacent word pairs, but redesign preprocessing so each prediction can be answered in O(1) average time after training. Return the same answers as in the basic version: - For each query word, return the most frequent word that immediately follows it. - If the word has no recorded next word, return None. - If multiple next words have the same highest frequency, return the lexicographically smallest one. Your function should preprocess the training data and then answer all queries efficiently without scanning every possible next word during each query.

Constraints

  • 0 <= len(sentences) <= 100000
  • 0 <= total number of tokens across all sentences <= 200000
  • 0 <= len(queries) <= 200000
  • Each token is a non-empty string
  • Average O(1) query time should come from hash-map lookups after preprocessing

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'ham'])

Expected Output: ['am', 'I', None]

Explanation: Same correctness rules as the base version.

Input: ([['a', 'c'], ['a', 'b'], ['a', 'b'], ['a', 'c'], ['c', 'd']], ['a', 'c'])

Expected Output: ['b', 'd']

Explanation: After training, 'a' has 'b':2 and 'c':2, so the tie is broken in favor of 'b'.

Input: ([['solo'], ['x', 'y', 'x']], ['solo', 'y', 'missing', 'x'])

Expected Output: [None, 'x', None, 'y']

Explanation: 'solo' has no outgoing pair. In ['x','y','x'], the pairs are x->y and y->x.

Input: ([], ['anything'])

Expected Output: [None]

Explanation: With no training data, every query has no prediction.

Hints

  1. While counting pairs, also store the current best next word for each source word.
  2. When a pair count increases, compare it against the stored best count for that source word; on equal counts, apply the lexicographic tie-breaker.

Part 3: Randomized Next-Word Predictor by Empirical Frequency

You are given tokenized training sentences and query words. Build a next-word model using adjacent word pairs within the same sentence. For each query, return a random next word according to the empirical distribution observed in training. Example: if 'I' is followed by 'am' 2 times and 'like' 1 time, then predictions from 'I' should choose 'am' with probability 2/3 and 'like' with probability 1/3. To make the output deterministic for testing, use this exact seeded pseudo-random process: 1. Maintain an integer state, initially equal to seed. 2. For each query word that has at least one possible next word, update: state = (1103515245 * state + 12345) % 2147483648 3. Let r = state % total_count(word), where total_count(word) is the total number of observed followers of that word. 4. Sort the candidate next words for that word lexicographically, build cumulative counts in that order, and return the first word whose cumulative count is greater than r. 5. If a query word has no possible next word, return None and do not advance the state. Implement everything inside one function.

Constraints

  • 0 <= len(sentences) <= 100000
  • 0 <= total number of tokens across all sentences <= 200000
  • 0 <= len(queries) <= 200000
  • 0 <= seed < 2147483648
  • Each token is a non-empty string
  • Candidate next words must be ordered lexicographically before sampling

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'I', 'Sam', 'ham'], 1)

Expected Output: ['am', 'am', 'I', None]

Explanation: For 'I', the distribution is {'am': 2, 'like': 1}. Using the required generator with seed 1 produces 'am', then 'am', then 'I' for 'Sam'.

Input: ([['a', 'c'], ['a', 'b']], ['a', 'a', 'a'], 0)

Expected Output: ['c', 'b', 'c']

Explanation: 'a' has one count each for 'b' and 'c'. With lexicographic order ['b', 'c'] and the specified RNG, the sampled sequence is c, b, c.

Input: ([['solo']], ['solo', 'x'], 5)

Expected Output: [None, None]

Explanation: Neither query has any outgoing pair, so both return None and the RNG state never advances.

Input: ([['a', 'b'], ['a', 'c']], ['x', 'a', 'a'], 0)

Expected Output: [None, 'c', 'b']

Explanation: The first query has no prediction, so the seed is unchanged before the next two weighted samples.

Hints

  1. For each source word, store follower counts and convert them into cumulative ranges.
  2. Given r in [0, total_count - 1], find the first cumulative count that is greater than r.
Last updated: Apr 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Solve Rooms and Top-K Streams - Google (medium)
  • Find Containing Range - Google (medium)
  • Rearrange Tasks With Cooldown - Google (medium)
  • Implement Employee Management and Expression Evaluation - Google (medium)
  • Solve Three Array and Matrix Path Problems - Google (medium)