PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling, within the Coding & Algorithms domain for a Data Scientist role, and primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.

  • easy
  • Google
  • Coding & Algorithms
  • Data Scientist

Build next-word predictor with O(1) lookup

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

## Problem You are given a training corpus where each training example is a tokenized sentence (array of words). Example training sentences: - `["I", "am", "Sam"]` - `["Sam", "I", "am"]` - `["I", "like", "green", "eggs", "and", "ham"]` Implement two functions: 1. `train(sentences)` 2. `predict(word)` ### Base requirement After calling `train`, `predict(word)` should return a *possible* next word that appears immediately after `word` somewhere in the training data. - If `word` never appears, or it never has a following word (e.g., it only appears as the last token), define and document a reasonable behavior (e.g., return `None`). ### Follow-up 1 (time complexity) A straightforward solution stores `word -> {next_word: count}` and, during `predict`, scans candidates to decide what to return (time complexity `O(k)` where `k` is the number of distinct next-words). Optimize so that **each `predict(word)` call is `O(1)` time** (not counting the cost of returning the string), by doing more work in `train`. ### Follow-up 2 (probabilistic prediction) Change `predict(word)` so it returns a next word **randomly according to empirical frequencies** from training. Example: for the word `"I"`, if the training data implies: - `I -> am` occurs 2 times - `I -> like` occurs 1 time then `predict("I")` should return: - `"am"` with probability `2/3` - `"like"` with probability `1/3` ### Notes / assumptions to clarify - Tokenization is already done (input is arrays of strings). - Punctuation/casing rules can be assumed consistent. - You may assume training is called once, then predict is called many times (so preprocessing is worthwhile). - State whether `predict` must be deterministic (it should not be for follow-up 2).

Quick Answer: This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling, within the Coding & Algorithms domain for a Data Scientist role, and primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.

Part 1: First Observed Next-Word Predictor

You are given a training corpus as a list of tokenized sentences and a list of query words. Build a next-word predictor by scanning the training sentences in order, left to right. For every adjacent pair word -> next_word, remember the first next_word ever seen after that word. For each query, return that remembered next word. If a word never appears with a following token, return None. This is a deterministic version of train once and predict many times.

Constraints

  • 0 <= len(sentences) <= 100000
  • The total number of tokens across all sentences is at most 200000
  • 0 <= len(queries) <= 100000
  • A sentence may be empty or contain one token
  • Each token is a non-empty string

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'like', 'ham', 'missing'])

Expected Output: ['am', 'I', 'green', None, None]

Explanation: The first observed followers are I -> am, Sam -> I, and like -> green. The word ham has no follower, and missing never appears.

Input: ([[], ['solo']], ['solo', 'anything'])

Expected Output: [None, None]

Explanation: There are no adjacent pairs at all, so every query returns None.

Input: ([['a', 'b', 'c'], ['a', 'd'], ['d']], ['a', 'b', 'd'])

Expected Output: ['b', 'c', None]

Explanation: The first observed follower of a is b. The follower of b is c. The word d never has a following token.

Input: ([['one', 'two']], [])

Expected Output: []

Explanation: No queries means the returned list is empty.

Hints

  1. Only adjacent words inside the same sentence matter.
  2. To keep the result deterministic, store a next word only the first time you see a word followed by something.

Part 2: O(1) Lookup for Most Frequent Next Word

You are given tokenized training sentences and query words. For each word, predict the immediate next word that appears most often after it in the training corpus. If several next words tie for the highest frequency, return the one whose first occurrence after that word appeared earliest in training order. Preprocess during training so that every query can be answered in O(1) time with a direct lookup.

Constraints

  • 0 <= len(sentences) <= 100000
  • The total number of tokens across all sentences is at most 200000
  • 0 <= len(queries) <= 100000
  • A sentence may be empty or contain one token
  • Each token is a non-empty string

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'am', 'ham', 'missing'])

Expected Output: ['am', 'I', 'Sam', None, None]

Explanation: I is followed by am twice and like once, so predict am. Sam is followed by I once. The word ham has no follower.

Input: ([['a', 'x'], ['a', 'y'], ['a', 'x'], ['a', 'y']], ['a', 'x', 'y'])

Expected Output: ['x', None, None]

Explanation: For a, both x and y occur twice, so the tie is broken by first appearance. x was seen first.

Input: ([[], ['solo']], ['solo', 'anything'])

Expected Output: [None, None]

Explanation: No word is ever followed by another token.

Input: ([['go', 'left', 'go', 'right', 'go', 'left']], ['go', 'left', 'right'])

Expected Output: ['left', 'go', 'go']

Explanation: go is followed by left twice and right once. left is followed by go once, and right is followed by go once.

Hints

  1. If every query must be O(1), do not wait until query time to choose the best follower.
  2. While training, track both the count of each follower and the tie-break information needed to know which follower is currently best.

Part 3: Weighted Random Next-Word Prediction

You are given tokenized training sentences and queries of the form (word, r), where 0 <= r < 1 is a pre-generated random number. For each word, count how often each immediate follower appears. Keep the distinct followers in the order they were first seen after that word. If the follower counts are c1, c2, ..., ck with total T, then choose follower i when r falls in the interval [sum of previous counts / T, sum through i / T). Return None if the word never appears with a follower. This deterministic interface simulates random prediction according to empirical frequencies.

Constraints

  • 0 <= len(sentences) <= 100000
  • The total number of tokens across all sentences is at most 200000
  • 0 <= len(queries) <= 100000
  • A sentence may be empty or contain one token
  • Each token is a non-empty string
  • Each query uses a float r such that 0 <= r < 1

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], [('I', 0.0), ('I', 0.8), ('Sam', 0.2), ('ham', 0.3), ('missing', 0.1)])

Expected Output: ['am', 'like', 'I', None, None]

Explanation: For I, the follower counts are am: 2 and like: 1, in that order. So [0, 2/3) maps to am and [2/3, 1) maps to like.

Input: ([['a', 'x'], ['a', 'y'], ['a', 'y'], ['a', 'x']], [('a', 0.49), ('a', 0.5), ('x', 0.1)])

Expected Output: ['x', 'y', None]

Explanation: After a, x and y both have count 2. Because x appeared first, the intervals are x on [0, 0.5) and y on [0.5, 1).

Input: ([[], ['solo']], [('solo', 0.2), ('anything', 0.9)])

Expected Output: [None, None]

Explanation: No adjacent pairs exist, so every query returns None.

Input: ([['go', 'left', 'go', 'right', 'go', 'left']], [('go', 0.2), ('go', 0.9), ('left', 0.0), ('right', 0.0)])

Expected Output: ['left', 'right', 'go', 'go']

Explanation: For go, left has count 2 and right has count 1, so left covers the first two thirds of the interval and right covers the last third.

Hints

  1. Weighted random selection can be modeled as choosing an interval on [0, 1).
  2. Build cumulative counts for each word, then locate where r lands among those cumulative totals.
Last updated: Apr 23, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Solve Rooms and Top-K Streams - Google (medium)
  • Find Containing Range - Google (medium)
  • Rearrange Tasks With Cooldown - Google (medium)
  • Implement Employee Management and Expression Evaluation - Google (medium)
  • Solve Three Array and Matrix Path Problems - Google (medium)