How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a easy difficulty Coding & Algorithms question, commonly asked during Take-home Project rounds at Google.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Google during technical interviews.

Build next-word predictor with O(1) lookup

Quick Overview

This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling, within the Coding & Algorithms domain for a Data Scientist role, and primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.

Build next-word predictor with O(1) lookup

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

## Problem You are given a training corpus where each training example is a tokenized sentence (array of words). Example training sentences: - `["I", "am", "Sam"]` - `["Sam", "I", "am"]` - `["I", "like", "green", "eggs", "and", "ham"]` Implement two functions: 1. `train(sentences)` 2. `predict(word)` ### Base requirement After calling `train`, `predict(word)` should return a *possible* next word that appears immediately after `word` somewhere in the training data. - If `word` never appears, or it never has a following word (e.g., it only appears as the last token), define and document a reasonable behavior (e.g., return `None`). ### Follow-up 1 (time complexity) A straightforward solution stores `word -> {next_word: count}` and, during `predict`, scans candidates to decide what to return (time complexity `O(k)` where `k` is the number of distinct next-words). Optimize so that **each `predict(word)` call is `O(1)` time** (not counting the cost of returning the string), by doing more work in `train`. ### Follow-up 2 (probabilistic prediction) Change `predict(word)` so it returns a next word **randomly according to empirical frequencies** from training. Example: for the word `"I"`, if the training data implies: - `I -> am` occurs 2 times - `I -> like` occurs 1 time then `predict("I")` should return: - `"am"` with probability `2/3` - `"like"` with probability `1/3` ### Notes / assumptions to clarify - Tokenization is already done (input is arrays of strings). - Punctuation/casing rules can be assumed consistent. - You may assume training is called once, then predict is called many times (so preprocessing is worthwhile). - State whether `predict` must be deterministic (it should not be for follow-up 2).

Quick Answer: This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling, within the Coding & Algorithms domain for a Data Scientist role, and primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.

Part 1: First Observed Next-Word Predictor

You are given a training corpus as a list of tokenized sentences and a list of query words. Build a next-word predictor by scanning the training sentences in order, left to right. For every adjacent pair word -> next_word, remember the first next_word ever seen after that word. For each query, return that remembered next word. If a word never appears with a following token, return None. This is a deterministic version of train once and predict many times.

Constraints

0 <= len(sentences) <= 100000
The total number of tokens across all sentences is at most 200000

Loading coding console...

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'like', 'ham', 'missing'])

Expected Output: ['am', 'I', 'green', None, None]

Explanation: The first observed followers are I -> am, Sam -> I, and like -> green. The word ham has no follower, and missing never appears.

Input: ([[], ['solo']], ['solo', 'anything'])

Expected Output: [None, None]

Explanation: There are no adjacent pairs at all, so every query returns None.

Input: ([['a', 'b', 'c'], ['a', 'd'], ['d']], ['a', 'b', 'd'])

Expected Output: ['b', 'c', None]

Explanation: The first observed follower of a is b. The follower of b is c. The word d never has a following token.

Input: ([['one', 'two']], [])

Expected Output: []

Explanation: No queries means the returned list is empty.

Part 2: O(1) Lookup for Most Frequent Next Word

You are given tokenized training sentences and query words. For each word, predict the immediate next word that appears most often after it in the training corpus. If several next words tie for the highest frequency, return the one whose first occurrence after that word appeared earliest in training order. Preprocess during training so that every query can be answered in O(1) time with a direct lookup.

Constraints

0 <= len(sentences) <= 100000
The total number of tokens across all sentences is at most 200000
0 <= len(queries) <= 100000
A sentence may be empty or contain one token
Each token is a non-empty string

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'Sam', 'am', 'ham', 'missing'])

Expected Output: ['am', 'I', 'Sam', None, None]

Explanation: I is followed by am twice and like once, so predict am. Sam is followed by I once. The word ham has no follower.

Input: ([['a', 'x'], ['a', 'y'], ['a', 'x'], ['a', 'y']], ['a', 'x', 'y'])

Expected Output: ['x', None, None]

Explanation: For a, both x and y occur twice, so the tie is broken by first appearance. x was seen first.

Input: ([[], ['solo']], ['solo', 'anything'])

Expected Output: [None, None]

Explanation: No word is ever followed by another token.

Input: ([['go', 'left', 'go', 'right', 'go', 'left']], ['go', 'left', 'right'])

Expected Output: ['left', 'go', 'go']

Explanation: go is followed by left twice and right once. left is followed by go once, and right is followed by go once.

Hints

If every query must be O(1), do not wait until query time to choose the best follower.
While training, track both the count of each follower and the tie-break information needed to know which follower is currently best.

Part 3: Weighted Random Next-Word Prediction

You are given tokenized training sentences and queries of the form (word, r), where 0 <= r < 1 is a pre-generated random number. For each word, count how often each immediate follower appears. Keep the distinct followers in the order they were first seen after that word. If the follower counts are c1, c2, ..., ck with total T, then choose follower i when r falls in the interval [sum of previous counts / T, sum through i / T). Return None if the word never appears with a follower. This deterministic interface simulates random prediction according to empirical frequencies.

Constraints

0 <= len(sentences) <= 100000
The total number of tokens across all sentences is at most 200000
0 <= len(queries) <= 100000
A sentence may be empty or contain one token
Each token is a non-empty string
Each query uses a float r such that 0 <= r < 1

Examples

Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], [('I', 0.0), ('I', 0.8), ('Sam', 0.2), ('ham', 0.3), ('missing', 0.1)])

Expected Output: ['am', 'like', 'I', None, None]

Explanation: For I, the follower counts are am: 2 and like: 1, in that order. So [0, 2/3) maps to am and [2/3, 1) maps to like.

Input: ([['a', 'x'], ['a', 'y'], ['a', 'y'], ['a', 'x']], [('a', 0.49), ('a', 0.5), ('x', 0.1)])

Expected Output: ['x', 'y', None]

Explanation: After a, x and y both have count 2. Because x appeared first, the intervals are x on [0, 0.5) and y on [0.5, 1).

Input: ([[], ['solo']], [('solo', 0.2), ('anything', 0.9)])

Expected Output: [None, None]

Explanation: No adjacent pairs exist, so every query returns None.

Input: ([['go', 'left', 'go', 'right', 'go', 'left']], [('go', 0.2), ('go', 0.9), ('left', 0.0), ('right', 0.0)])

Expected Output: ['left', 'right', 'go', 'go']

Explanation: For go, left has count 2 and right has count 1, so left covers the first two thirds of the interval and right covers the last third.

Hints

Weighted random selection can be modeled as choosing an interval on [0, 1).
Build cumulative counts for each word, then locate where r lands among those cumulative totals.