How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Bloomberg.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Bloomberg during technical interviews.

Predict most likely next word from training data

Company: Bloomberg

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

You are given tokenized training data consisting of multiple sentences (each sentence is a list of words). Build a lightweight model/data structure that can answer queries of the form: given a word `w`, return the **most likely next word** that follows `w` in the training data. Example training data: ```text training_data = [["I", "am", "sam"], ["am", "sam"]] ``` From this data: - Query `"I"` → output `"am"` - Query `"am"` → output `"sam"` ### Requirements 1. **Training:** Process `training_data` once to build the model. 2. **Query:** `next_word(w)` should return the most frequent word that immediately follows `w` across all sentences. 3. If `w` never appears followed by another word (e.g., `w` only appears at sentence end or never appears), return a sentinel such as `null`/empty string. 4. If there is a tie for most frequent next word, you may return any tied word (or specify a deterministic tie-break such as lexicographically smallest). ### Follow-ups (discuss tradeoffs) - Can you do better than **O(N)** space in terms of total tokens `N` in the training set? Under what assumptions? - What other data structures could you use (e.g., hash maps vs. tries vs. compressed representations)? - Approximately how many unique words / transitions can you store given a memory budget (e.g., explain what dominates memory usage)? - How would you extend this to **autocomplete an entire sentence**, e.g., repeatedly predict the next word until an end-of-sentence token or a max length is reached?

Quick Answer: This question evaluates proficiency in frequency-based language modeling and data-structure design, focusing on token sequence analysis, frequency counting, and handling boundary conditions.

You are given tokenized training data `training_data` — a list of sentences, where each sentence is a list of words (strings). Build a model (e.g., a bigram frequency map) from this data, then answer a batch of queries. Implement `solution(training_data, queries)` that returns a list of answers, one per word in `queries`. For each query word `w`, return the word that most frequently immediately follows `w` across all sentences in the training data. Rules: 1. Count, for every adjacent pair `(a, b)` within a sentence, how often `b` immediately follows `a`. 2. For a query word `w`, return the next word with the highest follow-count. 3. If `w` never appears immediately followed by another word (it only appears at the end of sentences, or never appears at all), return the empty string `""` as the sentinel. 4. Tie-break: if two or more candidate next words share the highest count, return the lexicographically smallest one (so the answer is deterministic). Example: `training_data = [["I", "am", "sam"], ["am", "sam"]]`, `queries = ["I", "am", "sam"]` returns `["am", "sam", ""]` — `"I"` is followed by `"am"`, `"am"` is followed by `"sam"` (twice), and `"sam"` never has a following word.

Constraints

0 <= number of sentences <= 10^4
0 <= words per sentence <= 10^4
Words are non-empty lowercase/mixed-case ASCII strings
Sentinel for 'no next word' is the empty string ""
Ties resolved by lexicographically smallest next word

Examples

Input: ([["I", "am", "sam"], ["am", "sam"]], ["I", "am", "sam"])

Expected Output: ["am", "sam", ""]

Explanation: "I"->"am" (1), "am"->"sam" (2 total), "sam" never has a following word so it returns the sentinel.

Input: ([["the", "cat", "the", "dog", "the", "cat"]], ["the"])

Expected Output: ["cat"]

Explanation: "the" is followed by "cat" twice and "dog" once, so the most frequent next word is "cat".

Input: ([["a", "b"], ["a", "c"]], ["a"])

Expected Output: ["b"]

Explanation: "a"->"b" and "a"->"c" both have count 1; the tie is broken lexicographically, so "b" wins.

Input: ([], ["x"])

Expected Output: [""]

Explanation: Empty training data yields no transitions, so the query returns the empty-string sentinel.

Input: ([["hello"]], ["hello", "world"])

Expected Output: ["", ""]

Explanation: "hello" appears only at sentence end (no following word) and "world" never appears, so both return the sentinel.

Input: ([["x", "y", "z", "y", "w"], ["x", "y"]], ["x", "y", "z", "w"])

Expected Output: ["y", "w", "y", ""]

Explanation: "x"->"y" (2). "y"->"z" (1) and "y"->"w" (1) tie, lexicographic pick is "w". "z"->"y" (1). "w" is only at an end, so it returns the sentinel.

Hints

Make one pass over every sentence and, for each adjacent pair (a, b), increment a nested count map: counts[a][b] += 1.
To answer a query for word w, scan counts[w] and pick the entry with the highest count; on equal counts prefer the lexicographically smaller word.
If w isn't a key in your map (or it appears only at sentence ends), there are no transitions — return the empty string "".

Quick Overview

This question evaluates proficiency in frequency-based language modeling and data-structure design, focusing on token sequence analysis, frequency counting, and handling boundary conditions.