PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates proficiency in frequency-based language modeling and data-structure design, focusing on token sequence analysis, frequency counting, and handling boundary conditions.

  • medium
  • Bloomberg
  • Coding & Algorithms
  • Software Engineer

Predict most likely next word from training data

Company: Bloomberg

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

You are given tokenized training data consisting of multiple sentences (each sentence is a list of words). Build a lightweight model/data structure that can answer queries of the form: given a word `w`, return the **most likely next word** that follows `w` in the training data. Example training data: ```text training_data = [["I", "am", "sam"], ["am", "sam"]] ``` From this data: - Query `"I"` → output `"am"` - Query `"am"` → output `"sam"` ### Requirements 1. **Training:** Process `training_data` once to build the model. 2. **Query:** `next_word(w)` should return the most frequent word that immediately follows `w` across all sentences. 3. If `w` never appears followed by another word (e.g., `w` only appears at sentence end or never appears), return a sentinel such as `null`/empty string. 4. If there is a tie for most frequent next word, you may return any tied word (or specify a deterministic tie-break such as lexicographically smallest). ### Follow-ups (discuss tradeoffs) - Can you do better than **O(N)** space in terms of total tokens `N` in the training set? Under what assumptions? - What other data structures could you use (e.g., hash maps vs. tries vs. compressed representations)? - Approximately how many unique words / transitions can you store given a memory budget (e.g., explain what dominates memory usage)? - How would you extend this to **autocomplete an entire sentence**, e.g., repeatedly predict the next word until an end-of-sentence token or a max length is reached?

Quick Answer: This question evaluates proficiency in frequency-based language modeling and data-structure design, focusing on token sequence analysis, frequency counting, and handling boundary conditions.

You are given tokenized training data `training_data` — a list of sentences, where each sentence is a list of words (strings). Build a model (e.g., a bigram frequency map) from this data, then answer a batch of queries. Implement `solution(training_data, queries)` that returns a list of answers, one per word in `queries`. For each query word `w`, return the word that most frequently immediately follows `w` across all sentences in the training data. Rules: 1. Count, for every adjacent pair `(a, b)` within a sentence, how often `b` immediately follows `a`. 2. For a query word `w`, return the next word with the highest follow-count. 3. If `w` never appears immediately followed by another word (it only appears at the end of sentences, or never appears at all), return the empty string `""` as the sentinel. 4. Tie-break: if two or more candidate next words share the highest count, return the lexicographically smallest one (so the answer is deterministic). Example: `training_data = [["I", "am", "sam"], ["am", "sam"]]`, `queries = ["I", "am", "sam"]` returns `["am", "sam", ""]` — `"I"` is followed by `"am"`, `"am"` is followed by `"sam"` (twice), and `"sam"` never has a following word.

Constraints

  • 0 <= number of sentences <= 10^4
  • 0 <= words per sentence <= 10^4
  • Words are non-empty lowercase/mixed-case ASCII strings
  • Sentinel for 'no next word' is the empty string ""
  • Ties resolved by lexicographically smallest next word

Examples

Input: ([["I", "am", "sam"], ["am", "sam"]], ["I", "am", "sam"])

Expected Output: ["am", "sam", ""]

Explanation: "I"->"am" (1), "am"->"sam" (2 total), "sam" never has a following word so it returns the sentinel.

Input: ([["the", "cat", "the", "dog", "the", "cat"]], ["the"])

Expected Output: ["cat"]

Explanation: "the" is followed by "cat" twice and "dog" once, so the most frequent next word is "cat".

Input: ([["a", "b"], ["a", "c"]], ["a"])

Expected Output: ["b"]

Explanation: "a"->"b" and "a"->"c" both have count 1; the tie is broken lexicographically, so "b" wins.

Input: ([], ["x"])

Expected Output: [""]

Explanation: Empty training data yields no transitions, so the query returns the empty-string sentinel.

Input: ([["hello"]], ["hello", "world"])

Expected Output: ["", ""]

Explanation: "hello" appears only at sentence end (no following word) and "world" never appears, so both return the sentinel.

Input: ([["x", "y", "z", "y", "w"], ["x", "y"]], ["x", "y", "z", "w"])

Expected Output: ["y", "w", "y", ""]

Explanation: "x"->"y" (2). "y"->"z" (1) and "y"->"w" (1) tie, lexicographic pick is "w". "z"->"y" (1). "w" is only at an end, so it returns the sentinel.

Hints

  1. Make one pass over every sentence and, for each adjacent pair (a, b), increment a nested count map: counts[a][b] += 1.
  2. To answer a query for word w, scan counts[w] and pick the entry with the highest count; on equal counts prefer the lexicographically smaller word.
  3. If w isn't a key in your map (or it appears only at sentence ends), there are no transitions — return the empty string "".
Last updated: Jun 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Minimize Travel Assignment Cost - Bloomberg (medium)
  • Determine Balloon Popping Time - Bloomberg (medium)
  • Solve meeting and tree problems - Bloomberg (easy)
  • Minimize travel cost with two cities - Bloomberg (easy)
  • Check connectivity between two subway stations - Bloomberg (easy)