Predict most likely next word from training data

Q: Predict most likely next word from training data

This is a Coding & Algorithms interview question from Bloomberg for Software Engineer roles. View the full question and solution on PracHub.

Q: How do I approach Coding & Algorithms interview questions?

Coding & Algorithms questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master coding & algorithms interviews.

Question

Loading...

You are given tokenized training data consisting of multiple sentences (each sentence is a list of words). Build a lightweight model/data structure that can answer queries of the form: given a word w, return the most likely next word that follows w in the training data.

Example training data:

training_data = [["I", "am", "sam"], ["am", "sam"]]

From this data:

Query "I" → output "am"
Query "am" → output "sam"

Requirements

Training: Process training_data once to build the model.
Query: next_word(w) should return the most frequent word that immediately follows w across all sentences.
If w never appears followed by another word (e.g., w only appears at sentence end or never appears), return a sentinel such as null /empty string.
If there is a tie for most frequent next word, you may return any tied word (or specify a deterministic tie-break such as lexicographically smallest).

Follow-ups (discuss tradeoffs)

Can you do better than O(N) space in terms of total tokens N in the training set? Under what assumptions?
What other data structures could you use (e.g., hash maps vs. tries vs. compressed representations)?
Approximately how many unique words / transitions can you store given a memory budget (e.g., explain what dominates memory usage)?
How would you extend this to autocomplete an entire sentence , e.g., repeatedly predict the next word until an end-of-sentence token or a max length is reached?

Predict most likely next word from training data

Requirements

Follow-ups (discuss tradeoffs)

Comments (0)