You are given tokenized training data consisting of multiple sentences (each sentence is a list of words). Build a lightweight model/data structure that can answer queries of the form: given a word w, return the most likely next word that follows w in the training data.
Example training data:
training_data = [["I", "am", "sam"], ["am", "sam"]]
From this data:
-
Query
"I"
→ output
"am"
-
Query
"am"
→ output
"sam"
Requirements
-
Training:
Process
training_data
once to build the model.
-
Query:
next_word(w)
should return the most frequent word that immediately follows
w
across all sentences.
-
If
w
never appears followed by another word (e.g.,
w
only appears at sentence end or never appears), return a sentinel such as
null
/empty string.
-
If there is a tie for most frequent next word, you may return any tied word (or specify a deterministic tie-break such as lexicographically smallest).
Follow-ups (discuss tradeoffs)
-
Can you do better than
O(N)
space in terms of total tokens
N
in the training set? Under what assumptions?
-
What other data structures could you use (e.g., hash maps vs. tries vs. compressed representations)?
-
Approximately how many unique words / transitions can you store given a memory budget (e.g., explain what dominates memory usage)?
-
How would you extend this to
autocomplete an entire sentence
, e.g., repeatedly predict the next word until an end-of-sentence token or a max length is reached?