You are given tokenized training data consisting of multiple sentences (each sentence is a list of words). Build a lightweight model/data structure that can answer queries of the form: given a word w, return the most likely next word that follows w in the training data.
Example training data:
training_data = [["I", "am", "sam"], ["am", "sam"]]
From this data:
"I"
→ output
"am"
"am"
→ output
"sam"
training_data
once to build the model.
next_word(w)
should return the most frequent word that immediately follows
w
across all sentences.
w
never appears followed by another word (e.g.,
w
only appears at sentence end or never appears), return a sentinel such as
null
/empty string.
N
in the training set? Under what assumptions?