This question evaluates understanding of n-gram (specifically bigram) language modeling, probabilistic reasoning, frequency counting, and data-structure design for efficient lookup, framed as a coding and algorithms exercise on a natural language processing task.
Implement a simple next-word model over tokenized training sentences.
You need to write two functions:
train(sentences): receives a list of tokenized sentences, for example:

    [["I", "am", "Sam"],
     ["Sam", "I", "am"],
     ["I", "like", "green", "eggs", "and", "ham"]]

predict(word): given a word, return the most likely word that immediately follows it in the training corpus.
Assume the model is based only on adjacent word pairs observed in training. For the example above, predict("I") should return "am" in the base version, because the pair I -> am appears twice while I -> like appears only once.
Clarify and handle reasonable edge cases such as words never seen in training, words that only appear at the end of a sentence, and ties.
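A minimal sketch of the base version (class and variable names are illustrative, not prescribed by the question), using a nested count table and returning None for the edge cases above:

```python
from collections import Counter, defaultdict

class BigramModel:
    def __init__(self):
        # next_counts[w] maps each observed follower of w to its count
        self.next_counts = defaultdict(Counter)

    def train(self, sentences):
        for sentence in sentences:
            for prev, nxt in zip(sentence, sentence[1:]):
                self.next_counts[prev][nxt] += 1

    def predict(self, word):
        # Unseen words, and words seen only sentence-finally, have no followers
        counts = self.next_counts.get(word)
        if not counts:
            return None
        # Counter.most_common breaks ties by first-insertion order; any
        # deterministic tie rule is fine as long as it is stated explicitly
        return counts.most_common(1)[0][0]

model = BigramModel()
model.train([["I", "am", "Sam"],
             ["Sam", "I", "am"],
             ["I", "like", "green", "eggs", "and", "ham"]])
print(model.predict("I"))    # "am" (count 2 vs. 1 for "like")
print(model.predict("ham"))  # None: "ham" only appears sentence-finally
```

Returning None for missing followers is one reasonable convention; raising an exception or returning a sentinel token would also be defensible if stated up front.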
Follow-up 1: In a naive implementation, predict(word) may take O(k) time where k is the number of distinct next words seen after that word. How would you redesign training so that prediction becomes O(1) average time?
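One way to answer this follow-up is to maintain the argmax incrementally during training, so prediction reduces to a single hash lookup. A sketch under that approach (names are illustrative):

```python
from collections import Counter, defaultdict

class FastBigramModel:
    def __init__(self):
        self.next_counts = defaultdict(Counter)
        self.best_next = {}  # word -> its currently most frequent follower

    def train(self, sentences):
        for sentence in sentences:
            for prev, nxt in zip(sentence, sentence[1:]):
                counts = self.next_counts[prev]
                counts[nxt] += 1
                # Update the cached argmax in O(1) per observed pair:
                # only the pair just incremented can overtake the leader
                best = self.best_next.get(prev)
                if best is None or counts[nxt] > counts[best]:
                    self.best_next[prev] = nxt

    def predict(self, word):
        # One dictionary lookup: O(1) average time
        return self.best_next.get(word)

model = FastBigramModel()
model.train([["I", "am", "Sam"],
             ["Sam", "I", "am"],
             ["I", "like", "green", "eggs", "and", "ham"]])
print(model.predict("I"))  # "am"
```

This shifts work from query time to training time; ties resolve to whichever follower reached the top count first, which is worth calling out explicitly in an interview.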
Follow-up 2: Modify predict(word) so it returns a random next word according to the empirical distribution of observed next words. For example, if I -> am occurs 2 times and I -> like occurs 1 time, then predict("I") should return "am" with probability 2/3 and "like" with probability 1/3.
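A sketch of the sampling variant, drawing from the empirical next-word distribution with `random.choices` (which weights outcomes proportionally to the given counts); the names are illustrative:

```python
import random
from collections import Counter, defaultdict

class SamplingBigramModel:
    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, sentences):
        for sentence in sentences:
            for prev, nxt in zip(sentence, sentence[1:]):
                self.next_counts[prev][nxt] += 1

    def predict(self, word):
        counts = self.next_counts.get(word)
        if not counts:
            return None
        words = list(counts.keys())
        # Sampling with weights equal to the raw counts is exactly the
        # empirical distribution: P(next | word) = count / total
        return random.choices(words, weights=list(counts.values()), k=1)[0]

model = SamplingBigramModel()
model.train([["I", "am", "Sam"],
             ["Sam", "I", "am"],
             ["I", "like", "green", "eggs", "and", "ham"]])
print(model.predict("I"))  # "am" with probability 2/3, "like" with 1/3
```

If O(1) sampling per query is also required, a cumulative-count array per word plus binary search (or the alias method) is the standard refinement.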