You are given a sequence of tokens (words) representing a corpus, e.g.:
tokens = [w0, w1, w2, ..., wK-1]
You need to build a next-word predictor that supports queries of the form:
predict(word) -> next_word
Where next_word is the word that most frequently follows word in the corpus.
More precisely:
(tokens[i], tokens[i+1])
, count it as one occurrence of “
tokens[i+1]
follows
tokens[i]
”.
word = x
, return the
y
that maximizes
count(x, y)
.
x
never appears or never has a following word (e.g., it only appears as the last token), return a sentinel such as
null
/ empty string.
Design the data structures and implement:
build(tokens)
preprocessing step.
predict(word)
query.