You are given a sequence of tokens (words) representing a corpus, e.g.:
tokens = [w0, w1, w2, ..., wK-1]
You need to build a next-word predictor that supports queries of the form:
-
predict(word) -> next_word
Where next_word is the word that most frequently follows word in the corpus.
More precisely:
-
For every adjacent pair
(tokens[i], tokens[i+1])
, count it as one occurrence of “
tokens[i+1]
follows
tokens[i]
”.
-
For a query
word = x
, return the
y
that maximizes
count(x, y)
.
-
If
x
never appears or never has a following word (e.g., it only appears as the last token), return a sentinel such as
null
/ empty string.
-
If there is a tie for max frequency, you may return any tied word (or specify a deterministic tie-break, e.g., lexicographically smallest).
Task
Design the data structures and implement:
-
A one-time
build(tokens)
preprocessing step.
-
A
predict(word)
query.
Constraints / expectations
-
The corpus is processed once (batch/offline build), then many queries may follow.
-
Aim for efficient preprocessing and fast queries.