This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling within the Coding & Algorithms domain for a Data Scientist role; it primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.
You are given a training corpus where each training example is a tokenized sentence (array of words). Example training sentences:
["I", "am", "Sam"]
["Sam", "I", "am"]
["I", "like", "green", "eggs", "and", "ham"]
Implement two functions:
train(sentences)
predict(word)
After calling train, predict(word) should return a possible next word, i.e., one that appears immediately after word somewhere in the training data. If word never appears, or it never has a following word (e.g., it only appears as the last token), define and document a reasonable behavior (e.g., return None).
A straightforward solution stores word -> {next_word: count} and, during predict, scans the candidates to decide what to return (O(k) time, where k is the number of distinct next-words).
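The straightforward approach above could be sketched as follows (this is one possible interpretation: the deterministic predict here returns the most frequent follower, though any documented rule would do):

```python
from collections import defaultdict

# word -> {next_word: count}, built once in train
transitions = defaultdict(lambda: defaultdict(int))

def train(sentences):
    for sentence in sentences:
        # zip pairs each token with its immediate successor
        for cur, nxt in zip(sentence, sentence[1:]):
            transitions[cur][nxt] += 1

def predict(word):
    followers = transitions.get(word)
    if not followers:
        return None  # unseen word, or word only ever appears last
    # O(k) scan over the k distinct next-words; pick the most frequent
    return max(followers, key=followers.get)

train([["I", "am", "Sam"], ["Sam", "I", "am"],
       ["I", "like", "green", "eggs", "and", "ham"]])
predict("I")    # -> "am" (follows "I" twice; "like" only once)
predict("ham")  # -> None ("ham" only appears as a last token)
```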
Follow-up 1: Optimize so that each predict(word) call takes O(1) time (not counting the cost of returning the string), by doing more work in train.
Follow-up 2: Change predict(word) so it returns a next word sampled randomly according to the empirical frequencies from training.
Example: for the word "I", if the training data implies:
"I" -> "am" occurs 2 times
"I" -> "like" occurs 1 time
then predict("I") should return "am" with probability 2/3 and "like" with probability 1/3.
For follow-up 1, predict must be deterministic (it should not be deterministic for follow-up 2, where it samples randomly).