This question evaluates skills in language modeling, data structures, algorithmic optimization, and probabilistic sampling within the Coding & Algorithms domain for a Data Scientist role; it primarily tests practical implementation and performance trade-offs rather than purely theoretical concepts.
You are given a training corpus where each training example is a tokenized sentence (array of words). Example training sentences:
["I", "am", "Sam"]
["Sam", "I", "am"]
["I", "like", "green", "eggs", "and", "ham"]
Implement two functions:
train(sentences)
predict(word)
After calling train, predict(word) should return a possible next word, i.e., one that appears immediately after word somewhere in the training data. If word never appears, or it never has a following word (e.g., it only appears as the last token), define and document a reasonable behavior (e.g., return None).
A straightforward solution stores word -> {next_word: count} and, during predict, scans the candidates to decide what to return (O(k) time, where k is the number of distinct next-words).
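The straightforward approach above could be sketched as follows (this is one possible interpretation: the deterministic predict here returns the most frequent follower, though any documented rule would do):

```python
from collections import defaultdict

# word -> {next_word: count}, built once in train
transitions = defaultdict(lambda: defaultdict(int))

def train(sentences):
    for sentence in sentences:
        # zip pairs each token with its immediate successor
        for cur, nxt in zip(sentence, sentence[1:]):
            transitions[cur][nxt] += 1

def predict(word):
    followers = transitions.get(word)
    if not followers:
        return None  # unseen word, or word only ever appears last
    # O(k) scan over the k distinct next-words; pick the most frequent
    return max(followers, key=followers.get)

train([["I", "am", "Sam"], ["Sam", "I", "am"],
       ["I", "like", "green", "eggs", "and", "ham"]])
predict("I")    # -> "am" (follows "I" twice; "like" only once)
predict("ham")  # -> None ("ham" only appears as a last token)
```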
Follow-up 1: Optimize so that each predict(word) call takes O(1) time (not counting the cost of returning the string), by doing more work in train.
Follow-up 2: Change predict(word) so it returns a next word sampled randomly according to the empirical frequencies from training.
Example: for the word "I", if the training data implies:
"I" -> "am" occurs 2 times
"I" -> "like" occurs 1 time
then predict("I") should return "am" with probability 2/3 and "like" with probability 1/3.
For follow-up 1, predict must be deterministic (it should not be deterministic for follow-up 2, where it samples randomly).