You are given tokenized training sentences and query words. Build a next-word model from adjacent word pairs within the same sentence; pairs never cross sentence boundaries. For each query word, return a next word drawn at random from the empirical distribution of its observed followers in training.
Example: if 'I' is followed by 'am' 2 times and 'like' 1 time, then predictions from 'I' should choose 'am' with probability 2/3 and 'like' with probability 1/3.
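A minimal sketch of the counting step, assuming Python; the helper name build_counts is illustrative only (the required interface is the single function described below):

from collections import defaultdict

def build_counts(sentences):
    # Map each word to the counts of the words observed right after it.
    # Pairs are taken within a sentence only, never across sentences.
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return counts

For the sentences [['I', 'am', 'Sam'], ['Sam', 'I', 'am']] this yields counts['I'] == {'am': 2} and counts['Sam'] == {'I': 1}.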
To make the output deterministic for testing, use this exact seeded pseudo-random process:
1. Maintain an integer state, initially equal to the seed.
2. For each query word that has at least one possible next word, advance the state:
state = (1103515245 * state + 12345) % 2147483648
3. Let r = state % total_count(word), where total_count(word) is the total number of observed followers of that word.
4. Sort the candidate next words for that word lexicographically, build cumulative counts in that order, and return the first word whose cumulative count is greater than r.
5. If a query word has no possible next word, return None and do not advance the state.
Implement everything inside one function.
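A minimal single-function sketch of the whole task, assuming Python; the name predict_next and the argument order are taken from the example inputs below, not from a fixed specification:

def predict_next(sentences, queries, seed):
    # Count adjacent pairs within each sentence.
    counts = {}
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts.setdefault(prev, {})
            counts[prev][nxt] = counts[prev].get(nxt, 0) + 1

    results = []
    state = seed  # step 1: the RNG state starts at the seed
    for word in queries:
        followers = counts.get(word)
        if not followers:
            results.append(None)  # step 5: no follower, state untouched
            continue
        # Step 2: advance the linear congruential generator.
        state = (1103515245 * state + 12345) % 2147483648
        r = state % sum(followers.values())  # step 3
        # Step 4: scan candidates in lexicographic order with cumulative counts.
        cumulative = 0
        for candidate in sorted(followers):
            cumulative += followers[candidate]
            if cumulative > r:
                results.append(candidate)
                break
    return results

On the first example below, predict_next(sentences, ['I', 'I', 'Sam', 'ham'], 1) returns ['am', 'am', 'I', None].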
Examples
Input: ([['I', 'am', 'Sam'], ['Sam', 'I', 'am'], ['I', 'like', 'green', 'eggs', 'and', 'ham']], ['I', 'I', 'Sam', 'ham'], 1)
Expected Output: ['am', 'am', 'I', None]
Explanation: For 'I', the distribution is {'am': 2, 'like': 1}. With seed 1 the generator yields r = 0 on each of the first three queries, selecting 'am', 'am', and then 'I' (the only follower of 'Sam'). 'ham' is never followed by anything in training, so its result is None.
Input: ([['a', 'c'], ['a', 'b']], ['a', 'a', 'a'], 0)
Expected Output: ['c', 'b', 'c']
Explanation: 'a' has one count each for 'b' and 'c'. With lexicographic order ['b', 'c'], the cumulative counts are 1 and 2, and the successive states 12345, 1406932606, and 654583775 give r = 1, 0, 1, so the sampled sequence is 'c', 'b', 'c'.
Input: ([['solo']], ['solo', 'x'], 5)
Expected Output: [None, None]
Explanation: Neither query has any outgoing pair, so both return None and the RNG state never advances.
Input: ([['a', 'b'], ['a', 'c']], ['x', 'a', 'a'], 0)
Expected Output: [None, 'c', 'b']
Explanation: The first query has no prediction, so the RNG state is unchanged before the next two weighted samples (which therefore match the first two draws of the seed-0 example above).