How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at xAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at xAI during technical interviews.

Implement dynamic batching for token decoding

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

You are given a black-box “simulated language model” interface that can advance many sequences in a batch. ## Model interface - Tokens are integers. - `model_next(batch_prefixes) -> next_tokens` - `batch_prefixes` is a list of token lists, one per active sequence in the current batch. - `next_tokens` is a list of integers of the same length, where `next_tokens[i]` is the next generated token for `batch_prefixes[i]`. ## Requests Each request/sequence has: - `prompt_tokens`: initial prefix tokens - `max_tokens`: maximum number of *generated* tokens allowed (not counting the prompt) - A stopping rule: - either a `stop_token` (single token), or - a `stop_sequence` (a list of tokens that, when it appears as a suffix of the generated output, ends generation) - A callback to return the final generated tokens (or you may collect results to return at the end). ## Batch execution requirement Implement a decoding/sampling engine with **dynamic batching**: - There is a fixed batch capacity `B`. - Requests arrive in a waiting queue (you can assume they are all available initially, or you can model an input queue). - You repeatedly call `model_next` to advance active sequences. - Sequences may finish at different times due to: - reaching `max_tokens`, or - hitting the stop condition. - When a sequence finishes, its slot becomes free and should be **refilled** from the waiting queue if possible. - Near the end, the batch may be partially filled; your code must handle `len(active) < B` correctly. ## Correctness requirement Maintain a correct mapping between **batch slots** and **requests** so that tokens and final outputs are never mixed up after refilling (e.g., via a `slot_id -> request_id` mapping). ## Task Write a function (or class) that runs this dynamic-batching decoding loop until all requests are completed, and returns (or callbacks) the generated outputs per request. Clearly define: - your data structures (active slots, waiting queue, per-request state), - the main loop and termination condition, - how you detect stop conditions (especially `stop_sequence`), - and how you handle partially filled batches.

Quick Answer: This question evaluates dynamic batching, per-request state management, and sequence-decoding correctness for language-model inference, including handling stop conditions, max-token limits, and maintaining a correct slot-to-request mapping.

You are given a deterministic simulated language model that can advance multiple sequences at once. Implement `solution(batch_size, requests, next_token_map)` to run a dynamic-batching decode loop. Each request is a dictionary with these keys: - `'prompt_tokens'`: list of integers, the initial prefix - `'max_tokens'`: maximum number of generated tokens allowed - `'stop_token'`: an integer stop token, or `None` - `'stop_sequence'`: a list of integers stop sequence, or `None` Exactly one of `'stop_token'` or `'stop_sequence'` will be non-`None`. A request with `max_tokens == 0` finishes immediately with an empty generated output. The simulated model is defined by `next_token_map`: for any active prefix, the next token is `next_token_map[tuple(prefix)]`. Conceptually, a black-box `model_next(batch_prefixes)` returns the next token for every currently active sequence. Your engine must use dynamic batching with a fixed batch capacity `batch_size`: - Keep waiting requests in the given order. - Fill empty batch slots from the waiting queue. - On each iteration, call the model once using only the currently active prefixes. - Different requests may finish at different times due to `max_tokens`, `stop_token`, or `stop_sequence`. - When a request finishes, free its slot and refill it before the next model call if another request is waiting. - Near the end, the batch may be partially filled; this must still work correctly. Return the generated tokens for every request in the original request order. If a `stop_token` or `stop_sequence` is hit, include the token(s) that triggered the stop in the returned output. The main challenge is maintaining a correct mapping between batch slots and request IDs so outputs are never mixed up after refilling.

Constraints

1 <= batch_size <= 1000
0 <= len(requests) <= 10000
0 <= request['max_tokens'] <= 10000
Exactly one of request['stop_token'] or request['stop_sequence'] is non-None
If present, request['stop_sequence'] is non-empty
All tokens are integers
The sum of all generated tokens actually produced across all requests is at most 100000
Every prefix your engine queries exists in next_token_map

Examples

Input: (2, [{'prompt_tokens': [1], 'max_tokens': 4, 'stop_token': 4, 'stop_sequence': None}, {'prompt_tokens': [5], 'max_tokens': 5, 'stop_token': None, 'stop_sequence': [7, 8]}, {'prompt_tokens': [9], 'max_tokens': 2, 'stop_token': 99, 'stop_sequence': None}], {(1,): 2, (1, 2): 3, (1, 2, 3): 4, (5,): 6, (5, 6): 7, (5, 6, 7): 8, (9,): 10, (9, 10): 11})

Expected Output: [[2, 3, 4], [6, 7, 8], [10, 11]]

Explanation: Requests 0 and 1 start first. Both finish after the third model call, freeing slots. Request 2 is then inserted and runs alone. The stop token 4 and stop sequence [7, 8] are included in the outputs.

Input: (3, [{'prompt_tokens': [4], 'max_tokens': 0, 'stop_token': 9, 'stop_sequence': None}, {'prompt_tokens': [], 'max_tokens': 3, 'stop_token': None, 'stop_sequence': [7, 8]}, {'prompt_tokens': [2], 'max_tokens': 1, 'stop_token': 5, 'stop_sequence': None}], {(): 7, (7,): 8, (2,): 3})

Expected Output: [[], [7, 8], [3]]

Explanation: The first request finishes immediately because max_tokens is 0. The batch is partially filled. The second request stops when its generated suffix becomes [7, 8], and the third request stops after one token because of max_tokens.

Hints

Use a fixed-size array for batch slots, where each slot stores the request ID currently occupying that slot.
For a stop sequence, you only need to compare the end of the generated output after appending the newest token.

Quick Overview

This question evaluates dynamic batching, per-request state management, and sequence-decoding correctness for language-model inference, including handling stop conditions, max-token limits, and maintaining a correct slot-to-request mapping.