This question evaluates dynamic batching, per-request state management, and sequence-decoding correctness for language-model inference, including handling stop conditions and max-token limits and maintaining a correct slot-to-request mapping.
You are given a black-box “simulated language model” interface that can advance many sequences in a batch.
model_next(batch_prefixes) -> next_tokens

batch_prefixes is a list of token lists, one per active sequence in the current batch. next_tokens is a list of integers of the same length, where next_tokens[i] is the next generated token for batch_prefixes[i].
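For concreteness, a minimal stand-in for this interface might look as follows. The deterministic token rule and the vocabulary size of 50 are assumptions made purely so the stub is testable; only the signature matters.

```python
# Hypothetical stub for the black-box model. The token rule (sum of the
# prefix plus its length, modulo an assumed vocabulary of 50) is arbitrary;
# the contract is simply: one next token per active prefix, in order.
def model_next(batch_prefixes):
    return [(sum(prefix) + len(prefix)) % 50 for prefix in batch_prefixes]
```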
Each request/sequence has:
- prompt_tokens: initial prefix tokens
- max_tokens: maximum number of generated tokens allowed (not counting the prompt)
- stop_token (a single token), or stop_sequence (a list of tokens that, when it appears as a suffix of the generated output, ends generation)
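One way to carry this per-request state is a small dataclass; the field names here are illustrative, not mandated by the problem.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Request:
    request_id: int
    prompt_tokens: List[int]                    # initial prefix tokens
    max_tokens: int                             # cap on generated tokens (prompt excluded)
    stop_token: Optional[int] = None            # single stop token, if any
    stop_sequence: Optional[List[int]] = None   # stop suffix over generated output, if any
    generated: List[int] = field(default_factory=list)  # tokens produced so far
```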
Implement a decoding/sampling engine with dynamic batching (a sketch follows this list):
- Keep at most B sequences active at a time, where B is the batch size.
- Repeatedly call model_next to advance the active sequences.
- Retire a sequence as soon as it satisfies its stop condition or reaches max_tokens, and refill the freed slot from the waiting requests, handling len(active) < B correctly.
- Maintain a correct mapping between batch slots and requests so that tokens and final outputs are never mixed up after refilling (e.g., via a slot_id -> request_id mapping).
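A minimal sketch of such a loop, assuming the Request dataclass and stub model_next from the snippets above; names such as run_dynamic_batching and is_finished are illustrative, not part of the required interface.

```python
from collections import deque

def is_finished(req: Request) -> bool:
    """A request is done when it has generated max_tokens tokens, emitted its
    stop_token, or its generated output ends with its stop_sequence."""
    if len(req.generated) >= req.max_tokens:
        return True
    if req.stop_token is not None and req.generated and req.generated[-1] == req.stop_token:
        return True
    if req.stop_sequence and len(req.generated) >= len(req.stop_sequence):
        return req.generated[-len(req.stop_sequence):] == req.stop_sequence
    return False

def run_dynamic_batching(requests, model_next, batch_size):
    """Decode all requests with at most batch_size active slots at once.
    Returns {request_id: generated token list}."""
    pending = deque(requests)
    slots = {}      # slot_id -> Request: ties every generated token to its request
    results = {}

    while pending or slots:
        # Refill any free slot from the pending queue (dynamic batching).
        for slot_id in range(batch_size):
            if slot_id not in slots and pending:
                slots[slot_id] = pending.popleft()

        # Advance every active sequence by one token in a single batched call.
        slot_ids = sorted(slots)
        batch_prefixes = [slots[s].prompt_tokens + slots[s].generated for s in slot_ids]
        next_tokens = model_next(batch_prefixes)

        # Record tokens slot by slot, then retire finished sequences so their
        # slots can be refilled on the next iteration.
        for slot_id, token in zip(slot_ids, next_tokens):
            req = slots[slot_id]
            req.generated.append(token)
            if is_finished(req):
                results[req.request_id] = req.generated
                del slots[slot_id]

    return results
```

This sketch keeps any emitted stop_token or stop_sequence in the returned output and assumes max_tokens >= 1; both are design choices a full solution should state explicitly.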
Write a function (or class) that runs this dynamic-batching decoding loop until all requests are completed, and returns (or delivers via callbacks) the generated outputs per request.
Clearly define when a request is considered complete (emitting its stop_token, matching its stop_sequence, or reaching max_tokens).
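Finally, a small usage sketch tying the pieces above together; the request values are arbitrary.

```python
# Two toy requests: one bounded only by max_tokens, one that would also stop
# early if the (arbitrarily chosen) stop_token 7 ever appears.
requests = [
    Request(request_id=0, prompt_tokens=[1, 2, 3], max_tokens=5),
    Request(request_id=1, prompt_tokens=[4], max_tokens=8, stop_token=7),
]
outputs = run_dynamic_batching(requests, model_next, batch_size=2)
print(outputs)  # {request_id: generated token list}, e.g. {0: [...], 1: [...]}
```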