PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates dynamic batching, per-request state management, and sequence-decoding correctness for language-model inference, including handling stop conditions, max-token limits, and maintaining a correct slot-to-request mapping.

  • medium
  • xAI
  • Coding & Algorithms
  • Machine Learning Engineer

Implement dynamic batching for token decoding

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

You are given a black-box “simulated language model” interface that can advance many sequences in a batch. ## Model interface - Tokens are integers. - `model_next(batch_prefixes) -> next_tokens` - `batch_prefixes` is a list of token lists, one per active sequence in the current batch. - `next_tokens` is a list of integers of the same length, where `next_tokens[i]` is the next generated token for `batch_prefixes[i]`. ## Requests Each request/sequence has: - `prompt_tokens`: initial prefix tokens - `max_tokens`: maximum number of *generated* tokens allowed (not counting the prompt) - A stopping rule: - either a `stop_token` (single token), or - a `stop_sequence` (a list of tokens that, when it appears as a suffix of the generated output, ends generation) - A callback to return the final generated tokens (or you may collect results to return at the end). ## Batch execution requirement Implement a decoding/sampling engine with **dynamic batching**: - There is a fixed batch capacity `B`. - Requests arrive in a waiting queue (you can assume they are all available initially, or you can model an input queue). - You repeatedly call `model_next` to advance active sequences. - Sequences may finish at different times due to: - reaching `max_tokens`, or - hitting the stop condition. - When a sequence finishes, its slot becomes free and should be **refilled** from the waiting queue if possible. - Near the end, the batch may be partially filled; your code must handle `len(active) < B` correctly. ## Correctness requirement Maintain a correct mapping between **batch slots** and **requests** so that tokens and final outputs are never mixed up after refilling (e.g., via a `slot_id -> request_id` mapping). ## Task Write a function (or class) that runs this dynamic-batching decoding loop until all requests are completed, and returns (or callbacks) the generated outputs per request. Clearly define: - your data structures (active slots, waiting queue, per-request state), - the main loop and termination condition, - how you detect stop conditions (especially `stop_sequence`), - and how you handle partially filled batches.

Quick Answer: This question evaluates dynamic batching, per-request state management, and sequence-decoding correctness for language-model inference, including handling stop conditions, max-token limits, and maintaining a correct slot-to-request mapping.

You are given a deterministic simulated language model that can advance multiple sequences at once. Implement `solution(batch_size, requests, next_token_map)` to run a dynamic-batching decode loop. Each request is a dictionary with these keys: - `'prompt_tokens'`: list of integers, the initial prefix - `'max_tokens'`: maximum number of generated tokens allowed - `'stop_token'`: an integer stop token, or `None` - `'stop_sequence'`: a list of integers stop sequence, or `None` Exactly one of `'stop_token'` or `'stop_sequence'` will be non-`None`. A request with `max_tokens == 0` finishes immediately with an empty generated output. The simulated model is defined by `next_token_map`: for any active prefix, the next token is `next_token_map[tuple(prefix)]`. Conceptually, a black-box `model_next(batch_prefixes)` returns the next token for every currently active sequence. Your engine must use dynamic batching with a fixed batch capacity `batch_size`: - Keep waiting requests in the given order. - Fill empty batch slots from the waiting queue. - On each iteration, call the model once using only the currently active prefixes. - Different requests may finish at different times due to `max_tokens`, `stop_token`, or `stop_sequence`. - When a request finishes, free its slot and refill it before the next model call if another request is waiting. - Near the end, the batch may be partially filled; this must still work correctly. Return the generated tokens for every request in the original request order. If a `stop_token` or `stop_sequence` is hit, include the token(s) that triggered the stop in the returned output. The main challenge is maintaining a correct mapping between batch slots and request IDs so outputs are never mixed up after refilling.

Constraints

  • 1 <= batch_size <= 1000
  • 0 <= len(requests) <= 10000
  • 0 <= request['max_tokens'] <= 10000
  • Exactly one of request['stop_token'] or request['stop_sequence'] is non-None
  • If present, request['stop_sequence'] is non-empty
  • All tokens are integers
  • The sum of all generated tokens actually produced across all requests is at most 100000
  • Every prefix your engine queries exists in next_token_map

Examples

Input: (2, [{'prompt_tokens': [1], 'max_tokens': 4, 'stop_token': 4, 'stop_sequence': None}, {'prompt_tokens': [5], 'max_tokens': 5, 'stop_token': None, 'stop_sequence': [7, 8]}, {'prompt_tokens': [9], 'max_tokens': 2, 'stop_token': 99, 'stop_sequence': None}], {(1,): 2, (1, 2): 3, (1, 2, 3): 4, (5,): 6, (5, 6): 7, (5, 6, 7): 8, (9,): 10, (9, 10): 11})

Expected Output: [[2, 3, 4], [6, 7, 8], [10, 11]]

Explanation: Requests 0 and 1 start first. Both finish after the third model call, freeing slots. Request 2 is then inserted and runs alone. The stop token 4 and stop sequence [7, 8] are included in the outputs.

Input: (3, [{'prompt_tokens': [4], 'max_tokens': 0, 'stop_token': 9, 'stop_sequence': None}, {'prompt_tokens': [], 'max_tokens': 3, 'stop_token': None, 'stop_sequence': [7, 8]}, {'prompt_tokens': [2], 'max_tokens': 1, 'stop_token': 5, 'stop_sequence': None}], {(): 7, (7,): 8, (2,): 3})

Expected Output: [[], [7, 8], [3]]

Explanation: The first request finishes immediately because max_tokens is 0. The batch is partially filled. The second request stops when its generated suffix becomes [7, 8], and the third request stops after one token because of max_tokens.

Input: (2, [{'prompt_tokens': [1], 'max_tokens': 4, 'stop_token': 2, 'stop_sequence': None}, {'prompt_tokens': [1], 'max_tokens': 3, 'stop_token': None, 'stop_sequence': [2, 3]}, {'prompt_tokens': [1], 'max_tokens': 4, 'stop_token': 4, 'stop_sequence': None}, {'prompt_tokens': [5], 'max_tokens': 1, 'stop_token': 9, 'stop_sequence': None}], {(1,): 2, (1, 2): 3, (1, 2, 3): 4, (5,): 6})

Expected Output: [[2], [2, 3], [2, 3, 4], [6]]

Explanation: Multiple requests share the same prompt, so keeping the slot-to-request mapping correct matters. Request 0 stops after token 2, its slot is refilled, and later requests continue from their own correct states.

Input: (2, [], {})

Expected Output: []

Explanation: No requests means there is nothing to decode.

Solution

def solution(batch_size, requests, next_token_map):
    def model_next(batch_prefixes):
        next_tokens = []
        for prefix in batch_prefixes:
            key = tuple(prefix)
            if key not in next_token_map:
                raise KeyError(f'Missing next token for prefix {key}')
            next_tokens.append(next_token_map[key])
        return next_tokens

    n = len(requests)
    results = [[] for _ in range(n)]
    prefixes = [req['prompt_tokens'][:] for req in requests]
    active_slots = [None] * batch_size
    waiting = 0

    def fill_slot(slot):
        nonlocal waiting
        while waiting < n:
            if requests[waiting]['max_tokens'] == 0:
                waiting += 1
                continue
            active_slots[slot] = waiting
            waiting += 1
            return
        active_slots[slot] = None

    for slot in range(batch_size):
        fill_slot(slot)

    while True:
        batch_prefixes = []
        slot_order = []

        for slot, req_id in enumerate(active_slots):
            if req_id is not None:
                slot_order.append(slot)
                batch_prefixes.append(prefixes[req_id])

        if not batch_prefixes:
            break

        next_tokens = model_next(batch_prefixes)
        finished_slots = []

        for i, slot in enumerate(slot_order):
            req_id = active_slots[slot]
            token = next_tokens[i]

            prefixes[req_id].append(token)
            results[req_id].append(token)

            req = requests[req_id]
            done = len(results[req_id]) >= req['max_tokens']

            if not done and req.get('stop_token') is not None and token == req['stop_token']:
                done = True

            stop_sequence = req.get('stop_sequence')
            if not done and stop_sequence is not None:
                m = len(stop_sequence)
                if m <= len(results[req_id]) and results[req_id][-m:] == stop_sequence:
                    done = True

            if done:
                finished_slots.append(slot)

        for slot in finished_slots:
            active_slots[slot] = None

        for slot in finished_slots:
            fill_slot(slot)

    return results

Time complexity: O(T * L + C * B), where T is the total number of generated tokens, L is the maximum stop_sequence length, C is the number of model calls, and B is batch_size. Space complexity: O(P + T + B), where P is the total prompt length, T is the total generated output length, and B is batch_size.

Hints

  1. Use a fixed-size array for batch slots, where each slot stores the request ID currently occupying that slot.
  2. For a stop sequence, you only need to compare the end of the generated output after appending the newest token.
Last updated: May 24, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Flatten and unflatten nested Python structures - xAI (nan)
  • Compute dasher pay from order events - xAI (nan)
  • Compute total active time per Twitter Space - xAI (medium)
  • Design a Recoverable Iterator - xAI (medium)
  • Implement Distributed Matrix Multiplication - xAI (hard)