PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates proficiency in preprocessing for autoregressive language models, including deterministic sequence packing, construction of loss masks, segment range bookkeeping, answer span localization, and handling truncation and padding edge cases.

  • medium
  • Microsoft
  • Coding & Algorithms
  • Machine Learning Engineer

Implement SFT Sample Packing

Company: Microsoft

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: HR Screen

Implement a preprocessing function for supervised fine-tuning data for an autoregressive language model. You are given a list of tokenized training samples. Each sample contains: - `prompt_tokens`: a list of token IDs for the user prompt - `answer_tokens`: a list of token IDs for the target response You are also given: - `max_length`: the fixed packed sequence length - `eos_id`: the end-of-sequence token ID - `pad_id`: the padding token ID For each sample, first form a single training example as: `prompt_tokens + answer_tokens + [eos_id]` Then pack multiple examples into fixed-length sequences using the following deterministic strategy: 1. Compute the length of each example. 2. Sort examples by descending length. 3. Place each example into the first packed sequence that still has enough remaining space; otherwise create a new packed sequence. 4. Pad every packed sequence to exactly `max_length`. For each packed sequence, return: - `input_ids`: the packed token IDs of length `max_length` - `loss_mask`: a binary array of length `max_length` where prompt tokens and padding are `0`, and answer tokens plus the trailing `eos_id` are `1` - `segment_ranges`: the `[start, end)` index range of every original sample inside the packed sequence, so downstream code can build a block-diagonal causal attention mask and prevent tokens from one sample from attending to another sample - `answer_start_positions`: the start index of each answer span in packed coordinates Edge cases: - If a single sample is longer than `max_length`, handle it explicitly by either truncating it with a clearly defined policy or skipping it. - All indices must refer to positions inside the packed sequence before padding. Implement the function and analyze the time and space complexity of your approach.

Quick Answer: This question evaluates proficiency in preprocessing for autoregressive language models, including deterministic sequence packing, construction of loss masks, segment range bookkeeping, answer span localization, and handling truncation and padding edge cases.

Implement a preprocessing function for supervised fine-tuning data for an autoregressive language model. You are given a list of tokenized training samples. Each sample is a dictionary with: - `prompt_tokens`: token IDs for the user prompt - `answer_tokens`: token IDs for the target response For each sample, first build a single training example: `prompt_tokens + answer_tokens + [eos_id]` Then pack multiple examples into fixed-length sequences of length `max_length` using this deterministic first-fit decreasing strategy: 1. Compute each example length. 2. Sort examples by descending length. If two examples have the same length, keep their original input order. 3. Place each example into the first packed sequence that still has enough remaining space. 4. If no existing packed sequence can fit it, create a new packed sequence. 5. Pad every packed sequence to exactly `max_length` with `pad_id`. For each packed sequence, return a dictionary with: - `input_ids`: the packed token IDs of length `max_length` - `loss_mask`: a binary array of length `max_length` where prompt tokens and padding are `0`, and answer tokens plus the trailing `eos_id` are `1` - `segment_ranges`: a list of `[start, end)` ranges for every original sample inside the packed sequence - `answer_start_positions`: the start index of each answer span in packed coordinates Overlong samples must be handled explicitly. In this problem, if a single sample would produce an example longer than `max_length`, skip that sample entirely. All returned indices must refer to positions before padding, though they are still valid positions inside the final padded array.

Constraints

  • 0 <= len(samples) <= 2000
  • 1 <= max_length <= 100000
  • 0 <= len(prompt_tokens), len(answer_tokens) for every sample
  • The total number of input tokens across all prompts and answers is at most 200000
  • Token IDs, `eos_id`, and `pad_id` are integers

Examples

Input: ([{'prompt_tokens': [1, 2], 'answer_tokens': [3, 4]}, {'prompt_tokens': [5], 'answer_tokens': [6]}, {'prompt_tokens': [7, 8, 9], 'answer_tokens': [10]}], 8, 99, 0)

Expected Output: [{'input_ids': [1, 2, 3, 4, 99, 5, 6, 99], 'loss_mask': [0, 0, 1, 1, 1, 0, 1, 1], 'segment_ranges': [[0, 5], [5, 8]], 'answer_start_positions': [2, 6]}, {'input_ids': [7, 8, 9, 10, 99, 0, 0, 0], 'loss_mask': [0, 0, 0, 1, 1, 0, 0, 0], 'segment_ranges': [[0, 5]], 'answer_start_positions': [3]}]

Explanation: Example lengths are 5, 3, and 5. After sorting by length descending with stable tie handling, the two length-5 examples are considered first. The length-3 example fits into the first pack, producing one full pack and one padded pack.

Input: ([{'prompt_tokens': [1, 2, 3, 4], 'answer_tokens': [5, 6]}, {'prompt_tokens': [7], 'answer_tokens': [8, 9]}, {'prompt_tokens': [10, 11], 'answer_tokens': [12]}, {'prompt_tokens': [], 'answer_tokens': [13]}], 6, 99, 0)

Expected Output: [{'input_ids': [7, 8, 9, 99, 13, 99], 'loss_mask': [0, 1, 1, 1, 1, 1], 'segment_ranges': [[0, 4], [4, 6]], 'answer_start_positions': [1, 4]}, {'input_ids': [10, 11, 12, 99, 0, 0], 'loss_mask': [0, 0, 1, 1, 0, 0], 'segment_ranges': [[0, 4]], 'answer_start_positions': [2]}]

Explanation: The first sample is skipped because its example length is 7, which exceeds `max_length = 6`. The remaining examples have lengths 4, 4, and 2. The length-2 sample is packed into the first available pack with enough space.

Input: ([], 5, 99, 0)

Expected Output: []

Explanation: With no samples, there are no packed sequences to return.

Input: ([{'prompt_tokens': [1, 2], 'answer_tokens': []}, {'prompt_tokens': [3], 'answer_tokens': [4, 5]}, {'prompt_tokens': [], 'answer_tokens': []}], 4, 9, 0)

Expected Output: [{'input_ids': [3, 4, 5, 9], 'loss_mask': [0, 1, 1, 1], 'segment_ranges': [[0, 4]], 'answer_start_positions': [1]}, {'input_ids': [1, 2, 9, 9], 'loss_mask': [0, 0, 1, 1], 'segment_ranges': [[0, 3], [3, 4]], 'answer_start_positions': [2, 3]}]

Explanation: An empty answer still contributes the trailing EOS, which is supervised. The fully empty sample becomes just `[eos_id]` with loss mask `[1]`, and it is packed after the length-3 sample.

Hints

  1. Build each sample's tokens, loss mask, and answer start offset before you start packing.
  2. To make the packing deterministic, sort by descending example length and then scan existing packs from left to right to find the first one with enough remaining space.
Last updated: May 7, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Return Top K Open Businesses - Microsoft (hard)
  • Implement Memory Allocation and In-Memory Records - Microsoft (medium)
  • Implement K-Means and Detect Divisible Subarrays - Microsoft (medium)
  • Sort Three Categories In Place - Microsoft (medium)
  • Retain Top K Elements - Microsoft (medium)