How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during HR Screen rounds at Microsoft.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Microsoft during technical interviews.

Implement SFT Sample Packing | Microsoft Coding Question

Quick Overview

This question evaluates proficiency in preprocessing for autoregressive language models, including deterministic sequence packing, construction of loss masks, segment range bookkeeping, answer span localization, and handling truncation and padding edge cases.

Implement SFT Sample Packing

Company: Microsoft

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: HR Screen

Implement a preprocessing function for supervised fine-tuning data for an autoregressive language model. You are given a list of tokenized training samples. Each sample contains: - `prompt_tokens`: a list of token IDs for the user prompt - `answer_tokens`: a list of token IDs for the target response You are also given: - `max_length`: the fixed packed sequence length - `eos_id`: the end-of-sequence token ID - `pad_id`: the padding token ID For each sample, first form a single training example as: `prompt_tokens + answer_tokens + [eos_id]` Then pack multiple examples into fixed-length sequences using the following deterministic strategy: 1. Compute the length of each example. 2. Sort examples by descending length. 3. Place each example into the first packed sequence that still has enough remaining space; otherwise create a new packed sequence. 4. Pad every packed sequence to exactly `max_length`. For each packed sequence, return: - `input_ids`: the packed token IDs of length `max_length` - `loss_mask`: a binary array of length `max_length` where prompt tokens and padding are `0`, and answer tokens plus the trailing `eos_id` are `1` - `segment_ranges`: the `[start, end)` index range of every original sample inside the packed sequence, so downstream code can build a block-diagonal causal attention mask and prevent tokens from one sample from attending to another sample - `answer_start_positions`: the start index of each answer span in packed coordinates Edge cases: - If a single sample is longer than `max_length`, handle it explicitly by either truncating it with a clearly defined policy or skipping it. - All indices must refer to positions inside the packed sequence before padding. Implement the function and analyze the time and space complexity of your approach.

Quick Answer: This question evaluates proficiency in preprocessing for autoregressive language models, including deterministic sequence packing, construction of loss masks, segment range bookkeeping, answer span localization, and handling truncation and padding edge cases.

Implement a preprocessing function for supervised fine-tuning data for an autoregressive language model. You are given a list of tokenized training samples. Each sample is a dictionary with: - `prompt_tokens`: token IDs for the user prompt - `answer_tokens`: token IDs for the target response For each sample, first build a single training example: `prompt_tokens + answer_tokens + [eos_id]` Then pack multiple examples into fixed-length sequences of length `max_length` using this deterministic first-fit decreasing strategy: 1. Compute each example length. 2. Sort examples by descending length. If two examples have the same length, keep their original input order. 3. Place each example into the first packed sequence that still has enough remaining space. 4. If no existing packed sequence can fit it, create a new packed sequence. 5. Pad every packed sequence to exactly `max_length` with `pad_id`. For each packed sequence, return a dictionary with: - `input_ids`: the packed token IDs of length `max_length` - `loss_mask`: a binary array of length `max_length` where prompt tokens and padding are `0`, and answer tokens plus the trailing `eos_id` are `1` - `segment_ranges`: a list of `[start, end)` ranges for every original sample inside the packed sequence - `answer_start_positions`: the start index of each answer span in packed coordinates Overlong samples must be handled explicitly. In this problem, if a single sample would produce an example longer than `max_length`, skip that sample entirely. All returned indices must refer to positions before padding, though they are still valid positions inside the final padded array.

Constraints

0 <= len(samples) <= 2000
1 <= max_length <= 100000
0 <= len(prompt_tokens), len(answer_tokens) for every sample
The total number of input tokens across all prompts and answers is at most 200000
Token IDs, `eos_id`, and `pad_id` are integers

Examples

Input: ([{'prompt_tokens': [1, 2], 'answer_tokens': [3, 4]}, {'prompt_tokens': [5], 'answer_tokens': [6]}, {'prompt_tokens': [7, 8, 9], 'answer_tokens': [10]}], 8, 99, 0)

Expected Output: [{'input_ids': [1, 2, 3, 4, 99, 5, 6, 99], 'loss_mask': [0, 0, 1, 1, 1, 0, 1, 1], 'segment_ranges': [[0, 5], [5, 8]], 'answer_start_positions': [2, 6]}, {'input_ids': [7, 8, 9, 10, 99, 0, 0, 0], 'loss_mask': [0, 0, 0, 1, 1, 0, 0, 0], 'segment_ranges': [[0, 5]], 'answer_start_positions': [3]}]

Explanation: Example lengths are 5, 3, and 5. After sorting by length descending with stable tie handling, the two length-5 examples are considered first. The length-3 example fits into the first pack, producing one full pack and one padded pack.

Input: ([{'prompt_tokens': [1, 2, 3, 4], 'answer_tokens': [5, 6]}, {'prompt_tokens': [7], 'answer_tokens': [8, 9]}, {'prompt_tokens': [10, 11], 'answer_tokens': [12]}, {'prompt_tokens': [], 'answer_tokens': [13]}], 6, 99, 0)

Expected Output: [{'input_ids': [7, 8, 9, 99, 13, 99], 'loss_mask': [0, 1, 1, 1, 1, 1], 'segment_ranges': [[0, 4], [4, 6]], 'answer_start_positions': [1, 4]}, {'input_ids': [10, 11, 12, 99, 0, 0], 'loss_mask': [0, 0, 1, 1, 0, 0], 'segment_ranges': [[0, 4]], 'answer_start_positions': [2]}]

Explanation: The first sample is skipped because its example length is 7, which exceeds `max_length = 6`. The remaining examples have lengths 4, 4, and 2. The length-2 sample is packed into the first available pack with enough space.

Input: ([], 5, 99, 0)

Expected Output: []

Explanation: With no samples, there are no packed sequences to return.

Input: ([{'prompt_tokens': [1, 2], 'answer_tokens': []}, {'prompt_tokens': [3], 'answer_tokens': [4, 5]}, {'prompt_tokens': [], 'answer_tokens': []}], 4, 9, 0)

Expected Output: [{'input_ids': [3, 4, 5, 9], 'loss_mask': [0, 1, 1, 1], 'segment_ranges': [[0, 4]], 'answer_start_positions': [1]}, {'input_ids': [1, 2, 9, 9], 'loss_mask': [0, 0, 1, 1], 'segment_ranges': [[0, 3], [3, 4]], 'answer_start_positions': [2, 3]}]

Explanation: An empty answer still contributes the trailing EOS, which is supervised. The fully empty sample becomes just `[eos_id]` with loss mask `[1]`, and it is packed after the length-3 sample.

Hints

Build each sample's tokens, loss mask, and answer start offset before you start packing.
To make the packing deterministic, sort by descending example length and then scan existing packs from left to right to find the first one with enough remaining space.

Quick Overview