Implement a preprocessing function for supervised fine-tuning data for an autoregressive language model.
You are given a list of tokenized training samples. Each sample contains:
- `prompt_tokens`: a list of token IDs for the user prompt
- `answer_tokens`: a list of token IDs for the target response
You are also given:
- `max_length`: the fixed packed sequence length
- `eos_id`: the end-of-sequence token ID
- `pad_id`: the padding token ID
For each sample, first form a single training example as:
`prompt_tokens + answer_tokens + [eos_id]`
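As a toy illustration (the token IDs here are arbitrary, chosen only for the example), forming one training example is a plain concatenation:

```python
# Hypothetical token IDs for one sample.
prompt_tokens = [5, 6]   # user prompt
answer_tokens = [7, 8]   # target response
eos_id = 2

# One training example: prompt, then answer, then the EOS marker.
example = prompt_tokens + answer_tokens + [eos_id]
print(example)  # [5, 6, 7, 8, 2]
```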
Then pack multiple examples into fixed-length sequences using the following deterministic strategy:
- Compute the length of each example.
- Sort examples by descending length (a stable sort keeps the strategy deterministic for equal lengths).
- Place each example into the first packed sequence that still has enough remaining space; otherwise create a new packed sequence. This is first-fit decreasing bin packing.
- Pad every packed sequence to exactly `max_length`.
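The steps above amount to first-fit decreasing bin packing over example lengths. A minimal sketch of just the assignment step (the function name `pack_lengths` is illustrative, not part of the spec) might look like:

```python
def pack_lengths(lengths, max_length):
    """First-fit decreasing: assign each example index to a bin.

    Returns a list of bins, each a list of example indices, such that
    the summed length per bin never exceeds max_length.
    """
    # Stable sort by descending length keeps ties in input order,
    # so the packing is deterministic.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, remaining = [], []
    for i in order:
        for b, room in enumerate(remaining):
            if lengths[i] <= room:          # first bin with enough space
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:                               # no bin fits: open a new one
            bins.append([i])
            remaining.append(max_length - lengths[i])
    return bins
```

For example, `pack_lengths([3, 5, 2, 4], 7)` places the length-5 and length-2 examples together and the length-4 and length-3 examples together.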
For each packed sequence, return:
- `input_ids`: the packed token IDs, of length `max_length`
- `loss_mask`: a binary array of length `max_length` where prompt tokens and padding are `0`, and answer tokens plus the trailing `eos_id` are `1`
- `segment_ranges`: the `[start, end)` index range of every original sample inside the packed sequence, so downstream code can build a block-diagonal causal attention mask and prevent tokens from one sample from attending to another sample
- `answer_start_positions`: the start index of each answer span, in packed coordinates
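For a single example, the per-token loss mask can be derived directly from the prompt length; a small sketch (the helper name `example_fields` is illustrative):

```python
def example_fields(prompt_tokens, answer_tokens, eos_id):
    # One example's tokens and per-token loss mask:
    # 0 over the prompt, 1 over the answer and the trailing EOS.
    tokens = prompt_tokens + answer_tokens + [eos_id]
    mask = [0] * len(prompt_tokens) + [1] * (len(answer_tokens) + 1)
    return tokens, mask
```

Here `example_fields([5, 6], [7], 2)` yields tokens `[5, 6, 7, 2]` with mask `[0, 0, 1, 1]`; the packed-sequence mask is these per-example masks concatenated, with zeros appended over padding.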
Edge cases:
- If a single sample is longer than `max_length`, handle it explicitly: either truncate it with a clearly defined policy or skip it.
- All indices must refer to positions inside the packed sequence, before padding.
Implement the function and analyze the time and space complexity of your approach.
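One possible end-to-end sketch follows. The names (`pack_sft_samples`, `PackedSequence`) are illustrative, samples are assumed to be dicts with `prompt_tokens`/`answer_tokens` keys, and over-long samples are skipped, which is one of the two allowed edge-case policies:

```python
from typing import List, NamedTuple, Tuple

class PackedSequence(NamedTuple):
    input_ids: List[int]
    loss_mask: List[int]
    segment_ranges: List[Tuple[int, int]]   # [start, end) per original sample
    answer_start_positions: List[int]

def pack_sft_samples(samples, max_length, eos_id, pad_id):
    # Form each example and record where its answer span begins.
    examples = []
    for s in samples:
        tokens = s["prompt_tokens"] + s["answer_tokens"] + [eos_id]
        if len(tokens) > max_length:
            continue                         # skip policy for over-long samples
        examples.append((tokens, len(s["prompt_tokens"])))

    # First-fit decreasing; Python's stable sort keeps the order deterministic.
    examples.sort(key=lambda e: len(e[0]), reverse=True)
    packed = []                              # working state per bin
    for tokens, ans_off in examples:
        target = None
        for p in packed:
            if max_length - len(p["ids"]) >= len(tokens):
                target = p                   # first bin with enough room
                break
        if target is None:
            target = {"ids": [], "mask": [], "segs": [], "ans": []}
            packed.append(target)
        start = len(target["ids"])           # packed coordinates, pre-padding
        target["segs"].append((start, start + len(tokens)))
        target["ans"].append(start + ans_off)
        target["ids"].extend(tokens)
        target["mask"].extend([0] * ans_off + [1] * (len(tokens) - ans_off))

    # Pad every bin to exactly max_length; padding gets loss mask 0.
    out = []
    for p in packed:
        pad = max_length - len(p["ids"])
        out.append(PackedSequence(
            input_ids=p["ids"] + [pad_id] * pad,
            loss_mask=p["mask"] + [0] * pad,
            segment_ranges=p["segs"],
            answer_start_positions=p["ans"],
        ))
    return out
```

Under these assumptions the cost is O(n log n) for the sort plus O(n · b) for the first-fit scan, where n is the number of examples and b the number of packed sequences (O(n²) worst case), and space is O(total tokens emitted). The first-fit scan could be reduced with a tree over bin capacities, but the linear scan matches the strategy as stated.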