Implement SFT Sample Packing
Company: Microsoft
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: HR Screen
Quick Answer: This question evaluates proficiency in preprocessing for autoregressive language models, including deterministic sequence packing, construction of loss masks, segment range bookkeeping, answer span localization, and handling truncation and padding edge cases.
Constraints
- 0 <= len(samples) <= 2000
- 1 <= max_length <= 100000
- 0 <= len(prompt_tokens), len(answer_tokens) for every sample
- The total number of input tokens across all prompts and answers is at most 200000
- Token IDs, `eos_id`, and `pad_id` are integers
Examples
Input: ([{'prompt_tokens': [1, 2], 'answer_tokens': [3, 4]}, {'prompt_tokens': [5], 'answer_tokens': [6]}, {'prompt_tokens': [7, 8, 9], 'answer_tokens': [10]}], 8, 99, 0)
Expected Output: [{'input_ids': [1, 2, 3, 4, 99, 5, 6, 99], 'loss_mask': [0, 0, 1, 1, 1, 0, 1, 1], 'segment_ranges': [[0, 5], [5, 8]], 'answer_start_positions': [2, 6]}, {'input_ids': [7, 8, 9, 10, 99, 0, 0, 0], 'loss_mask': [0, 0, 0, 1, 1, 0, 0, 0], 'segment_ranges': [[0, 5]], 'answer_start_positions': [3]}]
Explanation: Example lengths are 5, 3, and 5. After sorting by length descending with stable tie handling, the two length-5 examples are considered first. The length-3 example fits into the first pack, producing one full pack and one padded pack.
Input: ([{'prompt_tokens': [1, 2, 3, 4], 'answer_tokens': [5, 6]}, {'prompt_tokens': [7], 'answer_tokens': [8, 9]}, {'prompt_tokens': [10, 11], 'answer_tokens': [12]}, {'prompt_tokens': [], 'answer_tokens': [13]}], 6, 99, 0)
Expected Output: [{'input_ids': [7, 8, 9, 99, 13, 99], 'loss_mask': [0, 1, 1, 1, 1, 1], 'segment_ranges': [[0, 4], [4, 6]], 'answer_start_positions': [1, 4]}, {'input_ids': [10, 11, 12, 99, 0, 0], 'loss_mask': [0, 0, 1, 1, 0, 0], 'segment_ranges': [[0, 4]], 'answer_start_positions': [2]}]
Explanation: The first sample is skipped because its example length is 7, which exceeds `max_length = 6`. The remaining examples have lengths 4, 4, and 2. The length-2 sample is packed into the first available pack with enough space.
Input: ([], 5, 99, 0)
Expected Output: []
Explanation: With no samples, there are no packed sequences to return.
Input: ([{'prompt_tokens': [1, 2], 'answer_tokens': []}, {'prompt_tokens': [3], 'answer_tokens': [4, 5]}, {'prompt_tokens': [], 'answer_tokens': []}], 4, 9, 0)
Expected Output: [{'input_ids': [3, 4, 5, 9], 'loss_mask': [0, 1, 1, 1], 'segment_ranges': [[0, 4]], 'answer_start_positions': [1]}, {'input_ids': [1, 2, 9, 9], 'loss_mask': [0, 0, 1, 1], 'segment_ranges': [[0, 3], [3, 4]], 'answer_start_positions': [2, 3]}]
Explanation: An empty answer still contributes the trailing EOS, which is supervised. The fully empty sample becomes just `[eos_id]` with loss mask `[1]`, and it is packed after the length-3 sample.
Hints
- Build each sample's tokens, loss mask, and answer start offset before you start packing.
- To make the packing deterministic, sort by descending example length and then scan existing packs from left to right to find the first one with enough remaining space.