Implement a preprocessing function for supervised fine-tuning data for an autoregressive language model.
You are given a list of tokenized training samples. Each sample contains:
- `prompt_tokens`: a list of token IDs for the user prompt
- `answer_tokens`: a list of token IDs for the target response
You are also given:
- `max_length`: the fixed packed sequence length
- `eos_id`: the end-of-sequence token ID
- `pad_id`: the padding token ID
For each sample, first form a single training example as:
`prompt_tokens + answer_tokens + [eos_id]`
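As a toy illustration (the token IDs here are arbitrary, chosen only for the example), forming one training example is a plain concatenation:

```python
# Hypothetical token IDs for one sample.
prompt_tokens = [5, 6]   # user prompt
answer_tokens = [7, 8]   # target response
eos_id = 2

# One training example: prompt, then answer, then the EOS marker.
example = prompt_tokens + answer_tokens + [eos_id]
print(example)  # [5, 6, 7, 8, 2]
```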
Then pack multiple examples into fixed-length sequences using the following deterministic strategy:
- Compute the length of each example.
- Sort examples by descending length (a stable sort keeps the strategy deterministic for equal lengths).
- Place each example into the first packed sequence that still has enough remaining space; otherwise create a new packed sequence. This is first-fit decreasing bin packing.
- Pad every packed sequence to exactly `max_length`.
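The steps above amount to first-fit decreasing bin packing over example lengths. A minimal sketch of just the assignment step (the function name `pack_lengths` is illustrative, not part of the spec) might look like:

```python
def pack_lengths(lengths, max_length):
    """First-fit decreasing: assign each example index to a bin.

    Returns a list of bins, each a list of example indices, such that
    the summed length per bin never exceeds max_length.
    """
    # Stable sort by descending length keeps ties in input order,
    # so the packing is deterministic.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, remaining = [], []
    for i in order:
        for b, room in enumerate(remaining):
            if lengths[i] <= room:          # first bin with enough space
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:                               # no bin fits: open a new one
            bins.append([i])
            remaining.append(max_length - lengths[i])
    return bins
```

For example, `pack_lengths([3, 5, 2, 4], 7)` places the length-5 and length-2 examples together and the length-4 and length-3 examples together.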
For each packed sequence, return:
- `input_ids`: the packed token IDs, of length `max_length`
- `loss_mask`: a binary array of length `max_length` where prompt tokens and padding are `0`, and answer tokens plus the trailing `eos_id` are `1`
- `segment_ranges`: the `[start, end)` index range of every original sample inside the packed sequence, so downstream code can build a block-diagonal causal attention mask and prevent tokens from one sample from attending to another sample
- `answer_start_positions`: the start index of each answer span, in packed coordinates
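For a single example, the per-token loss mask can be derived directly from the prompt length; a small sketch (the helper name `example_fields` is illustrative):

```python
def example_fields(prompt_tokens, answer_tokens, eos_id):
    # One example's tokens and per-token loss mask:
    # 0 over the prompt, 1 over the answer and the trailing EOS.
    tokens = prompt_tokens + answer_tokens + [eos_id]
    mask = [0] * len(prompt_tokens) + [1] * (len(answer_tokens) + 1)
    return tokens, mask
```

Here `example_fields([5, 6], [7], 2)` yields tokens `[5, 6, 7, 2]` with mask `[0, 0, 1, 1]`; the packed-sequence mask is these per-example masks concatenated, with zeros appended over padding.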
Edge cases:
- If a single sample is longer than `max_length`, handle it explicitly: either truncate it with a clearly defined policy or skip it.
- All indices must refer to positions inside the packed sequence, before padding.
Implement the function and analyze the time and space complexity of your approach.
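One possible end-to-end sketch follows. The names (`pack_sft_samples`, `PackedSequence`) are illustrative, samples are assumed to be dicts with `prompt_tokens`/`answer_tokens` keys, and over-long samples are skipped, which is one of the two allowed edge-case policies:

```python
from typing import List, NamedTuple, Tuple

class PackedSequence(NamedTuple):
    input_ids: List[int]
    loss_mask: List[int]
    segment_ranges: List[Tuple[int, int]]   # [start, end) per original sample
    answer_start_positions: List[int]

def pack_sft_samples(samples, max_length, eos_id, pad_id):
    # Form each example and record where its answer span begins.
    examples = []
    for s in samples:
        tokens = s["prompt_tokens"] + s["answer_tokens"] + [eos_id]
        if len(tokens) > max_length:
            continue                         # skip policy for over-long samples
        examples.append((tokens, len(s["prompt_tokens"])))

    # First-fit decreasing; Python's stable sort keeps the order deterministic.
    examples.sort(key=lambda e: len(e[0]), reverse=True)
    packed = []                              # working state per bin
    for tokens, ans_off in examples:
        target = None
        for p in packed:
            if max_length - len(p["ids"]) >= len(tokens):
                target = p                   # first bin with enough room
                break
        if target is None:
            target = {"ids": [], "mask": [], "segs": [], "ans": []}
            packed.append(target)
        start = len(target["ids"])           # packed coordinates, pre-padding
        target["segs"].append((start, start + len(tokens)))
        target["ans"].append(start + ans_off)
        target["ids"].extend(tokens)
        target["mask"].extend([0] * ans_off + [1] * (len(tokens) - ans_off))

    # Pad every bin to exactly max_length; padding gets loss mask 0.
    out = []
    for p in packed:
        pad = max_length - len(p["ids"])
        out.append(PackedSequence(
            input_ids=p["ids"] + [pad_id] * pad,
            loss_mask=p["mask"] + [0] * pad,
            segment_ranges=p["segs"],
            answer_start_positions=p["ans"],
        ))
    return out
```

Under these assumptions the cost is O(n log n) for the sort plus O(n · b) for the first-fit scan, where n is the number of examples and b the number of packed sequences (O(n²) worst case), and space is O(total tokens emitted). The first-fit scan could be reduced with a tree over bin capacities, but the linear scan matches the strategy as stated.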