Detect stop tokens during streaming inference
Company: Microsoft
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
## Problem: Stop-token / stop-sequence detection in streaming generation
During LLM inference you receive tokens incrementally (streaming). Implement logic that decides when to stop generation based on one or more **stop sequences**.
### Input
- A stream/iterator of generated token IDs (or strings).
- A list of stop sequences, where each stop sequence can be:
  - a single token ID, or
  - a list of token IDs representing a multi-token sequence (e.g., `[A, B, C]`).
### Output / behavior
- As tokens arrive, emit generated tokens **up to but not including** the first occurrence of any stop sequence.
- Stop as soon as any stop sequence is detected.
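To pin down the expected behavior (this is a naive batch reference, not the streaming solution; the function name `truncate_at_stop` is chosen here for illustration), assuming integer token IDs:

```python
from typing import List, Sequence, Union

def truncate_at_stop(
    tokens: List[int],
    stops: List[Union[int, Sequence[int]]],
) -> List[int]:
    """Naive reference: cut at the earliest start of any stop sequence."""
    best = len(tokens)  # default: no stop sequence found, keep everything
    for s in stops:
        seq = list(s) if isinstance(s, (list, tuple)) else [s]
        for i in range(len(tokens) - len(seq) + 1):
            if tokens[i:i + len(seq)] == seq:
                best = min(best, i)
                break
    return tokens[:best]

# truncate_at_stop([7, 7, 1, 2, 1, 9], [[1, 2, 1]]) → [7, 7]
```

This batch version rescans from every position, which is exactly what the streaming requirements below rule out; it only serves to define what "up to but not including" means.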
### Requirements
- Must work for overlapping matches (e.g., stop sequence `[1,2,1]` against a stream ending in `..., 1, 2, 1`).
- Must handle a partial stop sequence at the end of the current buffer that may be completed by future tokens.
- Must be efficient: do not rescan the entire token history on each new token.
### Clarifications
- If multiple stop sequences could match ending at the same position, stopping is immediate regardless of which matched.
- If the stream ends without a stop sequence, return all tokens.
Describe your approach and its complexity, then implement it in your chosen language.
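One possible sketch in Python (the function name `stream_until_stop` is mine, and it assumes integer token IDs): withhold the last `max_len - 1` tokens, where `max_len` is the longest stop sequence, since only those can still be a prefix of a future match; everything older is safe to emit.

```python
from typing import Iterable, Iterator, List, Sequence, Union

def stream_until_stop(
    tokens: Iterable[int],
    stop_sequences: List[Union[int, Sequence[int]]],
) -> Iterator[int]:
    """Yield tokens from the stream, stopping before the first stop sequence."""
    # Normalize: a bare token ID becomes a length-1 sequence.
    stops = [list(s) if isinstance(s, (list, tuple)) else [s] for s in stop_sequences]
    max_len = max((len(s) for s in stops), default=0)
    # Number of trailing tokens withheld because they could still be
    # the prefix of a stop sequence completed by future tokens.
    hold = max(max_len - 1, 0)
    buffer: List[int] = []
    for tok in tokens:
        buffer.append(tok)
        # Does any stop sequence end exactly at the current position?
        for s in stops:
            if len(buffer) >= len(s) and buffer[-len(s):] == s:
                yield from buffer[:len(buffer) - len(s)]  # emit prefix only
                return
        # Tokens older than the hold-back window can no longer be part
        # of a future match, so they are safe to emit now.
        if len(buffer) > hold:
            emit = len(buffer) - hold
            yield from buffer[:emit]
            del buffer[:emit]
    # Stream ended with no stop sequence: flush the withheld tail.
    yield from buffer
```

Per-token work is O(total length of the stop sequences), independent of stream history, and the buffer never exceeds `max_len` tokens. With many or long stop sequences, the per-token suffix checks could be replaced by an Aho-Corasick automaton for amortized O(1) matching per token.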
Quick Answer: This question evaluates streaming sequence detection and pattern matching: online buffer management and correct handling of overlapping and partial multi-token stop sequences during LLM inference. It falls under the Coding & Algorithms category, with a domain focus on streaming algorithms for machine learning inference.