Contrast LSTM and Transformer for long sequences
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You must train an autoregressive language model on sequences of length T=8192 tokens, hidden size H=512, batch size B=8.
(1) Compare the per-layer time and memory complexity of an LSTM versus a vanilla Transformer self-attention layer, and estimate, to order of magnitude, the activation memory for attention (assume fp16) at this B, H, T. Show your arithmetic.
(2) Propose an architecture that fits in 24 GB of GPU memory without gradient checkpointing: choose among an LSTM, a Transformer with sliding-window attention (window 512), or a Performer/FlashAttention-like variant, and justify the choice with both asymptotic and constant-factor considerations.
(3) Explain how you would encode position (absolute, relative, ALiBi, rotary) and how that choice affects extrapolation to T=12,000 at inference.
(4) Describe a streaming inference strategy (KV-cache management and chunking) for the chosen model, and discuss its latency implications.
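A minimal sketch of the part (1) arithmetic, assuming fp16 (2 bytes per element), 8 attention heads (the head count is not given in the prompt), and that both the pre-softmax logits and the softmax probabilities are kept for the backward pass; sliding-window and LSTM figures are included for comparison:

```python
# Back-of-envelope activation memory for one layer at B=8, T=8192, H=512, fp16.
B, T, H = 8, 8192, 512
n_heads = 8                     # assumption: H = 512 split into 8 heads of dim 64
bytes_fp16 = 2

# Vanilla self-attention: the dominant term is the T x T score matrix per head,
# per batch element; the pre-softmax logits and the softmax probabilities are
# both typically kept for the backward pass.
scores = B * n_heads * T * T * bytes_fp16            # ~8.6e9 bytes
probs = scores                                        # another ~8.6e9 bytes
qkv = 3 * B * T * H * bytes_fp16                      # Q, K, V projections, ~0.2e9
attn_out = B * T * H * bytes_fp16                     # ~0.07e9
print(f"full attention : {(scores + probs + qkv + attn_out) / 1e9:.1f} GB per layer")

# Sliding-window attention (window w = 512): the score matrix shrinks from T*T to T*w.
w = 512
win_scores = 2 * B * n_heads * T * w * bytes_fp16     # logits + probs, ~1.1e9 bytes
print(f"window w=512   : {(win_scores + qkv + attn_out) / 1e9:.2f} GB per layer")

# LSTM: per-step hidden and cell states only, no T x T term at all.
lstm = 2 * B * T * H * bytes_fp16
print(f"LSTM h/c states: {lstm / 1e6:.0f} MB per layer")
```

Under these assumptions a single vanilla attention layer already needs roughly 17 GB of activations, most of the 24 GB budget, while the windowed variant needs about 1.3 GB per layer and an LSTM about 134 MB.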
Quick Answer: This question tests system-level trade-offs for long-context autoregressive language models: time and activation-memory complexity (LSTM versus self-attention), positional encoding and length extrapolation, and streaming inference with KV-cache management under a fixed GPU budget. The headline: an LSTM layer costs O(T·H²) time and O(T·H) activation memory but is strictly sequential over T, while vanilla self-attention costs O(T²·H) time and O(T²) activation memory; at B=8, T=8192 the fp16 attention-score tensors alone are on the order of 10 GB per layer (with 8 heads), so a sliding-window or FlashAttention-style layer, a relative positional scheme such as ALiBi or rotary, and a bounded rolling KV cache are the practical fit for 24 GB.
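To make part (4) concrete, here is a sketch of streaming inference with chunked prefill and a rolling KV cache for the sliding-window variant. The 1024-token prefill chunk, the helper `append_and_evict`, and the random tensors standing in for the model's real Q/K/V projections are illustrative assumptions, not a definitive implementation (PyTorch, using `torch.nn.functional.scaled_dot_product_attention`):

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; the 512-token window comes from the question, the
# 1024-token prefill chunk is an assumption.
B, n_heads, d_head = 8, 8, 64          # H = n_heads * d_head = 512
window, chunk, prompt_len = 512, 1024, 8192

def append_and_evict(k_cache, v_cache, k_new, v_new):
    """Rolling KV cache: append the newest keys/values, then keep only the last
    `window` positions, so cache memory stays O(window) per layer instead of O(T)."""
    k_cache = torch.cat([k_cache, k_new], dim=2)[:, :, -window:]
    v_cache = torch.cat([v_cache, v_new], dim=2)[:, :, -window:]
    return k_cache, v_cache

# Chunked prefill: feed the prompt in slices of `chunk` tokens (the attention
# computed over each chunk is omitted here; only cache management is shown).
k_cache = torch.empty(B, n_heads, 0, d_head)
v_cache = torch.empty(B, n_heads, 0, d_head)
for start in range(0, prompt_len, chunk):
    n = min(chunk, prompt_len - start)
    k_new = torch.randn(B, n_heads, n, d_head)   # stand-in for the model's K projection
    v_new = torch.randn(B, n_heads, n, d_head)   # stand-in for the model's V projection
    k_cache, v_cache = append_and_evict(k_cache, v_cache, k_new, v_new)

# Decode: one token at a time; each step attends only to the cached window,
# so the per-token attention cost is bounded by `window` rather than growing with T.
for _ in range(16):
    q = torch.randn(B, n_heads, 1, d_head)       # query for the newest token
    k_new = torch.randn(B, n_heads, 1, d_head)
    v_new = torch.randn(B, n_heads, 1, d_head)
    k_cache, v_cache = append_and_evict(k_cache, v_cache, k_new, v_new)
    # All cached positions precede the new token, so no causal mask is needed here.
    out = F.scaled_dot_product_attention(q, k_cache, v_cache)   # (B, n_heads, 1, d_head)
```

Latency-wise, chunked prefill amortizes the prompt over a few large parallel steps (a longer time to first token), while each decode step attends to at most `window` cached positions, so per-token latency and cache memory stay flat even as generation runs out to T=12,000 and beyond.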