Train a Long-Context Autoregressive LM (T = 8192, H = 512, B = 8)
You are training an autoregressive language model with:
- Sequence length T = 8192 tokens
- Hidden size H = 512
- Batch size B = 8
- Mixed precision (fp16)
Assume a standard Transformer uses 8 attention heads (so d_head = H / n_heads = 64) unless otherwise noted.
Tasks
- Complexity and activation memory
  - Compare the per-layer time and memory complexity of an LSTM layer vs. a vanilla Transformer self-attention layer.
  - Estimate, to order of magnitude, the activation memory for a vanilla attention layer at the B, H, T above (fp16). Show your arithmetic (a worked sketch follows this list).
- Architecture selection under a 24 GB GPU budget (no gradient checkpointing)
  - Choose one: an LSTM, a Transformer with sliding-window attention (window W = 512), or a Performer/FlashAttention-like variant.
  - Propose a concrete depth and justify the choice using asymptotic and constant-factor arguments, including a rough activation-memory budget per layer and in total (a windowed-attention budget sketch follows this list).
- Positional encoding and extrapolation
  - Explain how you would encode position (absolute learned/sinusoidal, relative bias, ALiBi, rotary/RoPE).
  - Discuss how your choice affects extrapolation to T = 12,000 at inference (a RoPE sketch follows this list).
- Streaming inference plan
  - Describe a streaming inference strategy for your chosen model: KV-cache management and chunking (a decode-loop sketch follows this list).
  - Explain the latency implications (prefill vs. per-token decode).
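Reference sketches

For the activation-memory estimate in the first task, here is a minimal arithmetic sketch. It assumes the dominant per-layer activations are the Q/K/V projections, the two T × T attention tensors (logits and softmax probabilities), and the attention output, all stored in fp16; the exact set of saved tensors depends on the implementation, so treat this as an order-of-magnitude bound rather than an exact figure.

```python
# Order-of-magnitude activation memory for one vanilla attention layer
# at B = 8, T = 8192, H = 512, n_heads = 8, fp16 (2 bytes per element).
B, T, H, n_heads, bytes_fp16 = 8, 8192, 512, 8, 2

qkv = 3 * B * T * H * bytes_fp16              # Q, K, V projections
scores = B * n_heads * T * T * bytes_fp16     # T x T attention logits per head
probs = B * n_heads * T * T * bytes_fp16      # softmax output, same shape
out = B * T * H * bytes_fp16                  # attention output

gib = 1024 ** 3
print(f"QKV:           {qkv / gib:.2f} GiB")                        # ~0.19 GiB
print(f"attn logits:   {scores / gib:.2f} GiB")                     # 8 GiB
print(f"attn probs:    {probs / gib:.2f} GiB")                      # 8 GiB
print(f"total (rough): {(qkv + scores + probs + out) / gib:.2f} GiB")  # ~16.25 GiB
```

The two T × T tensors dominate: at T = 8192 each is 8 GiB in fp16, so the quadratic-in-T term, not the H-dependent projections, drives the budget.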
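For the 24 GB budgeting argument in the second task, the same arithmetic with a sliding window of W = 512 keys per query (matching the stated window, and assuming the windowed scores are actually materialized) gives a much smaller per-layer score tensor:

```python
# Per-layer attention-score memory: sliding-window (W = 512) vs. full attention,
# assuming each query attends to at most W keys and scores are stored in fp16.
B, T, W, n_heads, bytes_fp16 = 8, 8192, 512, 8, 2

windowed_scores = B * n_heads * T * W * bytes_fp16   # (T x W) instead of (T x T)
full_scores = B * n_heads * T * T * bytes_fp16

print(f"windowed logits per layer: {windowed_scores / 1024**2:.0f} MiB")   # 512 MiB
print(f"full logits per layer:     {full_scores / 1024**3:.0f} GiB")       # 8 GiB
print(f"reduction factor:          {full_scores / windowed_scores:.0f}x")  # 16x
```

The factor of T / W = 16 is what makes a reasonably deep windowed model plausible under 24 GB, whereas the materialized score and probability tensors of just two full-attention layers (roughly 2 × 16 GiB) already exceed the budget without checkpointing or a fused kernel.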
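For the positional-encoding task, here is a minimal NumPy sketch of rotary embeddings (RoPE) in the split-half layout; `rope` is a hypothetical helper written for this exercise, not a library function. The assertion at the end checks the relative-position property usually cited when discussing extrapolation: shifting all positions by a constant leaves the attention-score matrix unchanged.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (T, d_head); d_head must be even."""
    half = x.shape[-1] // 2
    # Per-pair rotation frequencies, theta_m = base^(-m / (d_head / 2)).
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1[m], x2[m]) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Attention scores depend only on position offsets, not absolute positions.
rng = np.random.default_rng(0)
T, d_head = 8, 64
q, k = rng.standard_normal((T, d_head)), rng.standard_normal((T, d_head))
pos = np.arange(T, dtype=np.float64)
scores = rope(q, pos) @ rope(k, pos).T
shifted = rope(q, pos + 1000) @ rope(k, pos + 1000).T
assert np.allclose(scores, shifted)
```

The property makes position information relative rather than absolute, but going from T = 8192 to T = 12,000 still exposes low-frequency rotation angles never seen in training, which is why RoPE extrapolation typically needs interpolation or scaling tricks, while ALiBi-style biases are often preferred when length extrapolation is the priority.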
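For the streaming-inference task, here is a structural sketch of chunked prefill plus a bounded (sliding-window) KV cache. `model.forward(tokens, cache)` is a hypothetical interface assumed to return next-token logits while appending new per-layer K/V entries to the cache; nothing here is tied to a specific framework.

```python
from collections import deque

class SlidingKVCache:
    """Keeps at most `window` past (K, V) entries per layer; the oldest are evicted."""
    def __init__(self, n_layers, window=512):
        self.layers = [deque(maxlen=window) for _ in range(n_layers)]

    def append(self, layer_idx, k, v):
        self.layers[layer_idx].append((k, v))   # deque(maxlen=...) drops the oldest entry

def stream_generate(model, prompt_tokens, n_new, n_layers, window=512, chunk=1024):
    """Chunked prefill over the prompt, then one-token-at-a-time decode."""
    cache = SlidingKVCache(n_layers, window)

    # Prefill: process the prompt in chunks so the peak activation footprint
    # is bounded by `chunk` rather than the full prompt length.
    logits = None
    for start in range(0, len(prompt_tokens), chunk):
        logits = model.forward(prompt_tokens[start:start + chunk], cache)

    # Decode: each step touches only the new token plus at most `window` cached
    # keys/values, so per-token cost is O(window), independent of total length.
    generated = []
    for _ in range(n_new):
        next_token = int(logits.argmax())       # greedy decoding for simplicity
        generated.append(next_token)
        logits = model.forward([next_token], cache)
    return generated
```

Prefill cost grows with prompt length (amortized over chunks), while steady-state decode cost is fixed per token once the cache is warm, which is the latency split the final sub-task asks you to discuss.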