Contrast LSTM and Transformer for long sequences
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You must train an autoregressive language model on sequences of length T=8192 tokens, hidden size H=512, batch size B=8.
(1) Compare the per-layer time and memory complexity of an LSTM versus a vanilla Transformer self-attention layer, and estimate, to order of magnitude, the activation memory for attention (assume fp16) at this B, H, T. Show your arithmetic.
(2) Propose an architecture that fits in 24 GB of GPU memory without gradient checkpointing: choose among an LSTM, a Transformer with sliding-window attention (window 512), or a Performer/FlashAttention-like variant, and justify the choice with both asymptotic and constant-factor considerations.
(3) Explain how you would encode position (absolute, relative, ALiBi, rotary) and how that choice affects extrapolation to T=12,000 at inference.
(4) Describe a streaming inference strategy (KV-cache management and chunking) for the chosen model, and discuss its latency implications.
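A minimal sketch of the part (1) arithmetic, assuming fp16 (2 bytes per element), 8 attention heads (the head count is not given in the prompt), and that both the pre-softmax logits and the softmax probabilities are kept for the backward pass; sliding-window and LSTM figures are included for comparison:

```python
# Back-of-envelope activation memory for one layer at B=8, T=8192, H=512, fp16.
B, T, H = 8, 8192, 512
n_heads = 8                     # assumption: H = 512 split into 8 heads of dim 64
bytes_fp16 = 2

# Vanilla self-attention: the dominant term is the T x T score matrix per head,
# per batch element; the pre-softmax logits and the softmax probabilities are
# both typically kept for the backward pass.
scores = B * n_heads * T * T * bytes_fp16            # ~8.6e9 bytes
probs = scores                                        # another ~8.6e9 bytes
qkv = 3 * B * T * H * bytes_fp16                      # Q, K, V projections, ~0.2e9
attn_out = B * T * H * bytes_fp16                     # ~0.07e9
print(f"full attention : {(scores + probs + qkv + attn_out) / 1e9:.1f} GB per layer")

# Sliding-window attention (window w = 512): the score matrix shrinks from T*T to T*w.
w = 512
win_scores = 2 * B * n_heads * T * w * bytes_fp16     # logits + probs, ~1.1e9 bytes
print(f"window w=512   : {(win_scores + qkv + attn_out) / 1e9:.2f} GB per layer")

# LSTM: per-step hidden and cell states only, no T x T term at all.
lstm = 2 * B * T * H * bytes_fp16
print(f"LSTM h/c states: {lstm / 1e6:.0f} MB per layer")
```

Under these assumptions a single vanilla attention layer already needs roughly 17 GB of activations, most of the 24 GB budget, while the windowed variant needs about 1.3 GB per layer and an LSTM about 134 MB.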
Quick Answer: This question tests system-level trade-offs for long-context autoregressive language models: time and activation-memory complexity (LSTM versus self-attention), positional encoding and length extrapolation, and streaming inference with KV-cache management under a fixed GPU budget. The headline: an LSTM layer costs O(T·H²) time and O(T·H) activation memory but is strictly sequential over T, while vanilla self-attention costs O(T²·H) time and O(T²) activation memory; at B=8, T=8192 the fp16 attention-score tensors alone are on the order of 10 GB per layer (with 8 heads), so a sliding-window or FlashAttention-style layer, a relative positional scheme such as ALiBi or rotary, and a bounded rolling KV cache are the practical fit for 24 GB.
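To make part (4) concrete, here is a sketch of streaming inference with chunked prefill and a rolling KV cache for the sliding-window variant. The 1024-token prefill chunk, the helper `append_and_evict`, and the random tensors standing in for the model's real Q/K/V projections are illustrative assumptions, not a definitive implementation (PyTorch, using `torch.nn.functional.scaled_dot_product_attention`):

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; the 512-token window comes from the question, the
# 1024-token prefill chunk is an assumption.
B, n_heads, d_head = 8, 8, 64          # H = n_heads * d_head = 512
window, chunk, prompt_len = 512, 1024, 8192

def append_and_evict(k_cache, v_cache, k_new, v_new):
    """Rolling KV cache: append the newest keys/values, then keep only the last
    `window` positions, so cache memory stays O(window) per layer instead of O(T)."""
    k_cache = torch.cat([k_cache, k_new], dim=2)[:, :, -window:]
    v_cache = torch.cat([v_cache, v_new], dim=2)[:, :, -window:]
    return k_cache, v_cache

# Chunked prefill: feed the prompt in slices of `chunk` tokens (the attention
# computed over each chunk is omitted here; only cache management is shown).
k_cache = torch.empty(B, n_heads, 0, d_head)
v_cache = torch.empty(B, n_heads, 0, d_head)
for start in range(0, prompt_len, chunk):
    n = min(chunk, prompt_len - start)
    k_new = torch.randn(B, n_heads, n, d_head)   # stand-in for the model's K projection
    v_new = torch.randn(B, n_heads, n, d_head)   # stand-in for the model's V projection
    k_cache, v_cache = append_and_evict(k_cache, v_cache, k_new, v_new)

# Decode: one token at a time; each step attends only to the cached window,
# so the per-token attention cost is bounded by `window` rather than growing with T.
for _ in range(16):
    q = torch.randn(B, n_heads, 1, d_head)       # query for the newest token
    k_new = torch.randn(B, n_heads, 1, d_head)
    v_new = torch.randn(B, n_heads, 1, d_head)
    k_cache, v_cache = append_and_evict(k_cache, v_cache, k_new, v_new)
    # All cached positions precede the new token, so no causal mask is needed here.
    out = F.scaled_dot_product_attention(q, k_cache, v_cache)   # (B, n_heads, 1, d_head)
```

Latency-wise, chunked prefill amortizes the prompt over a few large parallel steps (a longer time to first token), while each decode step attends to at most `window` cached positions, so per-token latency and cache memory stay flat even as generation runs out to T=12,000 and beyond.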