Train a Long-Context Autoregressive LM (T = 8192, H = 512, B = 8)
You are training an autoregressive language model with:
- Sequence length T = 8192 tokens
- Hidden size H = 512
- Batch size B = 8
- Mixed precision (fp16)
Assume a standard Transformer uses 8 attention heads (so d_head = H / n_heads = 64) unless otherwise noted.
Tasks
- Complexity and activation memory
  - Compare the per-layer time and memory complexity of an LSTM layer vs. a vanilla Transformer self-attention layer.
  - Estimate, to order of magnitude, the activation memory for a vanilla attention layer at the B, H, T above (fp16). Show your arithmetic (a worked sketch follows this list).
- Architecture selection under a 24 GB GPU budget (no gradient checkpointing)
  - Choose one: an LSTM, a Transformer with sliding-window attention (window W = 512), or a Performer/FlashAttention-like variant.
  - Propose a concrete depth and justify the choice using asymptotic and constant-factor arguments, including a rough activation-memory budget per layer and in total (a windowed-attention budget sketch follows this list).
- Positional encoding and extrapolation
  - Explain how you would encode position (absolute learned/sinusoidal, relative bias, ALiBi, rotary/RoPE).
  - Discuss how your choice affects extrapolation to T = 12,000 at inference (a RoPE sketch follows this list).
- Streaming inference plan
  - Describe a streaming inference strategy for your chosen model: KV-cache management and chunking (a decode-loop sketch follows this list).
  - Explain the latency implications (prefill vs. per-token decode).
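Reference sketches

For the activation-memory estimate in the first task, here is a minimal arithmetic sketch. It assumes the dominant per-layer activations are the Q/K/V projections, the two T × T attention tensors (logits and softmax probabilities), and the attention output, all stored in fp16; the exact set of saved tensors depends on the implementation, so treat this as an order-of-magnitude bound rather than an exact figure.

```python
# Order-of-magnitude activation memory for one vanilla attention layer
# at B = 8, T = 8192, H = 512, n_heads = 8, fp16 (2 bytes per element).
B, T, H, n_heads, bytes_fp16 = 8, 8192, 512, 8, 2

qkv = 3 * B * T * H * bytes_fp16              # Q, K, V projections
scores = B * n_heads * T * T * bytes_fp16     # T x T attention logits per head
probs = B * n_heads * T * T * bytes_fp16      # softmax output, same shape
out = B * T * H * bytes_fp16                  # attention output

gib = 1024 ** 3
print(f"QKV:           {qkv / gib:.2f} GiB")                        # ~0.19 GiB
print(f"attn logits:   {scores / gib:.2f} GiB")                     # 8 GiB
print(f"attn probs:    {probs / gib:.2f} GiB")                      # 8 GiB
print(f"total (rough): {(qkv + scores + probs + out) / gib:.2f} GiB")  # ~16.25 GiB
```

The two T × T tensors dominate: at T = 8192 each is 8 GiB in fp16, so the quadratic-in-T term, not the H-dependent projections, drives the budget.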
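For the 24 GB budgeting argument in the second task, the same arithmetic with a sliding window of W = 512 keys per query (matching the stated window, and assuming the windowed scores are actually materialized) gives a much smaller per-layer score tensor:

```python
# Per-layer attention-score memory: sliding-window (W = 512) vs. full attention,
# assuming each query attends to at most W keys and scores are stored in fp16.
B, T, W, n_heads, bytes_fp16 = 8, 8192, 512, 8, 2

windowed_scores = B * n_heads * T * W * bytes_fp16   # (T x W) instead of (T x T)
full_scores = B * n_heads * T * T * bytes_fp16

print(f"windowed logits per layer: {windowed_scores / 1024**2:.0f} MiB")   # 512 MiB
print(f"full logits per layer:     {full_scores / 1024**3:.0f} GiB")       # 8 GiB
print(f"reduction factor:          {full_scores / windowed_scores:.0f}x")  # 16x
```

The factor of T / W = 16 is what makes a reasonably deep windowed model plausible under 24 GB, whereas the materialized score and probability tensors of just two full-attention layers (roughly 2 × 16 GiB) already exceed the budget without checkpointing or a fused kernel.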
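For the positional-encoding task, here is a minimal NumPy sketch of rotary embeddings (RoPE) in the split-half layout; `rope` is a hypothetical helper written for this exercise, not a library function. The assertion at the end checks the relative-position property usually cited when discussing extrapolation: shifting all positions by a constant leaves the attention-score matrix unchanged.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (T, d_head); d_head must be even."""
    half = x.shape[-1] // 2
    # Per-pair rotation frequencies, theta_m = base^(-m / (d_head / 2)).
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1[m], x2[m]) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Attention scores depend only on position offsets, not absolute positions.
rng = np.random.default_rng(0)
T, d_head = 8, 64
q, k = rng.standard_normal((T, d_head)), rng.standard_normal((T, d_head))
pos = np.arange(T, dtype=np.float64)
scores = rope(q, pos) @ rope(k, pos).T
shifted = rope(q, pos + 1000) @ rope(k, pos + 1000).T
assert np.allclose(scores, shifted)
```

The property makes position information relative rather than absolute, but going from T = 8192 to T = 12,000 still exposes low-frequency rotation angles never seen in training, which is why RoPE extrapolation typically needs interpolation or scaling tricks, while ALiBi-style biases are often preferred when length extrapolation is the priority.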
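For the streaming-inference task, here is a structural sketch of chunked prefill plus a bounded (sliding-window) KV cache. `model.forward(tokens, cache)` is a hypothetical interface assumed to return next-token logits while appending new per-layer K/V entries to the cache; nothing here is tied to a specific framework.

```python
from collections import deque

class SlidingKVCache:
    """Keeps at most `window` past (K, V) entries per layer; the oldest are evicted."""
    def __init__(self, n_layers, window=512):
        self.layers = [deque(maxlen=window) for _ in range(n_layers)]

    def append(self, layer_idx, k, v):
        self.layers[layer_idx].append((k, v))   # deque(maxlen=...) drops the oldest entry

def stream_generate(model, prompt_tokens, n_new, n_layers, window=512, chunk=1024):
    """Chunked prefill over the prompt, then one-token-at-a-time decode."""
    cache = SlidingKVCache(n_layers, window)

    # Prefill: process the prompt in chunks so the peak activation footprint
    # is bounded by `chunk` rather than the full prompt length.
    logits = None
    for start in range(0, len(prompt_tokens), chunk):
        logits = model.forward(prompt_tokens[start:start + chunk], cache)

    # Decode: each step touches only the new token plus at most `window` cached
    # keys/values, so per-token cost is O(window), independent of total length.
    generated = []
    for _ in range(n_new):
        next_token = int(logits.argmax())       # greedy decoding for simplicity
        generated.append(next_token)
        logits = model.forward([next_token], cache)
    return generated
```

Prefill cost grows with prompt length (amortized over chunks), while steady-state decode cost is fixed per token once the cache is warm, which is the latency split the final sub-task asks you to discuss.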