PracHub

Contrast LSTM and Transformer for long sequences

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of sequence-model architectures and system-level trade-offs for long-context autoregressive language models, covering computational and activation memory complexity, positional encoding and extrapolation behavior, and streaming inference and KV-cache management within GPU constraints.

Company: TikTok

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

You must train an autoregressive language model on sequences of length T=8192 tokens, with hidden size H=512 and batch size B=8. (1) Compare the per-layer time and memory complexity of an LSTM layer vs. a vanilla Transformer self-attention layer; estimate, to an order of magnitude, the activation memory for attention (assume fp16) at these values of B, H, and T. Show your arithmetic. (2) Propose an architecture that fits in 24 GB of GPU memory without gradient checkpointing: pick between an LSTM, a Transformer with sliding-window attention (window 512), or a Performer/FlashAttention-like variant; justify the choice with asymptotic and constant-factor considerations. (3) Explain how you would encode position (absolute, relative, ALiBi, rotary) and how that choice affects extrapolation to T=12,000 at inference. (4) Describe a streaming inference strategy (KV-cache management and chunking) for the chosen model and its latency implications.

Oct 13, 2025, 9:49 PM

Train a Long-Context Autoregressive LM (T = 8192, H = 512, B = 8)

You are training an autoregressive language model with:

  • Sequence length T = 8192 tokens
  • Hidden size H = 512
  • Batch size B = 8
  • Mixed precision fp16

Assume a standard Transformer uses 8 attention heads (so d_head = H / n_heads = 64) unless otherwise noted.
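To get a feel for the scale this configuration implies, the sketch below counts the bytes needed to hold one layer's attention score matrix in fp16. It is a rough illustration, not a full activation budget: it assumes the scores are materialized as a dense tensor of shape (B, n_heads, T, T), ignores the softmax output, dropout masks, and MLP activations, and treats sliding-window attention as each query scoring at most W = 512 keys. The helper name `attention_score_bytes` is made up for this example.

```python
def attention_score_bytes(batch, n_heads, q_len, k_len, bytes_per_elem=2):
    """Bytes for one dense attention score tensor (fp16 = 2 bytes per element)."""
    return batch * n_heads * q_len * k_len * bytes_per_elem


# Values taken from the problem statement.
B, T, H, N_HEADS, W = 8, 8192, 512, 8, 512
D_HEAD = H // N_HEADS  # 64

GIB = 1024 ** 3
MIB = 1024 ** 2

# Vanilla attention: every query scores every key, so the tensor is T x T per head.
full_scores = attention_score_bytes(B, N_HEADS, T, T)

# Sliding-window attention (assumption: at most W keys scored per query).
window_scores = attention_score_bytes(B, N_HEADS, T, W)

# One of Q, K, or V: shape (B, T, H) in fp16.
qkv_each = B * T * H * 2

print(f"d_head                 = {D_HEAD}")
print(f"full score matrix      = {full_scores / GIB:.1f} GiB per layer")
print(f"windowed score matrix  = {window_scores / MIB:.0f} MiB per layer")
print(f"each of Q, K, V        = {qkv_each / MIB:.0f} MiB per layer")
```

Under these assumptions the dense score tensor alone is 8 x 8 x 8192 x 8192 x 2 bytes = 8 GiB per layer, while Q, K, and V are 64 MiB each, so the T x T term dominates by roughly two orders of magnitude. Storing the softmax probabilities for the backward pass would add a second tensor of the same size.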

Tasks

  1. Complexity and activation memory
    • Compare per-layer time and memory complexity of an LSTM layer vs a vanilla Transformer self-attention layer.
    • Estimate, to order-of-magnitude, the activation memory for a vanilla attention layer at B, H, T above (fp16). Show your arithmetic.
  2. Architecture selection under a 24 GB GPU budget (no gradient checkpointing)
    • Choose one: LSTM, Transformer with sliding-window attention (window W = 512), or a Performer/FlashAttention-like variant.
    • Propose a concrete depth and justify the choice using asymptotic and constant-factor arguments, including a rough activation-memory budget per layer and total.
  3. Positional encoding and extrapolation
    • Explain how you would encode position (absolute learned/sinusoidal, relative bias, ALiBi, rotary/RoPE).
    • Discuss how your choice affects extrapolation to T = 12,000 at inference.
  4. Streaming inference plan
    • Describe a streaming inference strategy for your chosen model: KV-cache management and chunking.
    • Explain the latency implications (prefill vs per-token decode).
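For task 4, one possible shape of the cache bookkeeping is sketched below. It assumes the sliding-window option (W = 512) was chosen, so the cache only ever needs the most recent W key/value pairs per layer. The names `SlidingWindowKVCache`, `prefill_in_chunks`, and `kv_cache_bytes` are hypothetical, and plain Python objects stand in for real key/value tensors; a production implementation would typically use a preallocated ring buffer on the GPU instead.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps only the most recent `window` (key, value) pairs for one layer."""

    def __init__(self, window):
        self.window = window
        # deque(maxlen=...) silently evicts the oldest entry once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


def prefill_in_chunks(cache, keys, values, chunk_size):
    """Feed a long prompt into the cache chunk by chunk; returns the chunk count."""
    n_chunks = 0
    for start in range(0, len(keys), chunk_size):
        stop = start + chunk_size
        for k, v in zip(keys[start:stop], values[start:stop]):
            cache.append(k, v)
        n_chunks += 1
    return n_chunks


def kv_cache_bytes(batch, cached_len, hidden, bytes_per_elem=2):
    """Per-layer cache size: one K and one V tensor of shape (batch, cached_len, hidden)."""
    return 2 * batch * cached_len * hidden * bytes_per_elem


# Toy usage: integers stand in for per-token key/value tensors.
cache = SlidingWindowKVCache(window=4)
tokens = list(range(10))
chunks = prefill_in_chunks(cache, tokens, tokens, chunk_size=3)
print(chunks, list(cache.keys))

# Per-layer fp16 cache at batch 1, H = 512: windowed vs. the full 12,000-token context.
print(kv_cache_bytes(1, 512, 512), kv_cache_bytes(1, 12_000, 512))
```

With a fixed window, the per-layer cache stays at 2 x 512 x 512 x 2 bytes = 1 MiB per sequence regardless of how long the stream runs, and each decode step attends over at most W cached positions. Prefill cost still grows with prompt length, which is why the prompt is processed in chunks rather than token by token.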
