Optimize LLM Training and Serving
Company: Adobe
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
You are working on the training and deployment stack for a Transformer-based large language model. Explain how you would reason about performance bottlenecks and optimization opportunities across training, inference, and production serving. Address the following topics:
1. Why Transformer workloads are often memory-bound, and how memory bandwidth differs from compute throughput as a bottleneck.
2. The cost of materializing the attention matrix in high-bandwidth memory.
3. Hardware FLOPs utilization (HFU) and model FLOPs utilization (MFU), including how to interpret them during training.
4. GPU profiling approaches, including identifying low utilization, memory stalls, communication bottlenecks, and kernel launch overhead.
5. Kernel fusion, fused attention kernels, and CUDA graph-style launch reduction.
6. How FlashAttention works internally, including tiling, SRAM usage, online softmax, and avoiding materialization of the full attention matrix in high-bandwidth memory.
7. Other attention and serving optimizations, including multi-query attention, grouped-query attention, sparse attention, linear attention, paged attention, and quantized KV caches.
8. Distributed training bottlenecks and throughput optimization techniques.
9. Inference optimization using compilers and runtimes such as TensorRT-style engines, graph optimization, operator fusion, and mixed precision inference.
10. Serving architecture considerations such as paged attention serving, batching, cache hit rates, end-to-end latency, offline feature generation, online feature store lookups, long-tail fallback systems, lightweight student models, and serving metrics.
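Topics 1 to 3 lend themselves to quick back-of-the-envelope arithmetic. The sketch below is one way to frame it, not a definitive method. The helper names, the hypothetical accelerator (300 TFLOP/s peak, 2 TB/s memory bandwidth), and the model and throughput figures are all assumptions chosen for illustration. The MFU estimate uses the common approximation of about 6 FLOPs per parameter per token for a dense Transformer's forward and backward pass, which ignores attention FLOPs.

```python
# Back-of-the-envelope estimates for topics 1-3.
# Every hardware and model number here is an illustrative assumption.

def attention_matrix_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes needed to store the full seq_len x seq_len score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """Roofline test: an op is memory-bound when its arithmetic intensity
    (FLOPs per byte moved) is below the ridge point peak_flops / peak_bandwidth."""
    return intensity < peak_flops / peak_bandwidth

def model_flops_utilization(n_params, tokens_per_second, peak_flops):
    """Approximate MFU with ~6 FLOPs per parameter per token
    (forward + backward, attention FLOPs ignored)."""
    return 6 * n_params * tokens_per_second / peak_flops

PEAK_FLOPS = 300e12      # assumed peak throughput, FLOP/s
PEAK_BANDWIDTH = 2e12    # assumed memory bandwidth, bytes/s

# fp16 scores for batch 1, 32 heads, 8192 tokens.
score_bytes = attention_matrix_bytes(batch=1, heads=32, seq_len=8192)

# Elementwise add of two fp16 vectors: 1 FLOP per element,
# 3 values of 2 bytes moved per element (two reads, one write).
elementwise_intensity = 1 / 6

# Square fp16 matmul with n = 4096: 2*n^3 FLOPs over 3*n^2 values of 2 bytes.
matmul_intensity = 4096 / 3

print(score_bytes / 2**30, "GiB for the score matrix")
print("elementwise memory-bound:", is_memory_bound(elementwise_intensity, PEAK_FLOPS, PEAK_BANDWIDTH))
print("matmul memory-bound:", is_memory_bound(matmul_intensity, PEAK_FLOPS, PEAK_BANDWIDTH))
print("MFU:", model_flops_utilization(7e9, 3000, PEAK_FLOPS))
```

HFU is computed the same way but also counts FLOPs the hardware actually executes beyond the model's nominal cost, such as activation recomputation, so HFU is at least as large as MFU for the same run.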
Quick Answer: This question evaluates a candidate's ability to analyze performance and optimize Transformer-based large language models at the system level, covering memory vs compute bottlenecks, attention-kernel trade-offs (e.g., FlashAttention concepts), GPU profiling, distributed training throughput, and production serving optimizations.
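The online-softmax idea behind FlashAttention (topic 6) can be illustrated in a few lines. The following is a minimal single-query sketch in pure Python, not the actual kernel: the function names and `tile_size` parameter are hypothetical, the usual 1/sqrt(d) scaling is omitted, and a real implementation tiles queries as well and keeps each tile in on-chip SRAM. What it does show is that keys and values can be processed one tile at a time while carrying only a running maximum, a running normalizer, and an output accumulator, so the full row of attention scores is never stored.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute all scores first, then softmax, then weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, tile_size=2):
    """Same result, visiting keys/values one tile at a time (online softmax)."""
    dim = len(values[0])
    running_max = float("-inf")   # max score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (this factor is 0.0 on the first tile, when running_max is -inf).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.7], [2.0, 0.5], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [1.0, 1.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile_size=2))
```

The two functions should agree up to floating-point rounding for any tile size, which is the property that lets a fused kernel trade extra rescaling arithmetic for far fewer reads and writes to high-bandwidth memory.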