How do I approach ML System Design interview questions?

ML System Design questions require a solid understanding of core concepts as well as practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at Adobe.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Adobe during technical interviews.

Optimize LLM Training and Serving | Adobe Interview Question

Quick Overview

This question evaluates a candidate's competency in performance analysis and system-level optimization for Transformer-based large language models, covering memory vs compute bottlenecks, attention-kernel trade-offs (e.g., FlashAttention concepts), GPU profiling, distributed training throughput, and production serving optimizations.

You are working on the training and deployment stack for a Transformer-based large language model. Explain how you would reason about performance bottlenecks and optimization opportunities across training, inference, and production serving. Address the following topics:

Why Transformer workloads are often memory-bound, and how memory bandwidth differs from compute throughput as a bottleneck.
The cost of materializing the attention matrix in high-bandwidth memory.
Hardware FLOPs utilization and model FLOPs utilization, including how to interpret them during training.
GPU profiling approaches, including identifying low utilization, memory stalls, communication bottlenecks, and kernel launch overhead.
Kernel fusion, fused attention kernels, and CUDA graph-style launch reduction.
How FlashAttention works internally, including tiling, SRAM usage, online softmax, and avoiding materialization of the full attention matrix in high-bandwidth memory.
Other attention and serving optimizations, including multi-query attention, grouped-query attention, sparse attention, linear attention, paged attention, and quantized KV caches.
Distributed training bottlenecks and throughput optimization techniques.
Inference optimization using compilers and runtimes such as TensorRT-style engines, including graph optimization, operator fusion, and mixed-precision inference.
Serving architecture considerations such as paged attention serving, batching, cache hit rates, end-to-end latency, offline feature generation, online feature store lookups, long-tail fallback systems, lightweight student models, and serving metrics.
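
As a starting point for the FlashAttention topic above, the online-softmax idea can be sketched in plain Python for a single query vector. This is only an illustration under stated assumptions, not the actual FlashAttention kernel: the real kernel runs on the GPU, tiles queries as well as keys and values, and keeps its tiles in on-chip SRAM. The function names (naive_attention, tiled_attention) and the block_size parameter are hypothetical and chosen for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: materialize every score, then softmax, then sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def tiled_attention(q, keys, values, block_size=2):
    """Visit keys/values one block at a time using an online softmax.

    Only a running max, a running normalizer, and an unnormalized output
    accumulator are kept, so the full row of scores is never stored.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. The point of the tiled version is that its working set depends on block_size rather than on sequence length, which is the property that lets a fused kernel avoid writing the full attention matrix to high-bandwidth memory.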
