PracHub

Optimize LLM Training and Serving

Last updated: May 30, 2026

Quick Overview

This question evaluates a candidate's competency in performance analysis and system-level optimization for Transformer-based large language models, covering memory vs compute bottlenecks, attention-kernel trade-offs (e.g., FlashAttention concepts), GPU profiling, distributed training throughput, and production serving optimizations.



Company: Adobe

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite



May 25, 2026, 12:00 AM

You are working on the training and deployment stack for a Transformer-based large language model. Explain how you would reason about performance bottlenecks and optimization opportunities across training, inference, and production serving. Address the following topics:

  1. Why Transformer workloads are often memory-bound, and how memory bandwidth differs from compute throughput as a bottleneck.
  2. The cost of materializing the attention matrix in high-bandwidth memory.
  3. Hardware FLOPs utilization and model FLOPs utilization, including how to interpret them during training.
  4. GPU profiling approaches, including identifying low utilization, memory stalls, communication bottlenecks, and kernel launch overhead.
  5. Kernel fusion, fused attention kernels, and CUDA graph-style launch reduction.
  6. How FlashAttention works internally, including tiling, SRAM usage, online softmax, and avoiding materialization of the full attention matrix in high-bandwidth memory.
  7. Other attention and serving optimizations, including multi-query attention, grouped-query attention, sparse attention, linear attention, paged attention, and quantized KV caches.
  8. Distributed training bottlenecks and throughput optimization techniques.
  9. Inference optimization using compilers and runtimes such as TensorRT-style engines, graph optimization, operator fusion, and mixed precision inference.
  10. Serving architecture considerations such as paged attention serving, batching, cache hit rates, end-to-end latency, offline feature generation, online feature store lookups, long-tail fallback systems, lightweight student models, and serving metrics.
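
As one illustration of topic 6, the online softmax idea can be sketched in plain Python for a single query vector. This is a minimal sketch under stated assumptions, not the FlashAttention kernel itself: the function names `naive_attention` and `tiled_attention`, the `block_size` parameter, and the list-of-lists layout are all hypothetical, and a real kernel would tile queries as well and keep each block in on-chip SRAM. The point it shows is that keys and values can be consumed block by block while carrying only a running max, a running normalizer, and an unnormalized output accumulator, so the full row of attention scores is never stored at once.

```python
import math


def naive_attention(q, keys, values):
    """Reference: computes every score for the query before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Online softmax over K/V blocks (hypothetical helper for illustration).

    Only a running max, a running normalizer, and an unnormalized
    accumulator are kept between blocks.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale state accumulated under the old max to the new max.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Calling `tiled_attention(q, keys, values, block_size=b)` should agree with `naive_attention(q, keys, values)` up to floating-point rounding for any positive `b`; the block size changes only how much intermediate state is live at once, which is the property a fused attention kernel exploits to avoid writing the full attention matrix to high-bandwidth memory.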
