How would you optimize large-scale training/inference?
Company: NVIDIA
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
You’re discussing your experience with **large-scale model training and inference** on GPUs. The interviewer wants you to proactively cover optimization techniques, including low-level GPU/CUDA considerations.
Explain how you would approach **end-to-end performance optimization** for:
1. **Training at scale** (multi-GPU / multi-node)
2. **Online inference** (low latency) and **batch inference** (high throughput)
In your answer, cover:
- **Where time/memory goes** in typical deep learning workloads (compute vs memory vs communication).
- **Model-level optimizations** (architecture choices, activation checkpointing, etc.; see the checkpointing sketch after this list).
- **Numerical / precision optimizations** (FP16/BF16/FP8, loss scaling; mixed-precision sketch below).
- **Parallelism strategies** (data/tensor/pipeline/expert parallel) and when to use each.
- **Communication optimization** (all-reduce overlap, gradient bucketing, NCCL tuning; DDP overlap sketch below).
- **Kernel / CUDA-level ideas** (fusion, custom kernels, memory coalescing, avoiding syncs).
- **Inference-specific optimizations** (KV cache, batching, quantization, speculative decoding; KV-cache sketch below).
- A practical plan: what you would measure first, and what changes you'd try next (profiling sketch below).
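To make the model-level bullet concrete, here is a minimal activation-checkpointing sketch, assuming PyTorch; the stack of linear blocks, the tensor sizes, and the four-segment split are illustrative placeholders rather than anything from the question.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep stack standing in for a transformer-like model (hypothetical sizes).
model = nn.Sequential(*[nn.Sequential(nn.Linear(4096, 4096), nn.GELU()) for _ in range(24)])
x = torch.randn(8, 4096, requires_grad=True)

# Only the boundary activations of the 4 segments are stored; everything inside
# a segment is recomputed during backward, trading roughly one extra forward
# pass for a large cut in activation memory.
y = checkpoint_sequential(model, segments=4, input=x, use_reentrant=False)
loss = y.sum()
loss.backward()
```

The same idea applies per block via `torch.utils.checkpoint.checkpoint` when the model is not a plain `nn.Sequential`.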
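For the precision bullet, a sketch of FP16 training with dynamic loss scaling using PyTorch AMP; with BF16 the scaler is usually unnecessary because of its wider exponent range. The model, optimizer, and random batch are placeholders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(10):
    x = torch.randn(32, 1024, device=device)
    opt.zero_grad(set_to_none=True)
    # Matmuls run in FP16 under autocast; numerically sensitive ops stay in FP32.
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    # Scale the loss so small FP16 gradients don't underflow, then unscale at the step.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```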
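For the communication bullet, a sketch of data parallelism where bucketed gradient all-reduce overlaps with the backward pass, assuming PyTorch `DistributedDataParallel` over NCCL and a `torchrun` launch; the model and the 50 MB bucket size are illustrative knobs, not recommendations.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # NCCL backend for GPU all-reduce
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda()
    # Gradients are grouped into ~50 MB buckets; each bucket's all-reduce launches
    # as soon as its gradients are ready, overlapping communication with compute.
    ddp = DDP(model, device_ids=[rank], bucket_cap_mb=50, gradient_as_bucket_view=True)

    opt = torch.optim.SGD(ddp.parameters(), lr=0.1)
    for _ in range(5):
        x = torch.randn(32, 4096, device=rank)
        loss = ddp(x).mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()                        # per-bucket all-reduce runs during backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```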
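For the inference bullet, a sketch of prefill-then-decode with a reused KV cache, assuming the Hugging Face `transformers` API and `gpt2` as a stand-in causal LM; the prompt and the 20-token budget are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = tok("Optimizing inference means", return_tensors="pt")
with torch.no_grad():
    out = model(**prompt, use_cache=True)            # prefill: run the full prompt once
    past = out.past_key_values                       # cached K/V for every layer
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(20):                              # decode: one new token per step
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

Serving systems layer dynamic/continuous batching and quantization on top of this, but cache reuse is the basic per-token latency win.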
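For the practical-plan bullet, a minimal "measure first" sketch with `torch.profiler`; the toy model is a placeholder, and in practice you would profile a representative training or inference step of the real workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.ReLU()).to(device)
x = torch.randn(64, 2048, device=device)

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Sort by GPU time (or CPU time on a CPU-only box) to find the hot kernels and the
# gaps between them before deciding which optimization to try next.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```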
Quick Answer: This question evaluates a candidate's skills in ML system design, GPU/CUDA performance engineering, and distributed training/inference optimization. A strong answer starts by identifying where time and memory are actually spent, then reasons through the trade-offs across model-level, numerical, parallelism, communication, and kernel-level techniques.