How would you optimize large-scale training/inference?
Company: NVIDIA
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
You’re discussing your experience with **large-scale model training and inference** on GPUs. The interviewer wants you to proactively cover optimization techniques, including low-level GPU/CUDA considerations.
Explain how you would approach **end-to-end performance optimization** for:
1. **Training at scale** (multi-GPU / multi-node)
2. **Online inference** (low latency) and **batch inference** (high throughput)
In your answer, cover:
- **Where time/memory goes** in typical deep learning workloads (compute vs memory vs communication).
- **Model-level optimizations** (architecture choices, activation checkpointing, etc.; see the checkpointing sketch after this list).
- **Numerical / precision optimizations** (FP16/BF16/FP8, loss scaling; mixed-precision sketch below).
- **Parallelism strategies** (data/tensor/pipeline/expert parallel) and when to use each.
- **Communication optimization** (all-reduce overlap, gradient bucketing, NCCL tuning; DDP overlap sketch below).
- **Kernel / CUDA-level ideas** (fusion, custom kernels, memory coalescing, avoiding syncs).
- **Inference-specific optimizations** (KV cache, batching, quantization, speculative decoding; KV-cache sketch below).
- A practical plan: what you would measure first, and what changes you'd try next (profiling sketch below).
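To make the model-level bullet concrete, here is a minimal activation-checkpointing sketch, assuming PyTorch; the stack of linear blocks, the tensor sizes, and the four-segment split are illustrative placeholders rather than anything from the question.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep stack standing in for a transformer-like model (hypothetical sizes).
model = nn.Sequential(*[nn.Sequential(nn.Linear(4096, 4096), nn.GELU()) for _ in range(24)])
x = torch.randn(8, 4096, requires_grad=True)

# Only the boundary activations of the 4 segments are stored; everything inside
# a segment is recomputed during backward, trading roughly one extra forward
# pass for a large cut in activation memory.
y = checkpoint_sequential(model, segments=4, input=x, use_reentrant=False)
loss = y.sum()
loss.backward()
```

The same idea applies per block via `torch.utils.checkpoint.checkpoint` when the model is not a plain `nn.Sequential`.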
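For the precision bullet, a sketch of FP16 training with dynamic loss scaling using PyTorch AMP; with BF16 the scaler is usually unnecessary because of its wider exponent range. The model, optimizer, and random batch are placeholders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(10):
    x = torch.randn(32, 1024, device=device)
    opt.zero_grad(set_to_none=True)
    # Matmuls run in FP16 under autocast; numerically sensitive ops stay in FP32.
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    # Scale the loss so small FP16 gradients don't underflow, then unscale at the step.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```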
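For the communication bullet, a sketch of data parallelism where bucketed gradient all-reduce overlaps with the backward pass, assuming PyTorch `DistributedDataParallel` over NCCL and a `torchrun` launch; the model and the 50 MB bucket size are illustrative knobs, not recommendations.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # NCCL backend for GPU all-reduce
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda()
    # Gradients are grouped into ~50 MB buckets; each bucket's all-reduce launches
    # as soon as its gradients are ready, overlapping communication with compute.
    ddp = DDP(model, device_ids=[rank], bucket_cap_mb=50, gradient_as_bucket_view=True)

    opt = torch.optim.SGD(ddp.parameters(), lr=0.1)
    for _ in range(5):
        x = torch.randn(32, 4096, device=rank)
        loss = ddp(x).mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()                        # per-bucket all-reduce runs during backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```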
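For the inference bullet, a sketch of prefill-then-decode with a reused KV cache, assuming the Hugging Face `transformers` API and `gpt2` as a stand-in causal LM; the prompt and the 20-token budget are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = tok("Optimizing inference means", return_tensors="pt")
with torch.no_grad():
    out = model(**prompt, use_cache=True)            # prefill: run the full prompt once
    past = out.past_key_values                       # cached K/V for every layer
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(20):                              # decode: one new token per step
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

Serving systems layer dynamic/continuous batching and quantization on top of this, but cache reuse is the basic per-token latency win.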
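For the practical-plan bullet, a minimal "measure first" sketch with `torch.profiler`; the toy model is a placeholder, and in practice you would profile a representative training or inference step of the real workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.ReLU()).to(device)
x = torch.randn(64, 2048, device=device)

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Sort by GPU time (or CPU time on a CPU-only box) to find the hot kernels and the
# gaps between them before deciding which optimization to try next.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```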
Quick Answer: This question evaluates a candidate's skills in ML system design, GPU/CUDA performance engineering, and distributed training/inference optimization. A strong answer starts by identifying where time and memory are actually spent, then reasons through the trade-offs across model-level, numerical, parallelism, communication, and kernel-level techniques.