You’re discussing your experience with large-scale model training and inference on GPUs. The interviewer wants you to proactively cover optimization techniques, including low-level GPU/CUDA considerations.
Explain how you would approach end-to-end performance optimization for:
- **Training at scale** (multi-GPU / multi-node)
- **Online inference** (low latency) and **batch inference** (high throughput)
In your answer, cover the following (a minimal PyTorch sketch for each point follows the list):
- **Where time/memory goes** in typical deep learning workloads (compute vs memory vs communication).
- **Model-level optimizations** (architecture choices, activation checkpointing, etc.).
- **Numerical / precision optimizations** (FP16/BF16/FP8, loss scaling).
- **Parallelism strategies** (data/tensor/pipeline/expert parallel) and when to use each.
- **Communication optimization** (all-reduce overlap, gradient bucketing, NCCL tuning).
- **Kernel / CUDA-level ideas** (fusion, custom kernels, memory coalescing, avoiding syncs).
- **Inference-specific optimizations** (KV cache, batching, quantization, speculative decoding).
- **A practical plan:** what you would measure first, and what changes you’d try next.
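
For the first point, a minimal sketch (assuming PyTorch on a single CUDA device) of attributing where a step's time and memory go: CUDA events time the GPU work without synchronizing after every kernel, and the caching-allocator statistics separate the persistent weight/optimizer footprint from the transient activation peak. `model`, `batch`, and `loss_fn` are placeholders.

```python
import torch

def measure_step(model, batch, loss_fn):
    """Rough attribution of where a training step's time and memory go."""
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    end.record()

    torch.cuda.synchronize()  # one sync at the end instead of after every kernel
    return {
        "fwd_bwd_ms": start.elapsed_time(end),
        "peak_mem_GiB": torch.cuda.max_memory_allocated() / 2**30,
        "persistent_mem_GiB": torch.cuda.memory_allocated() / 2**30,
    }
```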
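For the model-level point, a minimal activation-checkpointing sketch using `torch.utils.checkpoint`: activations inside each block are discarded after the forward pass and recomputed during backward, trading roughly one extra forward pass for a much smaller activation footprint. The block modules are placeholders.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Wraps a list of blocks so each one recomputes its activations in backward."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the non-reentrant variant PyTorch recommends
            x = checkpoint(block, x, use_reentrant=False)
        return x
```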
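For the precision point, a minimal PyTorch AMP training-step sketch: matmul-heavy ops run in FP16 under autocast, and `GradScaler` applies dynamic loss scaling so small FP16 gradients do not underflow (with BF16 the scaler is usually unnecessary). FP8 typically requires a dedicated library such as Transformer Engine and is not shown.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(batch["x"]), batch["y"])
    scaler.scale(loss).backward()  # scale the loss so FP16 grads don't underflow
    scaler.step(optimizer)         # unscales grads; skips the step if inf/NaN appear
    scaler.update()                # adjusts the scale factor dynamically
    return loss.detach()
```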
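For the parallelism point, a sketch of the first decision only: plain data parallelism (DDP, one full replica per GPU) when a replica fits in memory, and sharded data parallelism (FSDP, ZeRO-style sharding of parameters, gradients, and optimizer state) when it does not; tensor/pipeline/expert parallelism come in when individual layers or the whole pipeline outgrow a single device. Assumes the process group is already initialized (e.g. `torchrun` plus `init_process_group("nccl")`).

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model, local_rank, replica_fits_on_one_gpu: bool):
    model = model.cuda(local_rank)
    if replica_fits_on_one_gpu:
        # Data parallel: simplest option; gradient all-reduce overlaps with backward.
        return DDP(model, device_ids=[local_rank])
    # Replica too large for one GPU: shard params/grads/optimizer state across ranks.
    return FSDP(model)
```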
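For the communication point, a sketch of two DDP-level knobs: `bucket_cap_mb` controls gradient bucketing (fewer, larger all-reduces that overlap with backward), and `no_sync()` skips the all-reduce on gradient-accumulation steps so communication happens once per effective batch. NCCL itself is tuned through `NCCL_*` environment variables, not shown here. Assumes an initialized NCCL process group; `loss_fn` and the micro-batch format are placeholders.

```python
import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp(model, local_rank):
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        bucket_cap_mb=50,              # larger buckets -> fewer all-reduce launches
        gradient_as_bucket_view=True,  # avoid an extra copy of grads into buckets
    )

def accumulation_step(ddp_model, optimizer, micro_batches, loss_fn):
    for i, mb in enumerate(micro_batches):
        last = i == len(micro_batches) - 1
        # Only the last micro-batch triggers the gradient all-reduce.
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(mb["x"]), mb["y"]) / len(micro_batches)
            loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```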
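For the kernel-level point, the cheapest fusion path from Python: `torch.compile` can fuse a chain of elementwise ops into far fewer kernels, cutting launch overhead and round trips to HBM; a handwritten Triton/CUDA kernel is the next step when a hot pattern does not fuse. The bias-GELU-dropout chain below is only an illustrative example.

```python
import torch
import torch.nn.functional as F

def bias_gelu_dropout(x, bias, p=0.1):
    # Eager mode: several separate kernels, each reading/writing x in global memory.
    return F.dropout(F.gelu(x + bias), p=p, training=True)

fused = torch.compile(bias_gelu_dropout)  # compiler can fuse the elementwise chain

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused(x, bias)
```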
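For the inference point, a greedy-decoding sketch with a KV cache: keys/values for already-processed tokens are kept so each step runs the model only on the newest token instead of the whole prefix. The `model(ids, k_cache, v_cache) -> (logits, new_k, new_v)` interface is an assumption for illustration, not a specific library API; batching, quantization, and speculative decoding would layer on top of this loop.

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, max_new_tokens):
    k_cache = v_cache = None
    ids = prompt_ids                      # full prompt on the first step only
    out = []
    for _ in range(max_new_tokens):
        logits, new_k, new_v = model(ids, k_cache, v_cache)
        # Append this step's keys/values along the sequence dimension.
        k_cache = new_k if k_cache is None else torch.cat([k_cache, new_k], dim=-2)
        v_cache = new_v if v_cache is None else torch.cat([v_cache, new_v], dim=-2)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out.append(next_id)
        ids = next_id                     # later steps feed only the new token
    return torch.cat(out, dim=-1)
```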
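For the practical plan, measure before changing anything: a short `torch.profiler` capture shows whether a step is compute-, memory-, or communication-bound, and the same capture is repeated after each change. `train_step` and `loader` are placeholders; the trace directory name is arbitrary.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def profile_training(train_step, loader, steps=12):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=2, warmup=2, active=8),
        profile_memory=True,
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof_traces"),
    ) as prof:
        for i, batch in enumerate(loader):
            train_step(batch)
            prof.step()                   # advance the profiler schedule
            if i + 1 >= steps:
                break
    # Quick summary of the hottest GPU ops; open the trace for the full timeline.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```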