Distributed Training and GPU Efficiency for Autonomy Models

What's being tested

Interviewers are probing practical mastery of scaling and optimizing model training across GPUs: you must show you can identify compute vs memory vs IO bottlenecks, pick the right parallelism strategy, and justify tradeoffs (throughput, cost, convergence). Tesla cares because autonomy models are large, multi-modal, and must be trained efficiently to iterate quickly while remaining reproducible and debuggable. Expect questions that probe both concrete knobs (batch size, AMP, AllReduce) and diagnosis workflows (profiling, metrics).

Core knowledge

Data-parallel training: replicate model across GPUs, each processes a shard of the batch; gradients synced with AllReduce each step. Effective batch size = per-GPU-batch * num_gpus * grad_accum_steps.
Model-parallelism families: tensor-parallelism shards single-layer tensors (Megatron style), pipeline-parallelism shards layers across ranks, and ZeRO / optimizer sharding moves optimizer/grad state off individual GPUs to reduce memory blowup.
Memory accounting: total GPU memory ≈ paramsbytes + activationsbytes + optimizer_states*bytes + workspace. Activations often dominate; use activation checkpointing to trade extra compute for memory.
Mixed precision: automatic mixed precision (AMP) uses FP16 for forward/backward and FP32 for master weights; requires loss scaling to avoid underflow. AMP typically halves activation memory and doubles throughput on Tensor Cores.
Gradient accumulation & large-batch LR scaling: scale LR roughly linearly with batch size (LR' = LR * k) and use warmup; monitor optimization stability — linear rule breaks at very large batches without adaptive optimizers or longer schedules.
Communication cost and AllReduce: NCCL ring AllReduce transfers ≈ 2*(p-1)/p * D bytes per rank for a D-byte tensor; small tensors suffer poor bandwidth/latency. Fuse small allreduces and overlap compute+comm.
I/O and preprocessing: CPU-side decoding/transforms, shuffling, and serialization (e.g., TFRecord/WebDataset) can throttle GPU utilization; use prefetch, parallel data loaders, and pinned memory to maintain >90% GPU occupancy.
Profiling & telemetry: use nvidia-smi, nvprof/Nsight, PyTorch profiler and in-pipeline time breakdowns (data-load, forward, backward, allreduce). Measure p99 host-to-device latency, GPU utilization, and flop efficiency.
Scalability limits: strong scaling (fixed total batch) saturates due to communication; weak scaling (fixed per-GPU batch) is more linear. For N beyond hundreds, communication topology and accelerator interconnect matter.
Determinism & reproducibility: set RNG seeds, control cudnn deterministic flags, and be aware AMP and asynchronous comms can introduce non-determinism; document when exact reproducibility is required.
Checkpointing & failure modes: frequent checkpoints increase wall-clock but reduce rework after failure; use sharded checkpoints (DeepSpeed/ZeRO) to avoid single-GPU OOM on load.
Cost metrics: prefer reporting cost/sample (GPU-hours per million samples), time-to-accuracy, and GPU utilization, not just TFLOPS.

Frame: ask clarifying Qs — target time-to-accuracy, per-sample memory footprint, input modalities and sequence length, GPU type and interconnect topology (NVLink vs Ethernet), and whether synchronous updates are required. Skeleton: (1) estimate memory per GPU (params, optimizer, activations) and compute effective batch to fill GPUs; (2) choose parallelism mix — start with data-parallel + ZeRO stage 2/3 to shard optimizer state and gradients; add tensor parallelism for very large parameter matrices if single-layer sizes exceed GPU memory; (3) implement AMP + activation checkpointing + gradient accumulation to reach desired effective batch; (4) optimize comms: fuse gradients, use NCCL and overlap backward compute with AllReduce. Tradeoff: ZeRO stage 3 minimizes memory but increases communication and checkpoint complexity; explain you'd prefer stage 2 initially for simpler debugging. Close: state monitoring plan (profiling runs to verify >80% utilization, train/val curve checks for convergence issues) and next steps if instability appears (reduce LR, increase warmup, or switch to hybrid parallelism).

A second angle — diagnosing low GPU utilization during training

Frame quick diagnostics: measure GPU utilization, memory occupancy, and per-step time breakdown (data-load, forward, backward, allreduce). If data-load dominates, increase num_workers, use pinned memory, or move preprocessing into faster format (sharded WebDataset). If small-allreduce latency dominates (many small gradients), enable gradient fusion or layer-wise reduce scheduling. If kernel occupancy is low, switch to larger per-GPU batch size or enable AMP to leverage Tensor Cores. Emphasize verifying with profiler traces before making changes and noting convergence impacts (e.g., larger batch affects LR schedule and generalization).

Common pitfalls

Pitfall: Treating GPU utilization as the only metric. High nvidia-smi utilization can hide poor FLOP efficiency or wasted memory stalls; always pair utilization with profiler-derived kernel timelines and device-side memory metrics.

Pitfall: Blindly increasing batch size and linearly scaling learning rate. This often destabilizes optimization for complex autonomy losses; always run short convergence checks and consider adaptive optimizers or longer warmup when scaling.

Pitfall: Overusing complex model-parallel techniques early. Jumping to pipeline or tensor parallelism without exhausting ZeRO/AMP and data-parallel tuning increases engineering overhead and debugging difficulty; prefer simpler solutions that meet requirements first.

Connections

Interviewers may pivot to model serving and inference optimizations (quantization, batching for real-time constraints) or to data-pipeline engineering (sharding training data, reproducible sampling). They may also ask about hyperparameter search at scale (efficient search strategies and resource-aware tuning).

What's being tested

Core knowledge

A second angle — diagnosing low GPU utilization during training

Common pitfalls

Connections

Further reading

Related concepts

What's being tested

Core knowledge

Worked example — scaling a 2B-parameter multi-modal perception model across 64 GPUs

A second angle — diagnosing low GPU utilization during training

Common pitfalls

Connections

Further reading

Related concepts