This question evaluates debugging and performance-engineering skills for deep learning training pipelines: diagnosing convergence and instability issues in PyTorch, scaling training with Fully Sharded Data Parallel (FSDP), designing sparse gradient all-reduce communication patterns, and judging when to implement custom CUDA/Triton kernels. Commonly asked for Machine Learning Engineer roles within the Software Engineering Fundamentals domain, it assesses both conceptual understanding (numerical stability, distributed systems) and practical skills (configuring multi-GPU training, weighing communication trade-offs, and validating low-level performance and correctness).
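To illustrate the numerical-stability theme the question touches on, here is a minimal sketch (plain Python, no PyTorch, chosen here for self-containment) of the log-sum-exp trick that underlies stable softmax/cross-entropy implementations; the function name `logsumexp` and the example logits are illustrative, not part of the question itself:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) via the max-shift trick.

    Subtracting the maximum before exponentiating keeps every exp()
    argument <= 0, so no term can overflow; the shift is added back
    outside the log.
    """
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naive math.log(sum(math.exp(x) for x in logits)) raises OverflowError
# for logits this large; the shifted version is exact.
logits = [1000.0, 1000.0]
print(logsumexp(logits))  # 1000 + log(2)
```

The same max-shift idea is what `torch.nn.functional.cross_entropy` applies internally, which is why fusing softmax and the loss is the standard fix for `inf`/`nan` losses in a training script.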
You are given a PyTorch training script for a CIFAR-10 image classifier that either: