Debug and scale a PyTorch training loop
Company: Startups.Com
Role: Machine Learning Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Onsite
You are given a PyTorch training script for a CIFAR-10 image classifier that exhibits one of the following problems:
- it does not converge (accuracy stays near random),
- it becomes unstable (the loss becomes NaN/Inf), or
- it is much slower than expected.
1) Provide a systematic debugging checklist to find the root cause (data, model, loss, optimizer, device, precision, etc.). Include quick experiments you would run.
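A couple of quick experiments usually narrow the search fast; a minimal sketch is below, where `model`, `optimizer`, `criterion`, `loader`, and `device` are stand-ins for whatever the script under test defines. Overfitting a single fixed batch should drive the loss toward zero if the data pipeline, labels, loss, and optimizer are wired correctly, and a gradient check surfaces NaN/Inf or exploding norms:

```python
import torch

def overfit_one_batch(model, optimizer, criterion, loader, device, steps=200):
    """If the pipeline is wired correctly, the loss on a single fixed batch
    should drop close to zero; if it does not, suspect data, labels, loss, or LR."""
    model.train()
    x, y = next(iter(loader))
    x, y = x.to(device), y.to(device)
    for step in range(steps):
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if step % 20 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

def check_gradients(model):
    """Run after one backward pass: look for missing, NaN/Inf, or exploding gradients."""
    total = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"no grad: {name}")  # frozen or accidentally detached parameter
            continue
        if not torch.isfinite(p.grad).all():
            print(f"non-finite grad: {name}")
        total += p.grad.norm().item() ** 2
    print(f"global grad norm: {total ** 0.5:.3e}")

# torch.autograd.set_detect_anomaly(True)  # slow, but pinpoints the op producing NaN
```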
2) Then, describe how you would scale the same training to multiple GPUs using **Fully Sharded Data Parallel (FSDP)** (a minimal setup sketch follows this list):
- How to initialize distributed training
- How to wrap the model
- How optimizer/gradients/state are handled
- How to handle gradient accumulation and checkpointing
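The sketch below covers these four points under stated assumptions: it is launched with `torchrun` (which sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`), uses the NCCL backend, and follows the `torch.distributed.fsdp` API as it exists around PyTorch 2.x; exact arguments and recommended checkpoint paths have shifted across versions, and `model_fn`, `criterion`, and `micro_batches` are hypothetical placeholders.

```python
import contextlib
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    MixedPrecision,
    ShardingStrategy,
    StateDictType,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def setup_fsdp(model_fn, lr=1e-3):
    # torchrun provides RANK / LOCAL_RANK / WORLD_SIZE; NCCL is the usual GPU backend.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = FSDP(
        model_fn().cuda(),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=100_000
        ),
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=local_rank,
    )
    # Build the optimizer AFTER wrapping so it references the sharded parameters;
    # its state is then sharded the same way as the parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer

def accumulate_and_step(model, optimizer, criterion, micro_batches, accum_steps=4):
    # Gradient accumulation: skip the gradient reduce-scatter on all but the
    # last micro-batch by wrapping forward/backward in no_sync().
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(micro_batches):
        is_last = (i + 1) % accum_steps == 0
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            loss = criterion(model(x.cuda()), y.cuda()) / accum_steps
            loss.backward()
        if is_last:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

def save_checkpoint(model, path):
    # Gather a full (unsharded) state dict onto rank 0 only; for very large
    # models, sharded checkpoints via torch.distributed.checkpoint avoid this gather.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
```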
3) Next, discuss how you would implement or approximate a **sparse matrix / sparse gradient all-reduce** (a compression sketch follows this list):
- What communication patterns you would use
- What tradeoffs exist vs dense all-reduce
- When it’s worth doing
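One common approximation is top-k gradient compression: each rank keeps only its largest-magnitude gradient entries, and ranks exchange (index, value) pairs with `all_gather`, so communication volume scales with k rather than with the full parameter count, at the cost of an approximation that is usually repaired with error feedback. A rough sketch, with the error-feedback residual omitted for brevity:

```python
import torch
import torch.distributed as dist

def topk_sparse_allreduce(grad: torch.Tensor, k_ratio: float = 0.01) -> torch.Tensor:
    """Approximate a sparse all-reduce: each rank keeps its top-k entries by
    magnitude, all ranks exchange (index, value) pairs, and the result is
    scattered back into a dense buffer and averaged."""
    world = dist.get_world_size()
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * k_ratio))

    _, indices = torch.topk(flat.abs(), k)
    values = flat[indices]  # keep the signed values

    # Gather every rank's selected indices and values (2*k elements per rank,
    # versus the full tensor for a dense all-reduce).
    all_idx = [torch.empty_like(indices) for _ in range(world)]
    all_val = [torch.empty_like(values) for _ in range(world)]
    dist.all_gather(all_idx, indices)
    dist.all_gather(all_val, values)

    # Accumulate into a dense buffer and average, mirroring dense all-reduce.
    out = torch.zeros_like(flat)
    for idx, val in zip(all_idx, all_val):
        out.index_add_(0, idx, val)
    return (out / world).reshape_as(grad)
```

For a model as small as a CIFAR-10 classifier, dense all-reduce (or FSDP's reduce-scatter) is almost always faster in practice; sparse exchange tends to pay off only when gradients are genuinely sparse (e.g., large embedding tables) or when network bandwidth is the bottleneck.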
4) Under what conditions would you propose writing a **custom CUDA/Triton kernel** for training performance, and how would you validate the speedup and correctness?
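Whatever kernel ends up being written, the validation harness matters as much as the kernel itself. A sketch of one is below, where `custom_fn`, `reference_fn`, and `make_inputs` are hypothetical placeholders for the kernel under test, its eager PyTorch reference, and an input generator:

```python
import torch
import torch.utils.benchmark as benchmark

def validate_kernel(custom_fn, reference_fn, make_inputs, rtol=1e-2, atol=1e-2):
    """Compare a custom (e.g. Triton/CUDA) kernel against the eager PyTorch
    reference for numerical agreement, then time both."""
    # 1) Correctness across several randomly generated inputs (ideally also
    #    edge cases: odd shapes, non-contiguous tensors, different dtypes).
    for _ in range(5):
        args = make_inputs()
        out_ref = reference_fn(*args)
        out_new = custom_fn(*args)
        torch.testing.assert_close(out_new, out_ref, rtol=rtol, atol=atol)

    # 2) If the kernel has a custom backward, check gradients too, e.g. with
    #    torch.autograd.gradcheck on small double-precision inputs.

    # 3) Speed: torch.utils.benchmark handles CUDA synchronization and warmup.
    args = make_inputs()
    t_ref = benchmark.Timer(stmt="fn(*args)", globals={"fn": reference_fn, "args": args})
    t_new = benchmark.Timer(stmt="fn(*args)", globals={"fn": custom_fn, "args": args})
    print("reference:", t_ref.timeit(100))
    print("custom:   ", t_new.timeit(100))
```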
Quick Answer: This question evaluates debugging and performance engineering skills for deep learning training pipelines: diagnosing convergence and instability issues in PyTorch, scaling training with Fully Sharded Data Parallel (FSDP), designing sparse gradient all-reduce communication patterns, and judging when to implement custom CUDA/Triton kernels. Commonly asked for Machine Learning Engineer roles within the Software Engineering Fundamentals domain, it assesses conceptual understanding of numerical stability and distributed systems as well as practical skill in configuring multi-GPU training, reasoning about communication trade-offs, and validating low-level performance and correctness.