
Debug a GPT training pipeline

Last updated: Mar 29, 2026

Quick Overview

This question evaluates debugging and implementation skills for transformer-based training pipelines, specifically attention masking, training-loop correctness, positional encoding integration, and the ability to design unit tests that catch such bugs.



Company: Applied Intuition

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: Medium

Interview Round: Technical Screen



Fix three bugs in a minimal GPT to meet a training-loss target

You are given a Colab notebook with a minimal GPT-style language model implemented in PyTorch (token embedding → transformer blocks → LM head), along with training and inference code. On a small toy dataset, training currently fails to reach the target loss.

Your task:

  • Identify and fix the following three issues so that training loss drops below a specified threshold on a small dataset (a hedged sketch of each fix follows the assumptions below):
    1. Incorrect attention masking (causal mask and/or padding mask mishandled).
    2. A bug in the training loop (e.g., missing optimizer.zero_grad(), not calling model.train(), misaligned input/target token shift, wrong device placement, or incorrect loss reduction).
    3. Missing positional encoding integration.
  • Provide concrete code changes (edits or snippets) that implement each fix.
  • Provide unit tests that would have caught each bug (example tests are sketched below).
  • Include a brief rationale for each fix.

Assume:

  • PyTorch 2.x, CUDA if available.
  • A tiny character-level dataset (or synthetic tokens) with a small vocabulary and a fixed max sequence length (e.g., 64), with optional padding; pad tokens are excluded from the loss via ignore_index.
  • Success criterion: training loss on the toy training set drops below 1.0 within a few epochs on CPU (below 0.5 on GPU).


