You are training a Transformer-based text model in PyTorch for a sequence task (e.g., causal language modeling, sequence classification, or token classification). The model shows four symptoms:
Assume a standard training stack: PyTorch, Hugging Face–style tokenization, DataLoader(s), CrossEntropy loss variants (ignore_index, label smoothing, class weights), AdamW + scheduler, optional DDP, AMP, and gradient clipping.
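For concreteness, a minimal sketch of the training step this stack implies is shown below; the toy encoder, sizes, and hyperparameters are illustrative stand-ins, not a prescribed implementation:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: any Transformer emitting (batch, seq, vocab) logits fits.
vocab_size, d_model, pad_id = 1000, 64, 0
device = "cuda" if torch.cuda.is_available() else "cpu"

embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id).to(device)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
).to(device)
head = nn.Linear(d_model, vocab_size).to(device)
params = [*embed.parameters(), *encoder.parameters(), *head.parameters()]

optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, total_iters=1000)
criterion = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def train_step(input_ids: torch.Tensor, labels: torch.Tensor) -> float:
    # labels are assumed pre-aligned with the logits (e.g., already shifted
    # for causal LM), with pad_id at positions that should not be scored.
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        logits = head(encoder(embed(input_ids)))             # (B, T, V)
        loss = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)   # clip the true gradients, not the scaled ones
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```

Under DDP, the same step logic applies once the modules are wrapped in torch.nn.parallel.DistributedDataParallel; nothing else in the sketch changes.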
Propose a systematic, end-to-end debugging plan to localize and resolve all four issues. For each area below, specify concrete checks and experiments, describe the expected failure signal(s), outline a minimal reproducible example or unit test, and state how you would implement and verify the fix.
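As one shape the "minimal reproducible example or unit test" part could take (a hypothetical sketch; the tiny stand-in model, seed, and thresholds are arbitrary), two classic sanity tests are:

```python
import math
import torch
import torch.nn as nn

def test_initial_loss_is_near_uniform():
    """With uniform logits, cross-entropy must equal ln(vocab_size).
    An untrained, well-initialized model should land close to this value."""
    vocab_size = 1000
    logits = torch.zeros(8, 32, vocab_size)          # uniform predictions
    labels = torch.randint(0, vocab_size, (8, 32))
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), labels.reshape(-1)
    )
    assert math.isclose(loss.item(), math.log(vocab_size), rel_tol=1e-3)

def test_can_overfit_one_batch():
    """Correct model/loss/optimizer wiring should drive the loss near zero
    on a single memorized batch within a few hundred steps."""
    torch.manual_seed(0)
    model = nn.Linear(16, 4)                         # stand-in for the Transformer
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    # weight decay disabled so the loss can actually be driven toward zero
    opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.0)
    for _ in range(500):
        opt.zero_grad(set_to_none=True)
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.1
```

A much larger initial loss typically points at label/logit misalignment or a bad ignore_index; failing to overfit one batch points at the core wiring (shapes, masking, optimizer) rather than the data pipeline or schedule.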
Explain how each suspected bug would manifest, how you’d isolate it, and how to confirm the fix.
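As a concrete instance, here is one isolation check for the ignore_index path (a sketch; pad_id, shapes, and values are made up): corrupt the logits at padded positions and confirm the loss does not move, since those positions should contribute nothing.

```python
import torch
import torch.nn as nn

def check_padding_is_ignored(pad_id: int = 0) -> None:
    """Failure signal: if ignore_index is mis-set, or labels are not aligned
    the way the loss expects, garbage logits at padded positions move the loss."""
    torch.manual_seed(0)
    vocab_size = 50
    logits = torch.randn(4, 10, vocab_size)
    labels = torch.randint(1, vocab_size, (4, 10))
    labels[:, 7:] = pad_id                      # simulate right padding
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    base = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))

    corrupted = logits.clone()
    corrupted[:, 7:, :] = 1e4                   # garbage where labels are ignored
    after = criterion(corrupted.reshape(-1, vocab_size), labels.reshape(-1))
    assert torch.allclose(base, after), "padded positions leak into the loss"

check_padding_is_ignored()
```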