CIFAR-like Noisy Dataset: Baseline, Data Quality Plan, and First-Principles Backprop
Context: You have a CIFAR-like dataset of 32×32 RGB images, 10–20 classes. You suspect 8–15% label noise, some corrupted images, and class imbalance. You must deliver both a data-quality plan and a minimal from-scratch learning core.
Part 1 — Baseline classifier and data-quality improvement plan
Build a baseline classifier and a practical plan to improve data quality and robustness. Clearly describe:
- How you will:
  - Detect and quantify label noise.
  - Identify and filter corrupted or low-quality samples.
  - Manage class imbalance.
  - Create robust train/validation/test splits and prevent leakage.
- Compare mitigation strategies and when to use them:
  - Confidence-based pruning/reweighting.
  - Co-teaching.
  - Strong augmentations (e.g., MixUp, CutMix; optionally RandAugment).
- Show how each step changes metrics:
  - Top-1 accuracy.
  - Calibration (e.g., Expected Calibration Error, ECE; Brier score).
  - Confusion matrix observations.
Assume you can train a small CNN/ResNet for the baseline. Keep a held-out test set untouched.
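One common way to detect and quantify label noise is to collect out-of-fold predicted probabilities from cross-validated training, then flag samples whose recorded label receives low predicted probability. A minimal sketch of the flagging step, where `flag_noisy_labels` and the 0.1 threshold are illustrative choices (the out-of-fold probabilities would come from your baseline model):

```python
import numpy as np

def flag_noisy_labels(oof_probs, labels, threshold=0.1):
    """Flag samples whose out-of-fold predicted probability for their
    recorded label falls below `threshold` -- likely mislabeled or hard.

    oof_probs : (N, C) out-of-fold class probabilities
    labels    : (N,) integer labels as recorded in the dataset
    """
    given_label_prob = oof_probs[np.arange(len(labels)), labels]
    suspect = given_label_prob < threshold
    noise_rate_estimate = suspect.mean()  # rough upper bound on noise rate
    return suspect, noise_rate_estimate

# Toy demo with synthetic probabilities (3 samples, 3 classes).
probs = np.array([[0.90, 0.05, 0.05],
                  [0.02, 0.03, 0.95],   # model strongly disagrees with label 0
                  [0.40, 0.30, 0.30]])
labels = np.array([0, 0, 0])
suspect, rate = flag_noisy_labels(probs, labels, threshold=0.1)
```

Flagged samples can then be pruned, down-weighted, or sent for relabeling; the estimated rate gives the "quantify" part of the deliverable.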
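For class imbalance, one simple baseline mitigation is inverse-frequency loss weighting. A sketch, assuming integer labels; the normalization choice (mean sample weight of 1) is illustrative:

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1/frequency, normalized so the
    average per-sample weight over the dataset is 1 (keeps loss scale stable)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)        # guard against empty classes
    w = 1.0 / counts
    w *= len(labels) / w[labels].sum()      # normalize mean sample weight to 1
    return w

labels = np.array([0, 0, 0, 0, 1, 1, 2])   # imbalanced toy labels
w = inverse_frequency_weights(labels, 3)   # rare class 2 gets the largest weight
```

Compare against oversampling the minority classes; weighting changes the loss, oversampling changes the data distribution, and the two can interact differently with noisy labels.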
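Of the augmentations listed, MixUp is easy to express directly in NumPy: each batch is replaced by convex combinations of random sample pairs and their (soft) labels. A minimal sketch; `alpha=0.2` is a typical but illustrative choice:

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    """MixUp: mix random sample pairs and their one-hot labels.
    x : (N, ...) batch;  y_onehot : (N, C) one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # random pairing
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

# Toy demo: 4 samples, 2 features, 2 classes.
x = np.arange(8, dtype=float).reshape(4, 2)
y = np.eye(2)[[0, 1, 0, 1]]
x_mix, y_mix, lam = mixup(x, y, alpha=0.2, rng=np.random.default_rng(0))
```

Note that the mixed labels are soft, so the training loss must accept label distributions, not just integer targets.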
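For the calibration metric, ECE can be computed by binning predictions by confidence and averaging the per-bin gap between accuracy and confidence, weighted by bin size. A sketch with the common 15-bin setup (the bin count is a convention, not a requirement):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: size-weighted average of |accuracy - confidence| over confidence bins."""
    conf = probs.max(axis=1)                 # predicted confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

# Perfectly confident and correct -> ECE 0; confident but wrong -> large ECE.
ece_perfect = expected_calibration_error(np.array([[1.0, 0.0], [0.0, 1.0]]),
                                         np.array([0, 1]))
ece_over = expected_calibration_error(np.array([[0.9, 0.1]]), np.array([1]))
```

Tracking ECE (and the Brier score) before and after each mitigation step shows whether pruning or reweighting is fixing accuracy at the cost of calibration.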
Part 2 — Core learning mechanics from first principles (NumPy-only)
Implement forward and backward passes for a two-layer neural network (Linear → ReLU → Linear → Softmax Cross-Entropy) using only NumPy-like linear algebra:
- Write vectorized forward and backward computations, including analytical gradients for all parameters.
- Validate gradients via numerical finite-difference checks.
- Discuss and implement numerical stability (e.g., log-sum-exp for softmax), sensible initialization, and regularization.
- Briefly describe how you would extend the architecture to CNNs suitable for this dataset.
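The requirements above can be sketched in one self-contained NumPy script: He initialization for the ReLU layer, a log-sum-exp-stable softmax cross-entropy, analytical gradients, L2 regularization, and a central finite-difference check. Layer sizes, `reg`, and `eps` are illustrative choices:

```python
import numpy as np

def init_params(d_in, d_hidden, d_out, rng):
    """He init for the ReLU layer; smaller scale for the output layer."""
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) * np.sqrt(2.0 / d_in),
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, d_out)) * np.sqrt(1.0 / d_hidden),
        "b2": np.zeros(d_out),
    }

def forward_backward(params, x, y, reg=1e-4):
    """Forward + backward for Linear -> ReLU -> Linear -> softmax CE.
    x : (N, d_in), y : (N,) integer labels.  Returns (loss, grads)."""
    N = x.shape[0]
    z1 = x @ params["W1"] + params["b1"]          # (N, H)
    h = np.maximum(z1, 0.0)                       # ReLU
    logits = h @ params["W2"] + params["b2"]      # (N, C)

    # Numerically stable log-softmax via log-sum-exp.
    m = logits.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    log_probs = logits - log_z
    data_loss = -log_probs[np.arange(N), y].mean()
    reg_loss = 0.5 * reg * ((params["W1"] ** 2).sum() + (params["W2"] ** 2).sum())
    loss = data_loss + reg_loss

    # Backward pass: d(loss)/d(logits) = (softmax - one_hot) / N.
    dlogits = np.exp(log_probs)
    dlogits[np.arange(N), y] -= 1.0
    dlogits /= N
    grads = {
        "W2": h.T @ dlogits + reg * params["W2"],
        "b2": dlogits.sum(axis=0),
    }
    dh = dlogits @ params["W2"].T
    dz1 = dh * (z1 > 0)                           # ReLU mask
    grads["W1"] = x.T @ dz1 + reg * params["W1"]
    grads["b1"] = dz1.sum(axis=0)
    return loss, grads

def grad_check(params, x, y, eps=1e-5):
    """Max relative error between analytical and central finite-difference grads."""
    _, grads = forward_backward(params, x, y)
    max_rel_err = 0.0
    for name, p in params.items():
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = forward_backward(params, x, y)
            p[idx] = old - eps
            loss_minus, _ = forward_backward(params, x, y)
            p[idx] = old                          # restore
            num = (loss_plus - loss_minus) / (2 * eps)
            ana = grads[name][idx]
            rel = abs(num - ana) / max(1e-8, abs(num) + abs(ana))
            max_rel_err = max(max_rel_err, rel)
    return max_rel_err

rng = np.random.default_rng(0)
params = init_params(d_in=12, d_hidden=8, d_out=5, rng=rng)
x = rng.standard_normal((6, 12))
y = rng.integers(0, 5, size=6)
err = grad_check(params, x, y)  # small (well below 1e-4) if backward is correct
```

The CNN extension keeps the same loss and gradient-check machinery; only the Linear layers are replaced by convolution and pooling layers with their own forward/backward rules.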