Build and troubleshoot image classification and backprop
Company: OpenAI
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You are given a CIFAR-like dataset of 32×32 color images across 10–20 classes with suspected data issues (label noise ~8–15%, corrupted images, and class imbalance).
1) Build a baseline classifier and a data-quality improvement plan: describe how you will detect and quantify label noise, identify and filter corrupted or low-quality samples, manage class imbalance, create robust train/validation/test splits, and prevent leakage. Compare mitigation strategies (e.g., confidence-based pruning, co-teaching, strong augmentations such as MixUp/CutMix), and show how each step changes metrics (top-1 accuracy, calibration, confusion matrix).
2) Implement core learning mechanics from first principles: using only NumPy-like linear algebra, write forward and backward passes for a two-layer network (linear → ReLU → linear → softmax cross-entropy), compute analytical gradients, and validate them with numerical gradient checks. Discuss numerical stability (log-sum-exp), initialization, regularization, and how you would extend to CNNs for this dataset.
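For part 1, one way to start quantifying label noise and calibration is sketched below. It assumes you already have out-of-fold predicted probabilities from a baseline model (for example via k-fold cross-validation). The function names `flag_suspect_labels` and `expected_calibration_error` are hypothetical, and the per-class threshold rule is a simplified confidence-based heuristic, not a full noise-estimation method.

```python
import numpy as np

def flag_suspect_labels(probs, labels):
    """Flag samples that look mislabeled under a simple confidence rule.

    probs:  (N, C) out-of-fold predicted probabilities (assumed given).
    labels: (N,) integer labels as provided in the dataset.
    """
    n, c = probs.shape
    rows = np.arange(n)
    self_conf = probs[rows, labels]  # confidence in the given label
    # Per-class threshold: mean confidence among samples labeled as that class.
    thresholds = np.array([
        self_conf[labels == k].mean() if np.any(labels == k) else 1.0
        for k in range(c)
    ])
    pred = probs.argmax(axis=1)
    # Suspect: the model prefers another class and is at least as confident
    # as a typical member of that other class.
    return (pred != labels) & (probs[rows, pred] >= thresholds[pred])

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin samples by top-1 confidence and average |accuracy - confidence|."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Weight each bin by the fraction of samples it contains.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

The fraction of flagged samples gives a rough estimate of the noise rate to compare against the suspected 8–15%, and flagged samples are candidates for manual review or pruning. Recomputing top-1 accuracy, the calibration error, and the confusion matrix on a held-out set after each mitigation step shows what each step contributes.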
Quick Answer: This question evaluates a candidate's skills in image classification, data-quality assessment (label-noise detection, corrupted-sample filtering, class-imbalance handling), comparison of robustness strategies, and low-level neural network mechanics, including vectorized forward and backward passes, gradient checking, and numerical stability in a NumPy-only implementation.
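For part 2, a minimal NumPy sketch of the two-layer network (linear → ReLU → linear → softmax cross-entropy) with analytical gradients and a central-difference gradient check might look like the following. The function names, the parameter dictionary layout, He-style initialization, and the L2 penalty are illustrative assumptions, not requirements from the question.

```python
import numpy as np

def init_params(d_in, d_hidden, n_classes, rng):
    # He-style scaling for the ReLU layer; 1/sqrt(fan_in) for the output layer.
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) * np.sqrt(2.0 / d_in),
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, n_classes)) * np.sqrt(1.0 / d_hidden),
        "b2": np.zeros(n_classes),
    }

def loss_and_grads(params, X, y, reg=0.0):
    W1, b1, W2, b2 = (params[k] for k in ("W1", "b1", "W2", "b2"))
    n = X.shape[0]
    rows = np.arange(n)
    # Forward pass.
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)
    logits = h @ W2 + b2
    # Log-sum-exp trick: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[rows, y].mean()
    loss += 0.5 * reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    # Backward pass: d(loss)/d(logits) = (softmax - one_hot) / n.
    dlogits = np.exp(log_probs)
    dlogits[rows, y] -= 1.0
    dlogits /= n
    grads = {"W2": h.T @ dlogits + reg * W2, "b2": dlogits.sum(axis=0)}
    dz1 = (dlogits @ W2.T) * (z1 > 0)  # ReLU passes gradient only where z1 > 0
    grads["W1"] = X.T @ dz1 + reg * W1
    grads["b1"] = dz1.sum(axis=0)
    return loss, grads

def max_relative_error(params, X, y, reg=0.0, eps=1e-5):
    """Compare analytical gradients with central differences, entry by entry."""
    _, grads = loss_and_grads(params, X, y, reg)
    worst = 0.0
    for name, p in params.items():
        num = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = loss_and_grads(params, X, y, reg)
            p[idx] = old - eps
            loss_minus, _ = loss_and_grads(params, X, y, reg)
            p[idx] = old  # restore the parameter
            num[idx] = (loss_plus - loss_minus) / (2.0 * eps)
        denom = np.maximum(np.abs(num) + np.abs(grads[name]), 1e-8)
        worst = max(worst, float(np.max(np.abs(num - grads[name]) / denom)))
    return worst
```

Running the check in float64 on a tiny batch (a handful of samples and a few hidden units) keeps it fast. A relative error that is small compared with the step size suggests the backward pass matches the forward pass, while a large error usually points to a missing factor such as the 1/n averaging or the ReLU mask. Extending to CNNs would replace the first linear layer with convolution and pooling layers, and the same gradient check can be reused.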