CIFAR-like Noisy Dataset: Baseline, Data Quality Plan, and First-Principles Backprop
Context: You have a CIFAR-like dataset of 32×32 RGB images, 10–20 classes. You suspect 8–15% label noise, some corrupted images, and class imbalance. You must deliver both a data-quality plan and a minimal from-scratch learning core.
Part 1 — Baseline classifier and data-quality improvement plan
Build a baseline classifier and a practical plan to improve data quality and robustness. Clearly describe:
- How you will:
  - Detect and quantify label noise.
  - Identify and filter corrupted or low-quality samples.
  - Manage class imbalance.
  - Create robust train/validation/test splits and prevent leakage.
- Compare mitigation strategies and when to use them:
  - Confidence-based pruning/reweighting.
  - Co-teaching.
  - Strong augmentations (e.g., MixUp, CutMix; optionally RandAugment).
- Show how each step changes metrics:
  - Top-1 accuracy.
  - Calibration (e.g., Expected Calibration Error, ECE; Brier score).
  - Confusion matrix observations.
Assume you can train a small CNN/ResNet for the baseline. Keep a held-out test set untouched.
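One common way to detect and quantify label noise is to collect out-of-fold predicted probabilities from cross-validated training, then flag samples whose recorded label receives low predicted probability. A minimal sketch of the flagging step, where `flag_noisy_labels` and the 0.1 threshold are illustrative choices (the out-of-fold probabilities would come from your baseline model):

```python
import numpy as np

def flag_noisy_labels(oof_probs, labels, threshold=0.1):
    """Flag samples whose out-of-fold predicted probability for their
    recorded label falls below `threshold` -- likely mislabeled or hard.

    oof_probs : (N, C) out-of-fold class probabilities
    labels    : (N,) integer labels as recorded in the dataset
    """
    given_label_prob = oof_probs[np.arange(len(labels)), labels]
    suspect = given_label_prob < threshold
    noise_rate_estimate = suspect.mean()  # rough upper bound on noise rate
    return suspect, noise_rate_estimate

# Toy demo with synthetic probabilities (3 samples, 3 classes).
probs = np.array([[0.90, 0.05, 0.05],
                  [0.02, 0.03, 0.95],   # model strongly disagrees with label 0
                  [0.40, 0.30, 0.30]])
labels = np.array([0, 0, 0])
suspect, rate = flag_noisy_labels(probs, labels, threshold=0.1)
```

Flagged samples can then be pruned, down-weighted, or sent for relabeling; the estimated rate gives the "quantify" part of the deliverable.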
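For class imbalance, one simple baseline mitigation is inverse-frequency loss weighting. A sketch, assuming integer labels; the normalization choice (mean sample weight of 1) is illustrative:

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1/frequency, normalized so the
    average per-sample weight over the dataset is 1 (keeps loss scale stable)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)        # guard against empty classes
    w = 1.0 / counts
    w *= len(labels) / w[labels].sum()      # normalize mean sample weight to 1
    return w

labels = np.array([0, 0, 0, 0, 1, 1, 2])   # imbalanced toy labels
w = inverse_frequency_weights(labels, 3)   # rare class 2 gets the largest weight
```

Compare against oversampling the minority classes; weighting changes the loss, oversampling changes the data distribution, and the two can interact differently with noisy labels.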
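Of the augmentations listed, MixUp is easy to express directly in NumPy: each batch is replaced by convex combinations of random sample pairs and their (soft) labels. A minimal sketch; `alpha=0.2` is a typical but illustrative choice:

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    """MixUp: mix random sample pairs and their one-hot labels.
    x : (N, ...) batch;  y_onehot : (N, C) one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # random pairing
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

# Toy demo: 4 samples, 2 features, 2 classes.
x = np.arange(8, dtype=float).reshape(4, 2)
y = np.eye(2)[[0, 1, 0, 1]]
x_mix, y_mix, lam = mixup(x, y, alpha=0.2, rng=np.random.default_rng(0))
```

Note that the mixed labels are soft, so the training loss must accept label distributions, not just integer targets.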
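For the calibration metric, ECE can be computed by binning predictions by confidence and averaging the per-bin gap between accuracy and confidence, weighted by bin size. A sketch with the common 15-bin setup (the bin count is a convention, not a requirement):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: size-weighted average of |accuracy - confidence| over confidence bins."""
    conf = probs.max(axis=1)                 # predicted confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

# Perfectly confident and correct -> ECE 0; confident but wrong -> large ECE.
ece_perfect = expected_calibration_error(np.array([[1.0, 0.0], [0.0, 1.0]]),
                                         np.array([0, 1]))
ece_over = expected_calibration_error(np.array([[0.9, 0.1]]), np.array([1]))
```

Tracking ECE (and the Brier score) before and after each mitigation step shows whether pruning or reweighting is fixing accuracy at the cost of calibration.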
Part 2 — Core learning mechanics from first principles (NumPy-only)
Implement forward and backward passes for a two-layer neural network (Linear → ReLU → Linear → Softmax Cross-Entropy) using only NumPy-like linear algebra:
- Write vectorized forward and backward computations, including analytical gradients for all parameters.
- Validate gradients via numerical finite-difference checks.
- Discuss and implement numerical stability (e.g., log-sum-exp for softmax), sensible initialization, and regularization.
- Briefly describe how you would extend the architecture to CNNs suitable for this dataset.
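The requirements above can be sketched in one self-contained NumPy script: He initialization for the ReLU layer, a log-sum-exp-stable softmax cross-entropy, analytical gradients, L2 regularization, and a central finite-difference check. Layer sizes, `reg`, and `eps` are illustrative choices:

```python
import numpy as np

def init_params(d_in, d_hidden, d_out, rng):
    """He init for the ReLU layer; smaller scale for the output layer."""
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) * np.sqrt(2.0 / d_in),
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, d_out)) * np.sqrt(1.0 / d_hidden),
        "b2": np.zeros(d_out),
    }

def forward_backward(params, x, y, reg=1e-4):
    """Forward + backward for Linear -> ReLU -> Linear -> softmax CE.
    x : (N, d_in), y : (N,) integer labels.  Returns (loss, grads)."""
    N = x.shape[0]
    z1 = x @ params["W1"] + params["b1"]          # (N, H)
    h = np.maximum(z1, 0.0)                       # ReLU
    logits = h @ params["W2"] + params["b2"]      # (N, C)

    # Numerically stable log-softmax via log-sum-exp.
    m = logits.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    log_probs = logits - log_z
    data_loss = -log_probs[np.arange(N), y].mean()
    reg_loss = 0.5 * reg * ((params["W1"] ** 2).sum() + (params["W2"] ** 2).sum())
    loss = data_loss + reg_loss

    # Backward pass: d(loss)/d(logits) = (softmax - one_hot) / N.
    dlogits = np.exp(log_probs)
    dlogits[np.arange(N), y] -= 1.0
    dlogits /= N
    grads = {
        "W2": h.T @ dlogits + reg * params["W2"],
        "b2": dlogits.sum(axis=0),
    }
    dh = dlogits @ params["W2"].T
    dz1 = dh * (z1 > 0)                           # ReLU mask
    grads["W1"] = x.T @ dz1 + reg * params["W1"]
    grads["b1"] = dz1.sum(axis=0)
    return loss, grads

def grad_check(params, x, y, eps=1e-5):
    """Max relative error between analytical and central finite-difference grads."""
    _, grads = forward_backward(params, x, y)
    max_rel_err = 0.0
    for name, p in params.items():
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = forward_backward(params, x, y)
            p[idx] = old - eps
            loss_minus, _ = forward_backward(params, x, y)
            p[idx] = old                          # restore
            num = (loss_plus - loss_minus) / (2 * eps)
            ana = grads[name][idx]
            rel = abs(num - ana) / max(1e-8, abs(num) + abs(ana))
            max_rel_err = max(max_rel_err, rel)
    return max_rel_err

rng = np.random.default_rng(0)
params = init_params(d_in=12, d_hidden=8, d_out=5, rng=rng)
x = rng.standard_normal((6, 12))
y = rng.integers(0, 5, size=6)
err = grad_check(params, x, y)  # small (well below 1e-4) if backward is correct
```

The CNN extension keeps the same loss and gradient-check machinery; only the Linear layers are replaced by convolution and pooling layers with their own forward/backward rules.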