Scenario
You are asked to build an image classification model (single-label, multi-class) for a product team. The image dataset is known to be dirty (e.g., corrupted files, wrong labels, duplicates, irrelevant images, inconsistent formats). Compared with text classification, image inputs often require additional preprocessing and validation.
Tasks
- Design the end-to-end approach to train and evaluate an image classifier.
- Describe how you would measure the “dirty rate” of the image data (what counts as dirty, how to estimate it reliably).
- Follow-up: After training a baseline, you discover performance is worse than expected. Explain how you would identify data problems (not just model problems) and propose concrete data and pipeline improvements.
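As a starting point for the dirty-rate task, one possible approach is to hand-audit a uniform random sample and report the observed dirty fraction with a confidence interval, while running cheap automated checks (such as exact-duplicate detection) over the full set. The sketch below is illustrative only and is not a prescribed answer. The function names, the `id -> bytes` input shape, and the choice of a Wilson score interval are all assumptions; it uses only the Python standard library.

```python
import hashlib
import math
import random


def wilson_interval(dirty, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("audit sample is empty")
    p = dirty / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def draw_audit_sample(item_ids, k, seed=0):
    """Uniform sample without replacement; sorting plus a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    return rng.sample(sorted(item_ids), min(k, len(item_ids)))


def exact_duplicate_groups(blobs):
    """Group ids whose raw bytes are identical (hypothetical input: dict of id -> bytes)."""
    by_hash = {}
    for item_id, data in blobs.items():
        by_hash.setdefault(hashlib.sha256(data).hexdigest(), []).append(item_id)
    return [sorted(ids) for ids in by_hash.values() if len(ids) > 1]


# Example: 48 of 400 audited images were judged dirty by a human reviewer.
low, high = wilson_interval(48, 400)
print(f"observed dirty rate 12.0%, approx. 95% interval {low:.1%} to {high:.1%}")
```

Byte-level hashing only catches exact copies. Near-duplicates (resized or re-encoded images) and wrong labels need other checks, for example perceptual hashing or human review, which are outside this sketch.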
Constraints / clarifications (you may state assumptions)
- You may assume typical real-world constraints: limited labeling budget, heterogeneous image sources, and the need for reproducible training.
- You should specify what metrics you would use (overall and per-class) and how you would validate improvements.
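To make "overall and per-class" metrics concrete, a minimal sketch follows. It assumes string class labels and computes accuracy, macro-averaged F1, and per-class precision, recall, and support from paired true/predicted labels. The function name `per_class_report` is hypothetical, and the metric choice is one reasonable option rather than the required answer.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (accuracy, macro_f1, per-class dict) for single-label predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no examples to evaluate")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    report = {}
    for c in labels:
        predicted = tp[c] + fp[c]
        support = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": support}
    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)
    return accuracy, macro_f1, report


# Example with two classes: one "cat" image is misclassified as "dog".
acc, macro_f1, report = per_class_report(
    ["cat", "cat", "dog", "dog"], ["cat", "dog", "dog", "dog"]
)
print(acc, round(macro_f1, 3), report["cat"]["recall"])
```

Macro-averaging weights every class equally, so it surfaces weak minority classes that overall accuracy can hide. To validate an improvement, the same report can be recomputed on a fixed, cleaned held-out set before and after the change.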