How do I approach ML System Design interview questions?

ML System Design questions require a solid understanding of core concepts, plus practice. PracHub provides solutions with explanations to help you prepare for ML System Design interviews.

What difficulty level is this interview question?

This is an easy-difficulty ML System Design question, commonly asked during onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

How would you build an image classifier with dirty data?

This question evaluates a candidate's ability to design end-to-end image classification systems and manage noisy image datasets, testing competencies in data validation, labeling strategy, pipeline reproducibility, and diagnostic analysis.

Scenario

You are asked to build an image classification model (single-label, multi-class) for a product team. The image dataset is known to be dirty (e.g., corrupted files, wrong labels, duplicates, irrelevant images, inconsistent formats). Compared with text classification, image inputs often require additional preprocessing and validation.
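As an illustration of the kinds of dirt listed above, the following is a minimal sketch of a cheap first-pass audit that flags files whose header matches no known image signature and groups byte-identical duplicates. It is only one possible starting point: the function names (`sniff_format`, `audit_files`) and the in-memory `files` mapping are hypothetical, the signature table is not exhaustive, and this check would not catch truncated files, near-duplicates, wrong labels, or irrelevant images.

```python
import hashlib
from collections import defaultdict

# Leading "magic" bytes for a few common image formats (not exhaustive).
MAGIC = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_format(data: bytes):
    """Return a format name if the header matches a known signature, else None."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


def audit_files(files: dict):
    """Audit a mapping of file name -> raw bytes (hypothetical input shape).

    Returns (unreadable, duplicate_groups):
      unreadable       - names whose header matches no known signature
      duplicate_groups - lists of names that share byte-identical content
    """
    unreadable = []
    by_hash = defaultdict(list)
    for name, data in sorted(files.items()):
        if sniff_format(data) is None:
            unreadable.append(name)
            continue
        # Exact-duplicate detection only; near-duplicates need perceptual
        # hashing or embedding similarity, which is out of scope here.
        by_hash[hashlib.sha256(data).hexdigest()].append(name)
    duplicate_groups = [names for names in by_hash.values() if len(names) > 1]
    return unreadable, duplicate_groups
```

In a real pipeline, the same audit would likely also decode each image fully and record the results alongside a dataset version, so that every training run can be traced to the exact filtered file list.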

Tasks

Design the end-to-end approach to train and evaluate an image classifier.
Describe how you would measure the “dirty rate” of the image data (what counts as dirty, how to estimate it reliably).
Follow-up: After training a baseline, you discover performance is worse than expected. Explain how you would identify data problems (not just model problems) and propose concrete data and pipeline improvements.
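One common way to approach the dirty-rate task under a limited labeling budget is to manually audit a uniform random sample and report the observed rate with a confidence interval. The sketch below assumes that approach and uses a Wilson score interval; the helper names (`sample_for_audit`, `wilson_interval`) and the 95% level (`z=1.96`) are illustrative choices, not part of the question.

```python
import math
import random


def sample_for_audit(image_ids, n, seed=0):
    """Draw a reproducible uniform random sample of IDs for manual review."""
    rng = random.Random(seed)  # fixed seed so the audit set can be regenerated
    ids = list(image_ids)
    return rng.sample(ids, min(n, len(ids)))


def wilson_interval(dirty, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    dirty: number of audited images judged dirty
    n:     number of audited images
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = dirty / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    audit_ids = sample_for_audit(range(100_000), n=200, seed=42)
    # Suppose reviewers mark 30 of the 200 sampled images as dirty
    # (made-up numbers for illustration only).
    low, high = wilson_interval(dirty=30, n=len(audit_ids))
    print(f"estimated dirty rate: 15.0% (95% CI {low:.1%} to {high:.1%})")
```

A design note on this sketch: the estimate is only as reliable as the definition of "dirty" given to reviewers, so it usually helps to audit each category separately (corrupted, mislabeled, duplicate, irrelevant) and to stratify the sample by image source or class when rates are expected to differ.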

Constraints / clarifications (you may state assumptions)

You may assume typical real-world constraints: limited labeling budget, heterogeneous image sources, and the need for reproducible training.
You should specify what metrics you would use (overall and per-class) and how you would validate improvements.
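To make the overall versus per-class distinction concrete, here is a minimal sketch that computes accuracy, per-class precision, recall, and F1, and macro-averaged F1 for a single-label, multi-class problem. The function name `per_class_report` and the toy labels are hypothetical; in practice a library such as scikit-learn would likely be used instead, and the metrics should be computed on a held-out set whose labels have been cleaned.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (report, accuracy, macro_f1) for single-label predictions.

    report maps each class to its precision, recall, F1, and support.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    support = Counter(y_true)
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": support[c]}

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(r["f1"] for r in report.values()) / len(report)
    return report, accuracy, macro_f1
```

Macro F1 weights every class equally, so it tends to expose weak minority classes that overall accuracy hides. To validate a data-cleaning change, one option is to compare these numbers before and after the change on the same fixed, audited evaluation set.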
