PracHub

How would you build an image classifier with dirty data?

Last updated: Apr 13, 2026

Quick Overview

This question evaluates a candidate's ability to design end-to-end image classification systems and manage noisy image datasets, testing competencies in data validation, labeling strategy, pipeline reproducibility, and diagnostic analysis.



Company: OpenAI

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: easy

Interview Round: Onsite



Jan 6, 2026, 12:00 AM

Scenario

You are asked to build an image classification model (single-label, multi-class) for a product team. The image dataset is known to be dirty (e.g., corrupted files, wrong labels, duplicates, irrelevant images, inconsistent formats). Compared with text classification, image inputs often require additional preprocessing and validation.
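As one illustration of the extra validation image inputs need, the sketch below flags files whose leading bytes do not match a known image format and groups byte-identical duplicates. It is a minimal sketch, not a full validator: the helper names `sniff_format` and `scan_files` and the in-memory `files` mapping are hypothetical, the magic-byte table covers only three formats, and a passing header does not prove the file decodes (truncated images still need a real decoder). Near-duplicates such as resized or re-encoded copies are out of scope here.

```python
import hashlib
from collections import defaultdict

# Leading "magic" bytes for a few common image formats (not exhaustive).
MAGIC_PREFIXES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_format(data: bytes):
    """Return a format name based on the leading bytes, or None if unrecognized."""
    for name, prefix in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return name
    return None


def scan_files(files: dict):
    """Scan a mapping of path -> raw bytes (hypothetical input shape).

    Returns (unreadable, duplicate_groups):
      unreadable       -- sorted paths that are empty or have an unknown header
      duplicate_groups -- lists of paths whose bytes are exactly identical
    """
    unreadable = []
    by_hash = defaultdict(list)
    for path, data in files.items():
        if not data or sniff_format(data) is None:
            unreadable.append(path)
            continue
        # Exact-duplicate detection only: identical bytes hash identically.
        by_hash[hashlib.sha256(data).hexdigest()].append(path)
    duplicate_groups = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    return sorted(unreadable), duplicate_groups
```

Checks like these are cheap enough to run on every file, which leaves the labeling budget for problems that need a human, such as wrong labels and irrelevant images.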

Tasks

  1. Design the end-to-end approach to train and evaluate an image classifier.
  2. Describe how you would measure the “dirty rate” of the image data (what counts as dirty, how to estimate it reliably).
  3. Follow-up: After training a baseline, you discover performance is worse than expected. Explain how you would identify data problems (not just model problems) and propose concrete data and pipeline improvements.
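For task 2, one possible way to estimate the dirty rate under a limited labeling budget is to audit a uniform random sample by hand and report a confidence interval instead of a single number. The sketch below computes a Wilson score interval for a binomial proportion. The function name `wilson_interval` and the default `z = 1.96` (roughly 95% confidence) are choices made for this example, and the approach assumes the audited images were drawn uniformly at random and that "dirty" has been defined up front.

```python
import math


def wilson_interval(dirty: int, n: int, z: float = 1.96):
    """Wilson score interval for the fraction of dirty items.

    dirty -- number of audited images judged dirty
    n     -- number of audited images (uniform random sample, by assumption)
    z     -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0 or not 0 <= dirty <= n:
        raise ValueError("need n > 0 and 0 <= dirty <= n")
    p = dirty / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, `wilson_interval(30, 200)` gives the range consistent with finding 30 dirty images in an audit of 200. If the interval is too wide to act on, the audit sample can be enlarged. Running the same audit per class or per image source shows where the dirt is concentrated.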

Constraints / clarifications (you may state assumptions)

  • You may assume typical real-world constraints: limited labeling budget, heterogeneous image sources, and the need for reproducible training.
  • You should specify what metrics you would use (overall and per-class) and how you would validate improvements.
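To make "overall and per-class" metrics concrete, the sketch below computes accuracy, macro-averaged F1, and per-class precision and recall from paired label lists. The function name `per_class_report` and the returned dictionary layout are assumptions for this example, and scoring an undefined precision or recall as 0.0 is one convention among several. With a dirty dataset, these numbers are only as trustworthy as the evaluation labels, so they are best computed on a held-out set whose labels have been verified.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (accuracy, macro_f1, report) for single-label predictions.

    report maps each class to its precision, recall, and F1.
    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    report = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)
    return accuracy, macro_f1, report
```

Macro-F1 weights every class equally, so a rare class with poor recall lowers it even when overall accuracy looks acceptable. To validate an improvement, compare the before and after numbers on the same frozen evaluation split.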

