Design an image/video near-duplicate detection system
Company: OpenAI
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
## Question
Design a system to detect near-duplicate images/videos (e.g., reuploads, minor edits, different encodes) at large scale.
## Requirements
- Support both images and videos.
- Robust to resizing, cropping, re-encoding, watermarks, and small edits.
- High-throughput ingestion; low-latency queries for takedown, merge, and dedup workflows.
- Handle billions of media items.
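
To make the "billions of media items" requirement concrete, a rough storage estimate helps when comparing fingerprint types. The sketch below is back-of-envelope arithmetic only. The fingerprint sizes (a 64-bit perceptual hash and a 512-dimensional float32 embedding) are illustrative assumptions, not part of the question, and the figures exclude index overhead, replication, and per-frame video fingerprints.

```python
def raw_fingerprint_gib(num_items: int, bytes_per_item: int) -> float:
    """Raw fingerprint storage in GiB, ignoring any index overhead."""
    return num_items * bytes_per_item / 2**30


NUM_ITEMS = 1_000_000_000  # one billion items (assumed scale for the estimate)

# Assumption: one 64-bit perceptual hash per item (8 bytes).
hash_gib = raw_fingerprint_gib(NUM_ITEMS, 8)

# Assumption: one 512-dimensional float32 embedding per item (2048 bytes).
embedding_gib = raw_fingerprint_gib(NUM_ITEMS, 512 * 4)

print(f"64-bit hashes:      {hash_gib:,.2f} GiB per billion items")
print(f"512-d float32 embs: {embedding_gib:,.2f} GiB per billion items")
```

Under these assumptions the hashes fit in memory on a single machine, while uncompressed embeddings need sharding or compression, which is one reason the two are often discussed as a cheap first stage and a more robust second stage.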
## Deliverables
- Fingerprinting approach (perceptual hashing vs embeddings).
- Indexing and retrieval architecture.
- Thresholding, evaluation, and operational concerns (false positives, adversarial behavior).
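
As one illustration of the perceptual-hashing side of the first deliverable, here is a minimal sketch of an average hash with a Hamming-distance threshold. It assumes the image is already decoded to a grayscale 2D array. The function names and the default threshold of 10 bits are hypothetical choices for the example; a real threshold would be tuned on labeled duplicate and non-duplicate pairs.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale pixel values


def downsample(img: Gray, out_h: int = 8, out_w: int = 8) -> List[List[float]]:
    """Block-average a grayscale image down to out_h x out_w cells."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        r0 = r * in_h // out_h
        r1 = max((r + 1) * in_h // out_h, r0 + 1)
        row = []
        for c in range(out_w):
            c0 = c * in_w // out_w
            c1 = max((c + 1) * in_w // out_w, c0 + 1)
            block = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(img: Gray) -> int:
    """64-bit hash: one bit per cell, set when the cell is brighter than the mean."""
    cells = [p for row in downsample(img) for p in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for p in cells:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 10) -> bool:
    """Hypothetical decision rule: hashes within max_distance bits count as a match."""
    return hamming(average_hash(a), average_hash(b)) <= max_distance


# Toy data: a diagonal gradient, a 2x upscaled copy, and an inverted copy.
base = [[x + y for x in range(32)] for y in range(32)]
upscaled = [[(x // 2) + (y // 2) for x in range(64)] for y in range(64)]
inverted = [[255 - p for p in row] for row in base]

print(is_near_duplicate(base, upscaled))  # resized copy should match
print(is_near_duplicate(base, inverted))  # inverted copy should not
```

A hash like this tolerates resizing and uniform brightness shifts but not heavy crops or overlays, which is where learned embeddings are usually considered instead. For video, one common approach is to fingerprint sampled frames and compare the resulting sequences.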
Quick Answer: This question evaluates competency in ML system design and large-scale multimedia retrieval, focusing on perceptual fingerprinting versus embedding strategies, scalable indexing and nearest-neighbor retrieval, and robustness to resizing, re-encoding, watermarks, minor edits, and adversarial manipulations.
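
For the indexing and retrieval part, one simple candidate-generation scheme for binary fingerprints is banding: split each hash into equal bands and bucket items by each band value. By the pigeonhole principle, two hashes that differ in fewer bits than there are bands must agree exactly on at least one band, so a lookup followed by an exact Hamming check finds every match within that radius. The sketch below is an in-memory toy under those assumptions. The class and method names are hypothetical, and a production system would shard the tables and likely use an approximate nearest-neighbor library for embeddings.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class BandedHammingIndex:
    """Toy exact index for small Hamming radii over fixed-width binary hashes."""

    def __init__(self, bits: int = 64, bands: int = 4) -> None:
        if bits % bands != 0:
            raise ValueError("bits must be divisible by bands")
        self.bands = bands
        self.band_bits = bits // bands
        self.mask = (1 << self.band_bits) - 1
        # One hash table per band: band value -> ids of items with that value.
        self.tables: List[Dict[int, Set[str]]] = [
            defaultdict(set) for _ in range(bands)
        ]
        self.hashes: Dict[str, int] = {}

    def _band_keys(self, h: int) -> List[int]:
        return [(h >> (i * self.band_bits)) & self.mask for i in range(self.bands)]

    def add(self, item_id: str, h: int) -> None:
        self.hashes[item_id] = h
        for table, key in zip(self.tables, self._band_keys(h)):
            table[key].add(item_id)

    def query(self, h: int, max_distance: int) -> List[Tuple[int, str]]:
        """Return (distance, item_id) pairs within max_distance, nearest first."""
        # The pigeonhole guarantee only holds when max_distance < number of bands.
        if max_distance >= self.bands:
            raise ValueError("max_distance must be smaller than the number of bands")
        candidates: Set[str] = set()
        for table, key in zip(self.tables, self._band_keys(h)):
            candidates |= table.get(key, set())
        # Verify every candidate with an exact Hamming distance.
        scored = [(bin(self.hashes[i] ^ h).count("1"), i) for i in candidates]
        return sorted(pair for pair in scored if pair[0] <= max_distance)


index = BandedHammingIndex(bits=64, bands=4)
index.add("a", 0xFFFF0000FFFF0000)
index.add("b", 0x0123456789ABCDEF)

# Query with two bits flipped relative to item "a".
matches = index.query(0xFFFF0000FFFF0000 ^ 0b101, max_distance=3)
print(matches)
```

More bands allow a larger search radius at the cost of more tables and larger candidate sets, so the band count is a tuning knob that trades recall radius against memory and verification cost.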