Scenario
You have a web crawler that collects raw HTML/PDF documents. You want to build a pipeline that identifies high-quality math documents suitable for downstream use (e.g., search, dataset creation, or training).
Task
Design an end-to-end system to:
- Extract math content from crawled pages.
- Score and filter documents for quality.
- Deduplicate and enforce licensing/safety constraints (a minimal pipeline sketch follows this list).
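
A minimal sketch of how these stages could be chained as a single batch pass, assuming a Python pipeline; the `Document` fields, the placeholder extract/score bodies, and the hash-based exact dedup are illustrative stand-ins, not part of the required design:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    mime_type: str        # e.g. "text/html", "application/pdf", "image/png"
    raw: bytes
    text: str = ""        # extracted text with math markup preserved
    quality: float = 0.0
    license_ok: bool = True


def extract(doc: Document) -> Document:
    # Placeholder: in practice, route by MIME type to an HTML parser,
    # a PDF text extractor, or an OCR engine for scanned images.
    doc.text = doc.raw.decode("utf-8", errors="ignore")
    return doc


def score(doc: Document) -> Document:
    # Placeholder: combine math-density, coherence, and spam signals.
    doc.quality = 1.0 if ("$" in doc.text or "<math" in doc.text) else 0.0
    return doc


def run_batch(batch: list[Document], threshold: float = 0.5) -> list[Document]:
    seen_hashes: set[str] = set()
    kept: list[Document] = []
    for doc in batch:
        doc = score(extract(doc))
        if doc.quality < threshold or not doc.license_ok:
            continue  # quality / licensing / safety filter
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate; near-duplicate removal (e.g. MinHash) would slot in here
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```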
Requirements
- Handle HTML, PDF, and scanned images.
- Favor documents with substantial, correct mathematical content (not spam or low-effort copies); a density-based signal is sketched after this list.
- Scale to tens/hundreds of millions of documents.
- Provide measurable quality metrics and a human review loop.
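
For the "substantial mathematical content" requirement, one concrete signal is the density of math markup in the extracted text. A sketch under the assumption that extraction preserves LaTeX/MathML markup; the patterns and thresholds are illustrative, not calibrated values:

```python
import re

# Illustrative markers of math content in extracted text; the pattern list
# is an assumption, not a validated feature set. Patterns may overlap, so the
# density is a coarse heuristic capped at 1.0.
MATH_PATTERNS = [
    r"\$\$.+?\$\$",          # display-mode LaTeX
    r"\$[^$\n]+\$",          # inline LaTeX
    r"\\begin\{(equation|align|theorem|proof)\}",
    r"<math[\s>]",           # MathML
]


def math_density(text: str) -> float:
    """Approximate fraction of characters inside recognized math markup."""
    if not text:
        return 0.0
    math_chars = sum(
        len(m.group(0))
        for pattern in MATH_PATTERNS
        for m in re.finditer(pattern, text, flags=re.DOTALL)
    )
    return min(1.0, math_chars / len(text))


def looks_substantial(text: str, min_density: float = 0.05, min_len: int = 2000) -> bool:
    # Illustrative thresholds; real values would be calibrated on labeled data.
    return len(text) >= min_len and math_density(text) >= min_density
```

A density signal like this would typically be one feature among several (coherence, symbol-to-word ratio, spam heuristics) feeding the scoring stage sketched above.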
Deliverables
Architecture, key features/signals, modeling approach, evaluation, and operations (monitoring, drift, reprocessing).
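
For the evaluation deliverable and the human review loop, one measurable setup is to sample documents from both sides of the filter for human labeling and track the filter's precision and recall over time. A sketch; the sample sizes and label format are assumptions:

```python
import random


def sample_for_review(kept_ids: list[str], dropped_ids: list[str],
                      n_per_side: int = 200, seed: int = 0) -> list[str]:
    # Sample from both kept and dropped documents so reviewers can estimate
    # precision (on kept docs) and recall (on dropped docs).
    rng = random.Random(seed)
    return (rng.sample(kept_ids, min(n_per_side, len(kept_ids)))
            + rng.sample(dropped_ids, min(n_per_side, len(dropped_ids))))


def precision_recall(labels: dict[str, bool], kept: set[str]) -> tuple[float, float]:
    """labels: doc_id -> human judgment "is high-quality math"; kept: filter output."""
    tp = sum(1 for d, good in labels.items() if good and d in kept)
    fp = sum(1 for d, good in labels.items() if not good and d in kept)
    fn = sum(1 for d, good in labels.items() if good and d not in kept)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```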