Scenario
You have a web crawler that collects raw HTML/PDF documents. You want to build a pipeline that identifies high-quality math documents suitable for downstream use (e.g., search, dataset creation, or training).
Task
Design an end-to-end system to:
- Extract math content from crawled pages.
- Score and filter documents for quality.
- Deduplicate and enforce licensing/safety constraints (a minimal pipeline sketch follows this list).
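
A minimal sketch of how these stages could be chained as a single batch pass, assuming a Python pipeline; the `Document` fields, the placeholder extract/score bodies, and the hash-based exact dedup are illustrative stand-ins, not part of the required design:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    mime_type: str        # e.g. "text/html", "application/pdf", "image/png"
    raw: bytes
    text: str = ""        # extracted text with math markup preserved
    quality: float = 0.0
    license_ok: bool = True


def extract(doc: Document) -> Document:
    # Placeholder: in practice, route by MIME type to an HTML parser,
    # a PDF text extractor, or an OCR engine for scanned images.
    doc.text = doc.raw.decode("utf-8", errors="ignore")
    return doc


def score(doc: Document) -> Document:
    # Placeholder: combine math-density, coherence, and spam signals.
    doc.quality = 1.0 if ("$" in doc.text or "<math" in doc.text) else 0.0
    return doc


def run_batch(batch: list[Document], threshold: float = 0.5) -> list[Document]:
    seen_hashes: set[str] = set()
    kept: list[Document] = []
    for doc in batch:
        doc = score(extract(doc))
        if doc.quality < threshold or not doc.license_ok:
            continue  # quality / licensing / safety filter
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate; near-duplicate removal (e.g. MinHash) would slot in here
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```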
Requirements
- Handle HTML, PDF, and scanned images.
- Favor documents with substantial, correct mathematical content (not spam or low-effort copies); a density-based signal is sketched after this list.
- Scale to tens/hundreds of millions of documents.
- Provide measurable quality metrics and a human review loop.
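
For the "substantial mathematical content" requirement, one concrete signal is the density of math markup in the extracted text. A sketch under the assumption that extraction preserves LaTeX/MathML markup; the patterns and thresholds are illustrative, not calibrated values:

```python
import re

# Illustrative markers of math content in extracted text; the pattern list
# is an assumption, not a validated feature set. Patterns may overlap, so the
# density is a coarse heuristic capped at 1.0.
MATH_PATTERNS = [
    r"\$\$.+?\$\$",          # display-mode LaTeX
    r"\$[^$\n]+\$",          # inline LaTeX
    r"\\begin\{(equation|align|theorem|proof)\}",
    r"<math[\s>]",           # MathML
]


def math_density(text: str) -> float:
    """Approximate fraction of characters inside recognized math markup."""
    if not text:
        return 0.0
    math_chars = sum(
        len(m.group(0))
        for pattern in MATH_PATTERNS
        for m in re.finditer(pattern, text, flags=re.DOTALL)
    )
    return min(1.0, math_chars / len(text))


def looks_substantial(text: str, min_density: float = 0.05, min_len: int = 2000) -> bool:
    # Illustrative thresholds; real values would be calibrated on labeled data.
    return len(text) >= min_len and math_density(text) >= min_density
```

A density signal like this would typically be one feature among several (coherence, symbol-to-word ratio, spam heuristics) feeding the scoring stage sketched above.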
Deliverables
Architecture, key features/signals, modeling approach, evaluation, and operations (monitoring, drift, reprocessing).
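
For the evaluation deliverable and the human review loop, one measurable setup is to sample documents from both sides of the filter for human labeling and track the filter's precision and recall over time. A sketch; the sample sizes and label format are assumptions:

```python
import random


def sample_for_review(kept_ids: list[str], dropped_ids: list[str],
                      n_per_side: int = 200, seed: int = 0) -> list[str]:
    # Sample from both kept and dropped documents so reviewers can estimate
    # precision (on kept docs) and recall (on dropped docs).
    rng = random.Random(seed)
    return (rng.sample(kept_ids, min(n_per_side, len(kept_ids)))
            + rng.sample(dropped_ids, min(n_per_side, len(dropped_ids))))


def precision_recall(labels: dict[str, bool], kept: set[str]) -> tuple[float, float]:
    """labels: doc_id -> human judgment "is high-quality math"; kept: filter output."""
    tp = sum(1 for d, good in labels.items() if good and d in kept)
    fp = sum(1 for d, good in labels.items() if not good and d in kept)
    fn = sum(1 for d, good in labels.items() if good and d not in kept)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```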