This ML system design question tests the ability to design scalable, production-grade pipelines that extract and quality-score mathematical content from heterogeneous formats (HTML, PDF, scanned images). It exercises information extraction and OCR, document scoring and deduplication, licensing and safety enforcement, and operational monitoring. It is commonly asked because real-world applications must handle noisy, web-crawled data at scale while delivering measurable quality metrics and supporting human review workflows, which makes it a practical probe of high-level system architecture, modeling trade-offs, and production operations.
You have a web crawler that collects raw HTML/PDF documents. You want to build a pipeline that identifies high-quality math documents suitable for downstream use (e.g., search, dataset creation, or training).
Design an end-to-end system to do this. Your answer should cover the architecture, key features/signals, modeling approach, evaluation, and operations (monitoring, drift, reprocessing).
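To make "key features/signals" concrete, here is a minimal sketch of a cheap first-pass stage, assuming the document has already been converted to HTML or text. It counts LaTeX and MathML markers, derives a math-density signal, and fingerprints normalized text for exact-duplicate removal. Every name, regex, and threshold here (`extract_signals`, `is_candidate`, `min_density`, and so on) is hypothetical and chosen for illustration; a real answer would likely replace the threshold rule with a trained classifier and the exact hash with near-duplicate detection.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical patterns; a real system would tune these on labeled data.
# Known weakness: "$5 and $10" also matches the inline pattern.
LATEX_INLINE = re.compile(r"\$[^$\n]+\$|\\\([^\n]+?\\\)")
LATEX_ENV = re.compile(r"\\begin\{(?:equation|align|gather|theorem|proof)\*?\}")
MATHML_TAG = re.compile(r"<math[\s>]", re.IGNORECASE)
# Naive tag stripper; it can also swallow text between "<" and ">" in formulas.
HTML_TAG = re.compile(r"<[^>]+>")


@dataclass
class MathSignals:
    n_tokens: int
    n_latex: int
    n_mathml: int
    math_density: float  # math markers per 100 whitespace-separated tokens


def extract_signals(raw_html: str) -> MathSignals:
    """Compute cheap lexical signals that hint at mathematical content."""
    n_mathml = len(MATHML_TAG.findall(raw_html))  # count before tags are stripped
    text = HTML_TAG.sub(" ", raw_html)
    n_latex = len(LATEX_INLINE.findall(text)) + len(LATEX_ENV.findall(text))
    n_tokens = len(text.split())
    density = 100.0 * (n_latex + n_mathml) / max(n_tokens, 1)
    return MathSignals(n_tokens, n_latex, n_mathml, density)


def is_candidate(sig: MathSignals, min_tokens: int = 50,
                 min_density: float = 1.0) -> bool:
    """Threshold rule standing in for a learned quality model (assumption)."""
    return sig.n_tokens >= min_tokens and sig.math_density >= min_density


def fingerprint(text: str) -> str:
    """Exact-duplicate key: SHA-256 of lowercased, whitespace-collapsed text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Documents that pass `is_candidate` and have an unseen `fingerprint` would move on to the more expensive stages (OCR, model scoring, license checks, human review); logging the signal values per document also gives the monitoring layer something to track for drift.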