The question evaluates a candidate's ability to design large-scale distributed systems and web-crawling infrastructure. It tests competencies such as URL frontier partitioning and deduplication, politeness and rate limiting, prioritization, retry and idempotency strategies, coordination and backpressure, storage schemas, monitoring, capacity planning, safety controls, and API/data-model design. Commonly asked in System Design interviews, it probes architectural thinking and trade-offs around scalability, heterogeneity, reliability, and operational controls; it primarily tests practical application and system-architecture skills while requiring a conceptual understanding of distributed-systems principles.
You are asked to design a production-grade web crawler that begins from a single seed URL and scales across 1,000 heterogeneous devices acting as distributed crawl workers. Devices vary in CPU, memory, network quality, and reliability. The system must be safe, polite, and resilient.
Assume: managed devices under your control; cross-Internet crawling; a long-running "campaign" that can be paused/resumed; and a need for near-real-time visibility into progress.
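The pause/resume and near-real-time visibility requirements above imply a small amount of coordinator-side campaign state. Below is a minimal sketch of what that state could look like, assuming Python and hypothetical names (Campaign, CampaignState, ProgressSnapshot) that are not part of the original prompt.

```python
from dataclasses import dataclass, field
from enum import Enum
from time import time

# Hypothetical campaign model for pause/resume and progress visibility.
# All names here are illustrative assumptions, not given in the prompt.

class CampaignState(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"

@dataclass
class ProgressSnapshot:
    """Point-in-time counters a coordinator could expose for near-real-time visibility."""
    urls_discovered: int = 0
    urls_fetched: int = 0
    urls_failed: int = 0
    active_workers: int = 0
    captured_at: float = field(default_factory=time)

@dataclass
class Campaign:
    campaign_id: str
    seed_url: str
    state: CampaignState = CampaignState.RUNNING
    progress: ProgressSnapshot = field(default_factory=ProgressSnapshot)

    def pause(self) -> None:
        # Pausing only flips coordinator state; workers would drain
        # in-flight fetches and stop pulling new URLs from the frontier.
        self.state = CampaignState.PAUSED

    def resume(self) -> None:
        self.state = CampaignState.RUNNING
```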
Design the system covering the following areas: URL frontier partitioning and deduplication; politeness and rate limiting; prioritization; retry and idempotency; coordination and backpressure; storage schemas; monitoring and capacity planning; safety controls; and API/data-model design.
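As one illustration of the first two areas, the sketch below partitions the URL frontier by hashing each URL's host, so that all URLs for a given host land on the same worker and per-host politeness state stays local to that worker. The names, the SHA-1 hash, the worker count constant, and the 1-second default delay are assumptions for illustration, not requirements from the prompt.

```python
import hashlib
import time
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative constant; the prompt specifies 1,000 heterogeneous devices.
NUM_WORKERS = 1000

def worker_for_url(url: str) -> int:
    """Route every URL for the same host to the same worker partition."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

class PolitenessGate:
    """Tracks the last fetch time per host and enforces a minimum delay between fetches."""

    def __init__(self, min_delay_seconds: float = 1.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = defaultdict(float)  # host -> last fetch timestamp

    def wait_time(self, url: str) -> float:
        """Seconds the caller should wait before fetching this URL's host again."""
        host = urlparse(url).netloc.lower()
        elapsed = time.time() - self.last_fetch[host]
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, url: str) -> None:
        self.last_fetch[urlparse(url).netloc.lower()] = time.time()
```

Hashing by host keeps politeness enforcement simple, but very large hosts can become hot partitions, which is one of the heterogeneity and load-balancing trade-offs the question is designed to surface.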