Distributed Web Crawler: Design for 1,000 Devices
Context
Design a production-ready web crawler that starts from a single seed URL and scales crawling across 1,000 heterogeneous devices. The crawler should respect robots.txt and per-host politeness constraints, deduplicate URLs/content, and persist pages and metadata.
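To ground the brief, here is a minimal single-node sketch of the crawl loop, assuming a hypothetical seed URL (`example.com`), a crude regex link extractor, and in-memory frontier and dedup structures; each of these pieces is what the distributed design replaces with shared services. It is an illustration of the flow, not a prescribed implementation.

```python
# Minimal single-node crawl loop: seed -> robots.txt check -> fetch ->
# dedup -> enqueue discovered links. All names and limits are illustrative.
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

SEED = "https://example.com/"          # hypothetical seed URL
LINK_RE = re.compile(r'href="(.*?)"')  # crude link extraction, for illustration only

def allowed(url: str) -> bool:
    """Check robots.txt for the URL's host (re-fetched per call here for
    brevity; a real crawler caches the parsed rules per host)."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False
    return rp.can_fetch("*", url)

def crawl(seed: str, limit: int = 50) -> None:
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URL-level deduplication
    while frontier and limit > 0:
        url = frontier.popleft()
        if not allowed(url):
            continue
        try:
            with urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue           # a real crawler would schedule a retry instead
        limit -= 1
        # Persisting the page and metadata is elided; print as a stand-in.
        print(f"fetched {url} ({len(body)} bytes)")
        for href in LINK_RE.findall(body):
            nxt = urljoin(url, href)
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)

if __name__ == "__main__":
    crawl(SEED)
```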
Requirements
- Start from one seed link and discover new URLs recursively.
- Distribute crawling across ~1,000 devices (see the assignment sketch after this list).
- Address:
  - Coordination of work and state
  - Load balancing and throttling
  - Fault tolerance and recovery
  - Scalability and typical follow-ups
- Assume an internet-scale target with diverse domains and varying latency.
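To make the distribution requirement concrete, the sketch below shows one common URL-assignment scheme, rendezvous hashing on the URL's host; the device IDs and the choice of hashing are assumptions for illustration, not part of the brief. Hashing by host keeps every URL of a host on a single device, which also localizes that host's politeness state.

```python
# Rendezvous (highest-random-weight) hashing of hosts onto hypothetical
# device IDs "device-0000" .. "device-0999".
import hashlib
from urllib.parse import urlparse

DEVICES = [f"device-{i:04d}" for i in range(1000)]  # hypothetical device IDs

def _score(device: str, host: str) -> int:
    """Deterministic per-(device, host) score from a SHA-1 prefix."""
    return int.from_bytes(hashlib.sha1(f"{device}|{host}".encode()).digest()[:8], "big")

def owner(url: str) -> str:
    """Return the device responsible for this URL's host: the device with the
    highest score wins, so removing a device only reassigns the hosts it owned."""
    host = urlparse(url).netloc.lower()
    return max(DEVICES, key=lambda d: _score(d, host))

if __name__ == "__main__":
    for u in ("https://example.com/a", "https://example.com/b", "https://example.org/"):
        print(u, "->", owner(u))
```

A consistent-hashing ring would serve equally well; rendezvous hashing is shown only because it fits in a few lines, at the cost of scoring every device per lookup.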
Deliverables
- High-level architecture and data flow
- How URLs are assigned, deduplicated, and scheduled
- Policies for robots.txt, per-host rate limits, and retries (a politeness sketch follows this list)
- Storage approach for frontier state and fetched content
- Specific mechanisms for coordination, load balancing, fault tolerance, and scaling
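As a starting point for the politeness and retry deliverable, the sketch below assumes a default spacing of one request per second per host and exponential backoff; `fetch` is a caller-supplied callable, and the delays and attempt counts are illustrative rather than prescribed by the brief.

```python
# Per-host politeness spacing plus retry-with-backoff for transient errors.
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay: float = 1.0) -> None:
        self.min_delay = min_delay          # assumed default: 1 request/sec/host
        self.next_ok: dict[str, float] = {}

    def wait_turn(self, url: str) -> None:
        """Block until the URL's host is allowed another request, then
        reserve the next slot."""
        host = urlparse(url).netloc.lower()
        now = time.monotonic()
        ready_at = self.next_ok.get(host, now)
        if ready_at > now:
            time.sleep(ready_at - now)
        self.next_ok[host] = max(ready_at, now) + self.min_delay

def fetch_with_retries(url: str, fetch, scheduler: PolitenessScheduler,
                       attempts: int = 3, backoff: float = 2.0):
    """Fetch through the scheduler, retrying transient failures with
    exponential backoff (2s, 4s, ...); the final failure is re-raised."""
    for attempt in range(attempts):
        scheduler.wait_turn(url)
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```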