System Design: Distributed Web Crawler (1,000 Heterogeneous Workers)
Context
You are asked to design a production-grade web crawler that begins from a single seed URL and scales across 1,000 heterogeneous devices acting as distributed crawl workers. Devices vary in CPU, memory, network quality, and reliability. The system must be safe, polite, and resilient.
Assume: managed devices under your control; cross-Internet crawling; a long-running "campaign" that can be paused/resumed; and a need for near-real-time visibility into progress.
Requirements
Design the system covering the following:
- URL frontier partitioning and deduplication
  - URL canonicalization; per-host/domain sharding; preventing duplicate enqueues; revisit policies (sketch 1 below).
- Politeness policies
  - robots.txt parsing and caching; honoring crawl-delay and disallow rules; per-host/domain rate limiting and concurrency (sketch 2 below).
- Prioritization
  - How URLs are ranked (depth, freshness, host/domain fairness, heuristics); see sketch 3 below.
- Retry strategy and content deduplication
  - Transient vs. permanent errors; backoff; redirect handling; exact and near-duplicate content detection (sketch 4 below).
- Failure handling and idempotency
  - Worker crashes; leases; replay; idempotent writes; dead-letter queues (sketch 5 below).
- Coordination, work assignment, backpressure, and fetch semantics
  - Consistent hashing and/or queues; worker capability awareness; backpressure signaling; exactly-once vs. at-least-once effects (sketch 6 below).
- Storage schemas
  - Fetched pages and blobs; metadata and link graph; robots cache; frontier persistence (sketch 7 below).
- Monitoring, alerting, and debugging
  - Metrics, logs, traces; per-host dashboards; sampling and replay (sketch 8 below).
- Capacity planning
  - Estimating throughput, bandwidth, and storage; formulas and a concrete example (sketch 9 below).
- Safety controls
  - Global and per-host limits; allow/block lists; legal/regulatory controls; kill-switches (sketch 10 below).
- External APIs and data models
  - Enqueueing new URLs; querying status; fetching results (sketch 11 below).
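Sketch 1: one possible shape for frontier keys, in Python. Canonicalizing before enqueue is what makes deduplication meaningful; sharding by host keeps all politeness state for a host on one owner. `NUM_SHARDS` and the SHA-1 choice are illustrative assumptions.

```python
# Sketch 1: URL canonicalization and per-host frontier sharding (illustrative).
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

NUM_SHARDS = 1024  # assumed shard count; tune to fleet size

def canonicalize(url: str) -> str:
    """Normalize a URL so syntactic variants dedupe to one frontier key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower().rstrip(".")
    # Drop default ports and fragments; sort query params for a stable form.
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))

def shard_for(url: str) -> int:
    """Shard by host so one worker owns all politeness state for that host."""
    host = urlsplit(canonicalize(url)).hostname or ""
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

print(canonicalize("HTTP://Example.com:80/a?b=2&a=1#frag"))
# -> http://example.com/a?a=1&b=2
print(shard_for("http://example.com/a"))
```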
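Sketch 2: a per-host politeness gate, assuming robots.txt has already been fetched and parsed elsewhere. `DEFAULT_DELAY_S` and one-connection-per-host are assumed defaults, not prescribed values.

```python
# Sketch 2: per-host crawl-delay and concurrency gate (illustrative defaults).
import time
from dataclasses import dataclass

DEFAULT_DELAY_S = 1.0  # assumed floor when robots.txt gives no crawl-delay

@dataclass
class HostPoliteness:
    crawl_delay_s: float = DEFAULT_DELAY_S
    next_allowed_at: float = 0.0   # monotonic timestamp of next allowed fetch
    in_flight: int = 0
    max_concurrency: int = 1       # one connection per host by default

    def try_acquire(self, now: float) -> bool:
        """Return True if a fetch to this host may start right now."""
        if self.in_flight >= self.max_concurrency or now < self.next_allowed_at:
            return False
        self.in_flight += 1
        return True

    def release(self, now: float) -> None:
        """Record fetch completion and schedule the next allowed slot."""
        self.in_flight -= 1
        self.next_allowed_at = now + self.crawl_delay_s

hosts: dict[str, HostPoliteness] = {}
h = hosts.setdefault("example.com", HostPoliteness(crawl_delay_s=2.0))
now = time.monotonic()
if h.try_acquire(now):
    # ... perform the fetch, then release to start the delay clock ...
    h.release(time.monotonic())
```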
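Sketch 3: one way to combine the ranking signals into a single score. The features and weights here are assumptions chosen to illustrate the trade-offs, not a prescribed formula.

```python
# Sketch 3: weighted priority score; higher scores are fetched sooner.
def priority(depth: int, hours_since_seen: float | None,
             host_backlog: int, hint: float = 0.0) -> float:
    """Combine depth decay, freshness pressure, host fairness (penalize
    hosts with deep backlogs), and a heuristic hint (e.g., sitemap bonus)."""
    depth_term = 1.0 / (1 + depth)            # shallow pages first
    fresh_term = 0.0 if hours_since_seen is None else min(hours_since_seen / 24, 1.0)
    fairness_penalty = min(host_backlog / 1000, 1.0)
    return 2.0 * depth_term + 1.0 * fresh_term - 1.5 * fairness_penalty + hint

# A shallow URL on a quiet host beats a deep, stale URL on a congested host:
print(priority(depth=1, hours_since_seen=None, host_backlog=10))    # ~0.99
print(priority(depth=6, hours_since_seen=48.0, host_backlog=5000))  # ~-0.21
```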
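Sketch 4: transient-vs-permanent classification with capped exponential backoff and full jitter. The status groupings, base/cap values, and attempt limit are assumptions; content dedup (exact SHA-256 plus a near-dup signature such as simhash) would sit alongside this and is omitted here.

```python
# Sketch 4: retry decision for a failed fetch (illustrative status groupings).
import random

TRANSIENT = {429, 500, 502, 503, 504}   # retry with backoff
PERMANENT = {400, 401, 403, 404, 410}   # record and drop (no retry)
BASE_S, CAP_S, MAX_ATTEMPTS = 2.0, 300.0, 5

def next_action(status: int | None, attempt: int) -> tuple[str, float]:
    """Return ("retry", delay_s), ("give_up", 0.0), or ("drop", 0.0).
    A None status stands for a network-level error (timeout, reset)."""
    if status in PERMANENT:
        return ("drop", 0.0)
    if status is None or status in TRANSIENT:
        if attempt >= MAX_ATTEMPTS:
            return ("give_up", 0.0)     # route to dead-letter queue
        delay = random.uniform(0, min(CAP_S, BASE_S * 2 ** attempt))
        return ("retry", delay)
    return ("drop", 0.0)                # anything unrecognized: treat as permanent

print(next_action(503, attempt=2))      # e.g. ("retry", somewhere in 0-8s)
```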
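Sketch 5: lease-based claims with idempotent completion, using in-memory dicts as a stand-in for what would be a transactional store. `LEASE_TTL_S` and all names are illustrative; the point is that a crashed worker's lease simply expires and the URL becomes claimable again.

```python
# Sketch 5: leases + idempotent completion (in-memory stand-in).
import time
import uuid

LEASE_TTL_S = 120.0   # assumed lease length; crashed workers expire silently

leases: dict[str, tuple[str, float]] = {}   # url -> (worker_id, expires_at)
done: set[str] = set()                      # completed fetch keys

def claim(url: str, worker_id: str, now: float) -> bool:
    """Claim a URL unless it is completed or leased by a live worker."""
    if url in done:
        return False
    holder = leases.get(url)
    if holder and holder[1] > now:
        return False                        # someone else holds a live lease
    leases[url] = (worker_id, now + LEASE_TTL_S)
    return True

def complete(url: str, worker_id: str) -> None:
    """Idempotent: a replayed or duplicate ack is a harmless no-op."""
    holder = leases.get(url)
    if holder is None or holder[0] == worker_id:
        done.add(url)
        leases.pop(url, None)

w = str(uuid.uuid4())
now = time.monotonic()
assert claim("http://example.com/", w, now)
assert not claim("http://example.com/", "other-worker", now)  # lease held
complete("http://example.com/", w)
complete("http://example.com/", w)          # replayed ack is a no-op
```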
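Sketch 6: a minimal consistent-hash ring with virtual nodes, so host ownership moves minimally when workers join or leave. The vnode count and SHA-1 hashing are assumptions; capability-aware assignment could weight vnodes per worker.

```python
# Sketch 6: consistent hashing for host -> worker assignment (illustrative).
import bisect
import hashlib

class Ring:
    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._keys: list[int] = []          # sorted vnode hashes
        self._owner: dict[int, str] = {}    # vnode hash -> worker id

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def add(self, worker: str) -> None:
        for i in range(self.vnodes):
            k = self._h(f"{worker}#{i}")
            bisect.insort(self._keys, k)
            self._owner[k] = worker

    def remove(self, worker: str) -> None:
        for i in range(self.vnodes):
            k = self._h(f"{worker}#{i}")
            self._keys.remove(k)
            del self._owner[k]

    def owner_of(self, host: str) -> str:
        """Walk clockwise to the first vnode at or after the host's hash."""
        idx = bisect.bisect(self._keys, self._h(host)) % len(self._keys)
        return self._owner[self._keys[idx]]

ring = Ring()
for w in ("worker-1", "worker-2", "worker-3"):
    ring.add(w)
print(ring.owner_of("example.com"))   # stable until membership changes
```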
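Sketch 7: illustrative record shapes for the four stores, shown as Python dicts (the JSON forms would match). Every field name is an assumption; the design point is content-addressed blobs in object storage with metadata rows alongside.

```python
# Sketch 7: example rows for page metadata, link graph, robots cache, frontier.
page_meta = {
    "url": "http://example.com/a?a=1&b=2",    # canonical form, primary key
    "fetched_at": "2024-01-01T00:00:00Z",
    "status": 200,
    "content_sha256": "9f86d0...",            # exact-dup key (truncated example)
    "simhash64": "0x8c3f...",                 # near-dup signature (truncated example)
    "blob_ref": "s3://crawl-bucket/9f/86/9f86d0...",  # content-addressed blob
    "depth": 2,
    "redirects": ["http://example.com/old"],
}
link_edge = {"src": "http://example.com/a", "dst": "http://example.org/b",
             "anchor": "docs", "seen_at": "2024-01-01T00:00:00Z"}
robots_cache = {"host": "example.com", "fetched_at": "2024-01-01T00:00:00Z",
                "ttl_s": 86400, "crawl_delay_s": 2.0,
                "disallow": ["/private/", "/tmp/"]}
frontier_row = {"url": "http://example.org/b", "shard": 417, "priority": 0.93,
                "state": "queued",   # queued | leased | done | dead
                "attempts": 0, "lease_expires_at": None}
```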
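Sketch 8: one structured event per fetch is enough to drive per-host dashboards, sampling, and replay of a single URL's history. Field names here are assumptions.

```python
# Sketch 8: an illustrative structured fetch event (one JSON line per fetch).
import json
import time

def fetch_event(url: str, host: str, status: int, bytes_read: int,
                latency_ms: int, worker_id: str, trace_id: str) -> dict:
    return {
        "ts": time.time(), "event": "fetch",
        "url": url, "host": host,          # host is the per-host aggregation key
        "status": status, "bytes": bytes_read,
        "latency_ms": latency_ms,
        "worker": worker_id, "trace_id": trace_id,  # joins logs across hops
    }

print(json.dumps(fetch_event("http://example.com/a", "example.com",
                             200, 15_360, 182, "worker-7", "tr-42")))
```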
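Sketch 9: the kind of back-of-the-envelope arithmetic the capacity item asks for. Every input number below is an assumption chosen to make the formulas concrete.

```python
# Sketch 9: throughput, bandwidth, and storage estimates (assumed inputs).
workers = 1_000
fetches_per_worker_s = 2      # assumed effective sustained rate per device
avg_page_kb = 100             # assumed average fetched size
campaign_days = 30
dedup_keep_ratio = 0.7        # assumed fraction of fetches kept after dedup

pages_per_s = workers * fetches_per_worker_s              # 2,000 pages/s
bandwidth_mb_s = pages_per_s * avg_page_kb / 1024         # ~195 MB/s fleetwide
pages_total = pages_per_s * 86_400 * campaign_days        # ~5.2 B fetches
storage_tb = pages_total * dedup_keep_ratio * avg_page_kb / 1024**3  # ~338 TB

print(f"{pages_per_s:,} pages/s, {bandwidth_mb_s:.0f} MB/s, "
      f"{pages_total:,} fetches, ~{storage_tb:.0f} TB stored")
```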
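Sketch 10: a pre-fetch guard combining allow/block lists with a global kill-switch. The flag source (a plain set and bool here) is an assumption; in practice it would be a fast, centrally updated control-plane value every worker polls.

```python
# Sketch 10: allow/block lists plus a global kill-switch (illustrative).
BLOCKED_HOSTS = {"do-not-crawl.example"}
ALLOWED_SUFFIXES = (".com", ".org")   # an empty tuple would mean "allow all"
kill_switch = False                   # flip to halt all new fetches at once

def may_fetch(host: str) -> bool:
    if kill_switch or host in BLOCKED_HOSTS:
        return False
    return not ALLOWED_SUFFIXES or host.endswith(ALLOWED_SUFFIXES)

print(may_fetch("example.com"))           # True
print(may_fetch("do-not-crawl.example"))  # False
```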
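Sketch 11: illustrative API shapes for the three operations, shown as Python dicts (the JSON bodies would match). Endpoint paths and field names are assumptions.

```python
# Sketch 11: enqueue, status, and result shapes (hypothetical endpoints).
enqueue_request = {            # POST /v1/urls
    "urls": ["http://example.com/a"],
    "priority_hint": 0.5,
    "campaign_id": "camp-2024-01",
}
enqueue_response = {"accepted": 1, "duplicates": 0, "rejected": []}

status_response = {            # GET /v1/urls/status?url=...
    "url": "http://example.com/a",
    "state": "done",           # queued | leased | done | dead
    "attempts": 1,
    "last_status": 200,
    "fetched_at": "2024-01-01T00:00:00Z",
}

result_response = {            # GET /v1/urls/result?url=...
    "url": "http://example.com/a",
    "content_sha256": "9f86d0...",
    "blob_url": "https://crawl-store.example/blobs/9f86d0...",  # signed URL
    "links_found": 42,
}
```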
Deliverables
- High-level architecture with core components and data flows.
- Algorithms/policies for the items above.
- API shapes and data models (concise JSON examples are fine).
- Back-of-the-envelope capacity estimates with assumptions.
- Explicit assumptions and guardrails.