Distributed Web Crawler: Design for 1,000 Devices
Context
Design a production-ready web crawler that starts from a single seed URL and scales crawling across 1,000 heterogeneous devices. The crawler should respect robots.txt and per-host politeness constraints, deduplicate URLs/content, and persist pages and metadata.
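One possible way to handle the robots.txt requirement is Python's standard `urllib.robotparser`. The sketch below parses a robots.txt body that has already been fetched. The helper `build_robots_parser`, the user agent `example-crawler`, and the sample rules are illustrative assumptions, not part of the prompt.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from an already-downloaded robots.txt body (no network I/O)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_robots_parser(EXAMPLE_ROBOTS)

# Check whether a given user agent may fetch a URL.
allowed = parser.can_fetch("example-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("example-crawler", "https://example.com/private/a.html")

# Crawl-delay, if present, can feed the per-host politeness interval.
delay = parser.crawl_delay("example-crawler")
```

In a real crawler the parsed rules would likely be cached per host with an expiry, so that each device does not refetch robots.txt for every URL.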
Requirements
- Start from one seed link and discover new URLs recursively.
- Distribute crawling across ~1,000 devices.
- Address:
  - Coordination of work and state
  - Load balancing and throttling
  - Fault tolerance and recovery
  - Scalability and typical follow-ups
- Assume an internet-scale target with diverse domains and varying latency.
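As a minimal sketch of the coordination and load-balancing requirements, URLs can be partitioned by host so that one device owns each host's politeness budget. The example uses rendezvous (highest-random-weight) hashing, which is one option among several (consistent hashing with virtual nodes is another). The device IDs and the function names `host_of` and `assign_device` are assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname, used as the partition key."""
    return (urlsplit(url).hostname or "").lower()


def _score(host: str, device_id: str) -> int:
    """Deterministic pseudo-random weight for a (host, device) pair."""
    digest = hashlib.sha256(f"{host}|{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_device(url: str, live_devices: list[str]) -> str:
    """Pick the live device with the highest weight for this URL's host."""
    if not live_devices:
        raise ValueError("no live devices available")
    host = host_of(url)
    return max(live_devices, key=lambda device_id: _score(host, device_id))


# Hypothetical device identifiers for a 1,000-device fleet.
devices = [f"device-{i:04d}" for i in range(1000)]

owner = assign_device("https://example.com/a", devices)
same_owner = assign_device("https://example.com/b?x=1", devices)  # same host, same owner
```

With this scheme, removing a failed device only reassigns the hosts that device owned; every other host keeps its owner, which limits the state that has to move during recovery.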
Deliverables
- High-level architecture and data flow
- How URLs are assigned, deduplicated, and scheduled
- Policies for robots.txt, per-host rate limits, retries
- Storage approach for frontier state and fetched content
- Specific mechanisms for coordination, load balancing, fault tolerance, and scaling
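The deduplication and per-host scheduling deliverables can be sketched as a small in-memory frontier. This is a single-process illustration under stated assumptions: the class name `Frontier`, the fixed delay, and the simplified `normalize` function are hypothetical, and a production system would likely back the seen-set and queues with durable, sharded storage.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Simplified normalization: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class Frontier:
    """Per-host FIFO queues plus a heap of hosts ordered by next allowed fetch time."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.seen = set()        # normalized URLs already enqueued
        self.queues = {}         # host -> deque of pending URLs
        self.next_allowed = {}   # host -> earliest time of the next fetch
        self.ready = []          # heap of (earliest fetch time, host)

    def add(self, url: str, now: float) -> bool:
        """Enqueue a URL unless it was seen before. Returns True if enqueued."""
        url = normalize(url)
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        queue = self.queues.setdefault(host, deque())
        if not queue:
            # A host is on the heap only while it has pending URLs.
            when = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready, (when, host))
        queue.append(url)
        return True

    def next_url(self, now: float):
        """Return a URL whose host may be fetched at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.queues[host]
        url = queue.popleft()
        self.next_allowed[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        return url
```

A caller passes in the current time explicitly (for example from `time.monotonic()`), which keeps the scheduler deterministic and easy to test. The per-host delay could be replaced by the Crawl-delay value from robots.txt where one is given.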
Constraints & Assumptions
- Preserve the scope, facts, inputs, and requested outputs from the prompt above.
- If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
- Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
- Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
- State explicit assumptions before making sizing or architecture decisions.
- Prioritize the functional path first, then address reliability, security, observability, and rollout.
What a Strong Answer Covers
- A scoped requirements summary with concrete non-goals and success metrics.
- API, data model, architecture, consistency, capacity, and operations.
- Reasoned trade-offs between simple and scalable designs, including bottlenecks and failure modes.
- A validation, monitoring, migration, and launch plan appropriate for the risk level.
Follow-up Questions
- What breaks first at 10x traffic or data volume?
- How would you degrade gracefully during dependency failures?
- What metrics and alerts would prove the design is healthy after launch?