Distributed Web Crawler: Design for 1,000 Devices
Context
Design a production-ready web crawler that starts from a single seed URL and scales crawling across 1,000 heterogeneous devices. The crawler should respect robots.txt and per-host politeness constraints, deduplicate URLs/content, and persist pages and metadata.
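To ground the brief, here is a minimal single-node sketch of the crawl loop, assuming a hypothetical seed URL (`example.com`), a crude regex link extractor, and in-memory frontier and dedup structures; each of these pieces is what the distributed design replaces with shared services. It is an illustration of the flow, not a prescribed implementation.

```python
# Minimal single-node crawl loop: seed -> robots.txt check -> fetch ->
# dedup -> enqueue discovered links. All names and limits are illustrative.
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

SEED = "https://example.com/"          # hypothetical seed URL
LINK_RE = re.compile(r'href="(.*?)"')  # crude link extraction, for illustration only

def allowed(url: str) -> bool:
    """Check robots.txt for the URL's host (re-fetched per call here for
    brevity; a real crawler caches the parsed rules per host)."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False
    return rp.can_fetch("*", url)

def crawl(seed: str, limit: int = 50) -> None:
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URL-level deduplication
    while frontier and limit > 0:
        url = frontier.popleft()
        if not allowed(url):
            continue
        try:
            with urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue           # a real crawler would schedule a retry instead
        limit -= 1
        # Persisting the page and metadata is elided; print as a stand-in.
        print(f"fetched {url} ({len(body)} bytes)")
        for href in LINK_RE.findall(body):
            nxt = urljoin(url, href)
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)

if __name__ == "__main__":
    crawl(SEED)
```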
Requirements
- Start from one seed link and discover new URLs recursively.
- Distribute crawling across ~1,000 devices (see the assignment sketch after this list).
- Address:
  - Coordination of work and state
  - Load balancing and throttling
  - Fault tolerance and recovery
  - Scalability and typical follow-ups
- Assume an internet-scale target with diverse domains and varying latency.
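To make the distribution requirement concrete, the sketch below shows one common URL-assignment scheme, rendezvous hashing on the URL's host; the device IDs and the choice of hashing are assumptions for illustration, not part of the brief. Hashing by host keeps every URL of a host on a single device, which also localizes that host's politeness state.

```python
# Rendezvous (highest-random-weight) hashing of hosts onto hypothetical
# device IDs "device-0000" .. "device-0999".
import hashlib
from urllib.parse import urlparse

DEVICES = [f"device-{i:04d}" for i in range(1000)]  # hypothetical device IDs

def _score(device: str, host: str) -> int:
    """Deterministic per-(device, host) score from a SHA-1 prefix."""
    return int.from_bytes(hashlib.sha1(f"{device}|{host}".encode()).digest()[:8], "big")

def owner(url: str) -> str:
    """Return the device responsible for this URL's host: the device with the
    highest score wins, so removing a device only reassigns the hosts it owned."""
    host = urlparse(url).netloc.lower()
    return max(DEVICES, key=lambda d: _score(d, host))

if __name__ == "__main__":
    for u in ("https://example.com/a", "https://example.com/b", "https://example.org/"):
        print(u, "->", owner(u))
```

A consistent-hashing ring would serve equally well; rendezvous hashing is shown only because it fits in a few lines, at the cost of scoring every device per lookup.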
Deliverables
- High-level architecture and data flow
- How URLs are assigned, deduplicated, and scheduled
- Policies for robots.txt, per-host rate limits, and retries (a politeness sketch follows this list)
- Storage approach for frontier state and fetched content
- Specific mechanisms for coordination, load balancing, fault tolerance, and scaling
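As a starting point for the politeness and retry deliverable, the sketch below assumes a default spacing of one request per second per host and exponential backoff; `fetch` is a caller-supplied callable, and the delays and attempt counts are illustrative rather than prescribed by the brief.

```python
# Per-host politeness spacing plus retry-with-backoff for transient errors.
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay: float = 1.0) -> None:
        self.min_delay = min_delay          # assumed default: 1 request/sec/host
        self.next_ok: dict[str, float] = {}

    def wait_turn(self, url: str) -> None:
        """Block until the URL's host is allowed another request, then
        reserve the next slot."""
        host = urlparse(url).netloc.lower()
        now = time.monotonic()
        ready_at = self.next_ok.get(host, now)
        if ready_at > now:
            time.sleep(ready_at - now)
        self.next_ok[host] = max(ready_at, now) + self.min_delay

def fetch_with_retries(url: str, fetch, scheduler: PolitenessScheduler,
                       attempts: int = 3, backoff: float = 2.0):
    """Fetch through the scheduler, retrying transient failures with
    exponential backoff (2s, 4s, ...); the final failure is re-raised."""
    for attempt in range(attempts):
        scheduler.wait_turn(url)
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```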