System Design: Scalable Web Crawler
Context
Design a production-ready web crawler that discovers and downloads publicly accessible web pages at internet scale. Your design should support continual discovery, politeness (respect for publishers), and high throughput while avoiding duplicates and crawler traps.
Assume we start with a list of seed URLs and aim to crawl and recrawl billions of pages over time. The crawler should be modular so it can run on a single machine for small jobs and scale out to a distributed cluster.
Requirements
- Architecture
  - Define core components: URL frontier, fetchers, parsers, storage, metadata/indexing, and coordination (a minimal frontier sketch appears after this list).
  - Include DNS resolution, connection management, and content-type handling.
- Robots and Politeness
  - How to fetch and cache robots.txt; obey user-agent rules and crawl-delay directives (see the robots-cache sketch below).
  - Per-host/per-domain rate limiting and connection concurrency.
- Deduplication
  - URL deduplication via canonicalization and a global "seen" structure (see the canonicalization sketch below).
  - Content deduplication (exact and near-duplicate pages).
- Prioritization and Scheduling
  - How to prioritize which URLs to crawl next (e.g., by depth, quality, freshness, and domain budgets).
  - Recrawl scheduling for freshness (see the recrawl-scheduling sketch below).
- Storage and Metadata
  - Where to store raw content (blobs) and structured metadata: fetch status, fingerprints, the link graph, and the robots cache. See the metadata-record sketch below.
- Scale and Throughput Targets
  - Make reasonable assumptions (e.g., 100M initial URLs, a target of ~10k fetches/sec) and reflect them in your choices; a back-of-envelope calculation follows the list.
- Follow-up: Concurrency and Distribution
  - Extend the design to multithreaded and multi-machine operation (see the distribution sketch below).
  - Explain concurrency controls, per-host rate limiting, back-pressure, fault tolerance, and processing semantics (exactly-once vs. at-least-once).
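Illustrative Sketches
The sketches below are non-normative illustrations of the requirements above, written in Python; class, function, and parameter names are assumptions, and each shows one reasonable shape for a component rather than a required implementation.

Frontier sketch: a minimal in-process URL frontier that keeps one FIFO queue per host and rotates across hosts, so no single site can dominate fetch order. Politeness delays and priority tiers (e.g., a Mercator-style split into front priority queues and back per-host queues) would be layered on top.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional
from urllib.parse import urlparse


class UrlFrontier:
    """Per-host FIFO queues with round-robin rotation across hosts."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = defaultdict(deque)  # host -> URLs
        self._hosts: Deque[str] = deque()                         # rotation order

    def add(self, url: str) -> None:
        host = urlparse(url).netloc.lower()
        if not self._queues[host]:
            self._hosts.append(host)      # first URL for this host: join rotation
        self._queues[host].append(url)

    def next_url(self) -> Optional[str]:
        while self._hosts:
            host = self._hosts.popleft()
            queue = self._queues[host]
            if not queue:
                continue                  # host drained since it was queued
            url = queue.popleft()
            if queue:
                self._hosts.append(host)  # host still has work: rotate to the back
            return url
        return None
```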
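Robots-cache sketch: fetch each origin's robots.txt once, cache the parsed rules with a TTL, and consult them before every fetch. It uses Python's standard urllib.robotparser; the TTL, user-agent string, and default delay are placeholders.

```python
import time
from typing import Dict, Tuple
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 24 * 3600          # assumed refresh interval
USER_AGENT = "example-crawler/0.1"      # placeholder user-agent string

_robots_cache: Dict[str, Tuple[RobotFileParser, float]] = {}


def _robots_for(url: str) -> RobotFileParser:
    parts = urlparse(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    cached = _robots_cache.get(origin)
    if cached and time.time() - cached[1] < ROBOTS_TTL_SECONDS:
        return cached[0]
    parser = RobotFileParser(origin + "/robots.txt")
    parser.read()                        # network fetch; real code handles errors/timeouts
    _robots_cache[origin] = (parser, time.time())
    return parser


def allowed(url: str) -> bool:
    return _robots_for(url).can_fetch(USER_AGENT, url)


def politeness_delay(url: str) -> float:
    # Honor Crawl-delay when present, otherwise fall back to a default gap.
    delay = _robots_for(url).crawl_delay(USER_AGENT)
    return float(delay) if delay else 1.0
```

A production crawler would also decide how to treat robots.txt fetch failures and would persist this cache in the shared metadata store rather than in process memory.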
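Canonicalization sketch: normalize a URL, then test membership in a "seen" structure before enqueueing. The in-memory set stands in for the Bloom filter or sharded key-value store a large crawl would use, and the rules shown (lowercase scheme and host, drop default ports and fragments) are a minimal subset of real canonicalization.

```python
from typing import Set
from urllib.parse import urlsplit, urlunsplit

_seen: Set[str] = set()   # stands in for a Bloom filter / sharded KV store


def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep explicit non-default ports; drop default ones and the fragment.
    if parts.port and (scheme, parts.port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def seen_before(url: str) -> bool:
    """True if this canonical URL was already enqueued; records it otherwise."""
    key = canonicalize(url)
    if key in _seen:
        return True
    _seen.add(key)
    return False
```

Exact content duplicates can be caught with a hash of the page body, and near-duplicates with a locality-sensitive fingerprint such as SimHash, stored alongside the URL metadata.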
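Recrawl-scheduling sketch: a due-time min-heap whose per-URL revisit interval halves when a page is observed to change and doubles when it does not. The interval bounds, adjustment factors, and record fields are assumptions; quality, depth, and domain budgets could be folded into the ordering key alongside due time.

```python
import heapq
import time
from dataclasses import dataclass, field
from typing import List, Optional

MIN_INTERVAL = 3600.0          # never recrawl more often than hourly (assumed)
MAX_INTERVAL = 30 * 86400.0    # never wait longer than ~a month (assumed)


@dataclass(order=True)
class ScheduledUrl:
    due_at: float                                    # heap key: when to fetch next
    url: str = field(compare=False)
    interval: float = field(compare=False, default=86400.0)


_heap: List[ScheduledUrl] = []


def schedule(url: str, changed: bool, prev_interval: float = 86400.0) -> None:
    # Adaptive revisit policy: shrink the interval for pages that changed,
    # grow it for pages that did not.
    factor = 0.5 if changed else 2.0
    interval = min(MAX_INTERVAL, max(MIN_INTERVAL, prev_interval * factor))
    heapq.heappush(_heap, ScheduledUrl(time.time() + interval, url, interval))


def pop_due() -> Optional[ScheduledUrl]:
    """Next URL whose recrawl time has arrived, or None."""
    if _heap and _heap[0].due_at <= time.time():
        return heapq.heappop(_heap)
    return None
```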
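Metadata-record sketch: the per-URL row a crawler might keep alongside raw bytes in a blob store. Field names are illustrative; in production this maps onto a wide-column or key-value store keyed by the canonical URL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UrlRecord:
    canonical_url: str
    last_fetch_at: Optional[float] = None   # epoch seconds of last attempt
    http_status: Optional[int] = None       # status of last fetch
    content_sha256: Optional[str] = None    # exact-duplicate fingerprint
    simhash: Optional[int] = None           # near-duplicate fingerprint
    blob_key: Optional[str] = None          # where raw bytes live in the blob store
    outlinks: List[str] = field(default_factory=list)  # edges for the link graph
    recrawl_interval: float = 86400.0       # seconds until next scheduled fetch
```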
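Back-of-envelope calculation for the stated targets (100M seed URLs, ~10k fetches/sec), with an assumed ~100 KB average page size.

```python
# All inputs are assumptions taken from (or added to) the stated targets.
FETCH_RATE = 10_000             # pages fetched per second, whole cluster
AVG_PAGE_BYTES = 100_000        # assumed ~100 KB average page
SEED_URLS = 100_000_000

pages_per_day = FETCH_RATE * 86_400                   # 864,000,000 pages/day
ingress_bytes_per_sec = FETCH_RATE * AVG_PAGE_BYTES   # 1.0 GB/s (~8 Gbit/s)
raw_bytes_per_day = pages_per_day * AVG_PAGE_BYTES    # ~86 TB/day before compression
hours_for_seed_pass = SEED_URLS / FETCH_RATE / 3600   # ~2.8 hours for the seed list

print(f"{pages_per_day:,} pages/day, "
      f"{ingress_bytes_per_sec / 1e9:.1f} GB/s ingress, "
      f"{raw_bytes_per_day / 1e12:.0f} TB/day raw, "
      f"{hours_for_seed_pass:.1f} h for the first seed pass")
```

At roughly 1 GB/s of ingress and tens of terabytes of raw content per day, compression, content deduplication, and a distributed blob store follow directly from the numbers.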
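Distribution sketch: assign every host to exactly one worker by hashing the host name, so per-host politeness state never needs cross-machine coordination; within a worker, a semaphore bounds in-flight fetches as a simple back-pressure control. Worker count and the concurrency cap are assumptions, and the fetch body is a placeholder.

```python
import asyncio
import hashlib
from typing import Iterable
from urllib.parse import urlparse

NUM_WORKERS = 64          # assumed cluster size
MAX_IN_FLIGHT = 200       # assumed per-worker concurrency cap


def worker_for(url: str) -> int:
    """Stable host -> worker assignment (hash mod N here; consistent hashing in practice)."""
    host = urlparse(url).netloc.lower().encode()
    return int.from_bytes(hashlib.sha1(host).digest()[:8], "big") % NUM_WORKERS


async def crawl_worker(urls: Iterable[str]) -> None:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)   # back-pressure: bound in-flight fetches

    async def fetch_one(url: str) -> None:
        async with sem:
            await asyncio.sleep(0)           # placeholder for HTTP fetch + parse + store

    await asyncio.gather(*(fetch_one(u) for u in urls))
```

A common operational choice is at-least-once delivery from the frontier with idempotent downstream processing (dedup by canonical URL and content fingerprint) rather than attempting exactly-once semantics.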
Deliverables: A clear architecture, key data structures and algorithms, scheduling logic, and operational considerations.