This question evaluates understanding of scalable web crawler architecture, distributed systems concepts, URL and content deduplication, scheduling and prioritization, storage and metadata design, and operational concerns such as politeness, rate limiting, DNS/connection management, and fault tolerance.

Design a production-ready web crawler that discovers and downloads publicly accessible web pages at internet scale. Your design should support continual discovery, politeness (respecting robots.txt and per-host rate limits), and high throughput while avoiding duplicate fetches and crawler traps.
Assume we start with a list of seed URLs and aim to crawl and recrawl billions of pages over time. The crawler should be modular so it can run on a single machine for small jobs and scale out to a distributed cluster.
Deliverables: A clear architecture, key data structures and algorithms, scheduling logic, and operational considerations.
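To make the core data structures concrete, here is a minimal, single-machine sketch of the crawl loop: a per-host frontier with a politeness delay, a normalized-URL seen set for deduplication, and a crude depth cap as a guard against crawler traps. All names (Frontier, crawl, POLITENESS_DELAY, fetch_and_extract) and parameter values are assumptions made for this illustration, not part of the question; the fetch step is stubbed out so the skeleton runs without network access.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urlparse

# Illustrative constants; real values come from robots.txt / crawl policy.
POLITENESS_DELAY = 1.0   # seconds between requests to the same host (assumption)
MAX_DEPTH = 5            # crude guard against crawler traps (assumption)


def normalize(url: str) -> str:
    """Canonicalize a URL so duplicates map to the same key."""
    url, _fragment = urldefrag(url)      # drop #fragments
    return url.rstrip("/").lower()


class Frontier:
    """Per-host FIFO queues plus a last-fetch timestamp for politeness."""

    def __init__(self) -> None:
        self.queues: dict[str, deque] = {}
        self.last_fetch: dict[str, float] = {}

    def add(self, url: str, depth: int) -> None:
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append((url, depth))

    def next_ready(self):
        """Return a (url, depth) whose host is past its politeness delay, else None."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now - self.last_fetch.get(host, 0.0) >= POLITENESS_DELAY:
                self.last_fetch[host] = now
                return queue.popleft()
        return None


def fetch_and_extract(url: str) -> list[str]:
    """Stub for the HTTP fetch + link extraction step (network omitted here)."""
    return []


def crawl(seeds: list[str]) -> None:
    frontier = Frontier()
    seen: set[str] = set()
    for seed in seeds:
        frontier.add(normalize(seed), depth=0)

    while True:
        item = frontier.next_ready()
        if item is None:
            break  # a real crawler would sleep/poll instead of stopping
        url, depth = item
        if url in seen or depth > MAX_DEPTH:
            continue
        seen.add(url)
        for link in fetch_and_extract(url):
            frontier.add(normalize(link), depth + 1)


if __name__ == "__main__":
    crawl(["https://example.com/"])
```

In a distributed design, the in-memory seen set and frontier would be replaced by partitioned services (for example, a hash-partitioned frontier and a shared deduplication store), but the per-host politeness check and URL normalization shown here carry over largely unchanged.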