Problem
Design a web crawler that starts from one or more seed URLs and continuously discovers and fetches pages.
Requirements
- Inputs: One or more seed URLs.
- Outputs: Fetched page contents and metadata stored for later indexing/analysis.
- Core goals (a minimal single-process sketch follows this list):
  - Crawl at scale (large number of pages).
  - Avoid crawling the same URL repeatedly (deduplication).
  - Be polite: respect per-host rate limits and robots.txt.
  - Be fault-tolerant (workers can crash; crawl should continue).
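A minimal single-process sketch of these requirements, using only the standard library. Assumptions not stated in the problem: an in-memory frontier and visited set, a fixed one-second per-host delay, and placeholder values for the user agent, page limit, and seed URL.

```python
# Single-process crawler sketch: frontier queue, dedup set, robots.txt
# checks, and a simple per-host rate limit. Assumptions: in-memory state,
# standard library only, fixed 1-second per-host delay.
import collections
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

USER_AGENT = "example-crawler/0.1"   # placeholder user agent
PER_HOST_DELAY = 1.0                 # seconds between requests to one host

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = collections.deque(seed_urls)   # URLs waiting to be fetched
    visited = set(seed_urls)                  # dedup: never enqueue a URL twice
    last_fetch = {}                           # host -> time of last request
    robots = {}                               # host -> cached RobotFileParser
    pages = {}                                # url -> (status, body), the "stored output"

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        host = urllib.parse.urlparse(url).netloc

        # Politeness 1: robots.txt, fetched once per host and cached.
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
            try:
                rp.read()
            except OSError:
                # If robots.txt is unreachable, can_fetch() keeps returning
                # False for this host, so the host is skipped (conservative).
                pass
            robots[host] = rp
        if not robots[host].can_fetch(USER_AGENT, url):
            continue

        # Politeness 2: per-host rate limit.
        wait = PER_HOST_DELAY - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_fetch[host] = time.time()

        # Fetch and store the page; a failed request is skipped, not fatal.
        try:
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
                pages[url] = (resp.status, body)
        except OSError:
            continue

        # Discover new links and enqueue only the unseen ones.
        extractor = LinkExtractor()
        extractor.feed(body)
        for href in extractor.links:
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)

    return pages

if __name__ == "__main__":
    results = crawl(["https://example.com/"], max_pages=10)
    print(f"fetched {len(results)} pages")
```

Fault tolerance here is limited to skipping failed requests; surviving a crashed worker requires externalizing the frontier and visited set, which is the subject of the follow-up.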
Follow-up
How would you redesign/optimize the crawler when you have multiple servers (many crawler workers)? You don’t need to implement code—describe the architecture and key data structures/services.
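One common (but not the only) answer is to partition the URL space by host, so each worker owns a disjoint set of hosts and can keep robots.txt caching and per-host rate limiting entirely local. The sketch below, assuming workers numbered 0..num_workers-1 and some external per-worker queue (Kafka partitions or Redis lists are mentioned only as examples), shows the routing function that makes that partitioning deterministic.

```python
# Host-based partitioning sketch for a multi-worker crawler. Assumption:
# each worker runs a loop like the single-process sketch above against
# its own queue; this module only decides which worker owns which URL.
import hashlib
import urllib.parse

def owner_worker(url: str, num_workers: int) -> int:
    """Map a URL to the worker responsible for its host.

    Hashing the host (not the full URL) keeps every URL of one site on
    one worker, so per-host politeness needs no cross-worker coordination.
    """
    host = urllib.parse.urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

def route(discovered_urls, num_workers):
    """Group newly discovered URLs by the worker that should crawl them.

    In a real deployment each group would be published to that worker's
    queue; here the grouping is simply returned.
    """
    buckets = {i: [] for i in range(num_workers)}
    for url in discovered_urls:
        buckets[owner_worker(url, num_workers)].append(url)
    return buckets

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b",
            "https://example.org/x"]
    print(route(urls, num_workers=4))
```

With this split, deduplication and crawl state move out of worker memory into shared services (a key-value store or Bloom filter for the visited set, durable queues for the frontier), so a crashed worker's hosts can be reassigned and the crawl continues.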