Problem
Design a web crawler that starts from one or more seed URLs and continuously discovers and fetches pages.
Requirements
- Inputs: One or more seed URLs.
- Outputs: Fetched page contents and metadata stored for later indexing/analysis.
- Core goals (a minimal single-process sketch follows this list):
  - Crawl at scale (large number of pages).
  - Avoid crawling the same URL repeatedly (deduplication).
  - Be polite: respect per-host rate limits and robots.txt.
  - Be fault-tolerant (workers can crash; crawl should continue).
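A minimal single-process sketch of these requirements, using only the standard library. Assumptions not stated in the problem: an in-memory frontier and visited set, a fixed one-second per-host delay, and placeholder values for the user agent, page limit, and seed URL.

```python
# Single-process crawler sketch: frontier queue, dedup set, robots.txt
# checks, and a simple per-host rate limit. Assumptions: in-memory state,
# standard library only, fixed 1-second per-host delay.
import collections
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

USER_AGENT = "example-crawler/0.1"   # placeholder user agent
PER_HOST_DELAY = 1.0                 # seconds between requests to one host

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = collections.deque(seed_urls)   # URLs waiting to be fetched
    visited = set(seed_urls)                  # dedup: never enqueue a URL twice
    last_fetch = {}                           # host -> time of last request
    robots = {}                               # host -> cached RobotFileParser
    pages = {}                                # url -> (status, body), the "stored output"

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        host = urllib.parse.urlparse(url).netloc

        # Politeness 1: robots.txt, fetched once per host and cached.
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
            try:
                rp.read()
            except OSError:
                # If robots.txt is unreachable, can_fetch() keeps returning
                # False for this host, so the host is skipped (conservative).
                pass
            robots[host] = rp
        if not robots[host].can_fetch(USER_AGENT, url):
            continue

        # Politeness 2: per-host rate limit.
        wait = PER_HOST_DELAY - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_fetch[host] = time.time()

        # Fetch and store the page; a failed request is skipped, not fatal.
        try:
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
                pages[url] = (resp.status, body)
        except OSError:
            continue

        # Discover new links and enqueue only the unseen ones.
        extractor = LinkExtractor()
        extractor.feed(body)
        for href in extractor.links:
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)

    return pages

if __name__ == "__main__":
    results = crawl(["https://example.com/"], max_pages=10)
    print(f"fetched {len(results)} pages")
```

Fault tolerance here is limited to skipping failed requests; surviving a crashed worker requires externalizing the frontier and visited set, which is the subject of the follow-up.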
Follow-up
How would you redesign/optimize the crawler when you have multiple servers (many crawler workers)? You don’t need to implement code—describe the architecture and key data structures/services.
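One common (but not the only) answer is to partition the URL space by host, so each worker owns a disjoint set of hosts and can keep robots.txt caching and per-host rate limiting entirely local. The sketch below, assuming workers numbered 0..num_workers-1 and some external per-worker queue (Kafka partitions or Redis lists are mentioned only as examples), shows the routing function that makes that partitioning deterministic.

```python
# Host-based partitioning sketch for a multi-worker crawler. Assumption:
# each worker runs a loop like the single-process sketch above against
# its own queue; this module only decides which worker owns which URL.
import hashlib
import urllib.parse

def owner_worker(url: str, num_workers: int) -> int:
    """Map a URL to the worker responsible for its host.

    Hashing the host (not the full URL) keeps every URL of one site on
    one worker, so per-host politeness needs no cross-worker coordination.
    """
    host = urllib.parse.urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

def route(discovered_urls, num_workers):
    """Group newly discovered URLs by the worker that should crawl them.

    In a real deployment each group would be published to that worker's
    queue; here the grouping is simply returned.
    """
    buckets = {i: [] for i in range(num_workers)}
    for url in discovered_urls:
        buckets[owner_worker(url, num_workers)].append(url)
    return buckets

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b",
            "https://example.org/x"]
    print(route(urls, num_workers=4))
```

With this split, deduplication and crawl state move out of worker memory into shared services (a key-value store or Bloom filter for the visited set, durable queues for the frontier), so a crashed worker's hosts can be reassigned and the crawl continues.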