System Design: Concurrent Web Crawler (Threads)
You are asked to design and implement a basic web crawler that fetches pages concurrently using a thread executor. The crawler should be production-conscious (correctness, robustness, and observability) while remaining reasonably simple.

Requirements

Input
- Accept one or more seed URLs.
- Optional flag to restrict crawling to the same origin as the seeds (scheme, host, port).
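
For illustration, the inputs and tunables could be collected in a single configuration object. This is a minimal sketch; every name below (CrawlerConfig, same_origin_only, and so on) is an assumption for illustration, not something the spec prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    seeds: tuple[str, ...]                  # one or more seed URLs
    same_origin_only: bool = False          # restrict crawling to the seeds' scheme/host/port
    max_workers: int = 8                    # thread executor size
    max_depth: int = 3                      # crawl depth cap from each seed
    per_host_delay: float = 1.0             # minimum seconds between requests to one host
    request_timeout: float = 10.0           # per-request timeout in seconds
    user_agent: str = "example-crawler/0.1" # illustrative user-agent string
    allowed_content_types: tuple[str, ...] = ("text/html",)


# Example usage:
config = CrawlerConfig(seeds=("https://example.com/",), same_origin_only=True)
```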

Crawling behavior
- Fetch pages concurrently using a thread executor with a configurable max worker count.
- Cap crawl depth from each seed.
- Extract links from HTML pages and enqueue newly discovered URLs.
- Normalize and resolve links robustly (relative links, fragments, default ports, casing, etc.).
- Avoid revisiting the same normalized URL (dedup across in-queue and visited).
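
A minimal sketch of how these items could fit together, assuming helper callables fetch_page(url) -> (final_url, html_or_None) and extract_links(html, base_url) that the caller supplies; all names here are illustrative, not a prescribed design. Only the workers run concurrently; the loop itself mutates the frontier and visited set from a single thread, so no extra locking is needed in this shape.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(base: str, link: str) -> str:
    """Resolve a possibly relative link against its base, then normalize it."""
    parts = urlsplit(urljoin(base, link))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the port when it is the scheme's default (http:80, https:443).
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Keep the query string but strip the fragment.
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def crawl(seeds, fetch_page, extract_links, max_depth=3, max_workers=8):
    """Breadth-first crawl in depth "waves"; fetching and parsing are injected."""
    visited = {normalize_url(seed, "") for seed in seeds}   # enqueued-or-fetched URLs
    frontier = [(url, 0) for url in visited]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            batch, frontier = frontier, []                  # process one depth wave at a time
            futures = {pool.submit(fetch_page, url): depth for url, depth in batch}
            for future in as_completed(futures):
                depth = futures[future]
                try:
                    final_url, html = future.result()
                except Exception:
                    continue                                # fetch errors are logged/handled elsewhere
                if html is None or depth >= max_depth:
                    continue
                for href in extract_links(html, final_url):
                    candidate = normalize_url(final_url, href)
                    if candidate not in visited:
                        visited.add(candidate)              # dedup at enqueue time
                        frontier.append((candidate, depth + 1))
    return visited
```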

Compliance and politeness
- Respect robots.txt (allow/disallow rules per user-agent; cache per host; honor crawl-delay if present).
- Per-host politeness/rate limiting (e.g., at most 1 request per host per X seconds, configurable; honor Retry-After on 429/503).
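
One possible shape for robots.txt caching plus per-host spacing, using the standard library's urllib.robotparser; the class name HostPoliteness and its methods are illustrative assumptions, and the Retry-After handling is left to the fetch path.

```python
import threading
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class HostPoliteness:
    """Caches robots.txt per host and enforces a per-host minimum request spacing."""

    def __init__(self, user_agent: str, default_delay: float = 1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self._robots: dict[str, RobotFileParser] = {}   # robots.txt cache, keyed by host
        self._next_slot: dict[str, float] = {}          # earliest allowed request time per host
        self._lock = threading.Lock()

    def _robots_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        host = parts.netloc
        with self._lock:
            parser = self._robots.get(host)
        if parser is None:
            parser = RobotFileParser(f"{parts.scheme}://{host}/robots.txt")
            try:
                parser.read()                           # fetch and parse robots.txt once per host
            except OSError:
                parser.parse([])                        # unreachable robots.txt: treat as allow-all
            with self._lock:
                self._robots[host] = parser
        return parser

    def allowed(self, url: str) -> bool:
        return self._robots_for(url).can_fetch(self.user_agent, url)

    def wait_turn(self, url: str) -> None:
        """Block until this URL's host may be contacted again, honoring crawl-delay."""
        parser = self._robots_for(url)
        delay = parser.crawl_delay(self.user_agent)
        if delay is None:
            delay = self.default_delay
        host = urlsplit(url).netloc
        with self._lock:
            now = time.monotonic()
            ready_at = max(self._next_slot.get(host, now), now)
            self._next_slot[host] = ready_at + delay    # reserve the next slot for this host
        time.sleep(max(0.0, ready_at - now))
```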

Networking
- Handle redirects (update to the final URL; dedup on the normalized final URL).
- Handle HTTP errors and timeouts gracefully (do not crash; back off when appropriate).
- Filter by content type (e.g., only text/html by default).
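
A sketch of a single fetch using only the standard library, matching the fetch_page(url) -> (final_url, html_or_None) shape assumed in the crawl-loop sketch above; urlopen follows redirects and reports the final URL via geturl(). Backoff and retry scheduling are left to the caller.

```python
import urllib.error
import urllib.request


def fetch_page(url: str, user_agent: str = "example-crawler/0.1",
               timeout: float = 10.0) -> tuple[str, str | None]:
    """Return (final_url, html); html is None for non-HTML responses or failed fetches."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            final_url = resp.geturl()                         # URL after any redirects
            if resp.headers.get_content_type() != "text/html":
                return final_url, None                        # content-type filter
            charset = resp.headers.get_content_charset() or "utf-8"
            return final_url, resp.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # On 429/503, err.headers.get("Retry-After") says when to re-enqueue;
        # this sketch simply skips the page instead of scheduling a retry.
        return url, None
    except OSError:
        return url, None                                      # timeouts and connection errors: skip
```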

Data structures and strategy
- Describe the frontier and visited set data structures.
- Describe the duplicate detection strategy (including enqueued vs. fetched URLs and redirects).
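
One way to realize the frontier and visited set as a single thread-safe structure: dedup happens when a URL is enqueued, and again when a redirect reveals a different final URL. The Frontier class below is an illustrative sketch, not a required design.

```python
import collections
import threading


class Frontier:
    def __init__(self):
        self._queue = collections.deque()   # FIFO of (normalized_url, depth)
        self._seen = set()                  # every normalized URL ever enqueued or fetched
        self._lock = threading.Lock()

    def add(self, url: str, depth: int) -> bool:
        """Enqueue url unless it was already enqueued or fetched; return True if added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._queue.append((url, depth))
            return True

    def pop(self):
        """Return the next (url, depth) pair, or None when the queue is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None

    def mark_final(self, final_url: str) -> bool:
        """Record a post-redirect URL; return False if it was already seen (skip processing)."""
        with self._lock:
            if final_url in self._seen:
                return False
            self._seen.add(final_url)
            return True
```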

Testing and monitoring
- Explain how you would test the crawler (unit, integration, concurrency, and fault-injection tests).
- Describe what you would monitor/measure in a real run (metrics, logs, alerts).
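
Example unit and concurrency tests, assuming pytest as the runner and the normalize_url and Frontier sketches above (the crawler module name is hypothetical). No network is involved; fault-injection tests would follow the same pattern with a fetch_page stub that raises or returns error statuses.

```python
import threading

from crawler import Frontier, normalize_url   # hypothetical module layout


def test_normalize_resolves_links_and_strips_fragment_and_default_port():
    assert normalize_url("http://Example.COM:80/a", "b#frag") == "http://example.com/b"
    assert normalize_url("https://example.com/dir/page.html", "../other") == "https://example.com/other"


def test_frontier_dedups_under_concurrency():
    frontier = Frontier()
    results, lock = [], threading.Lock()

    def worker():
        added = frontier.add("https://example.com/", 0)
        with lock:
            results.append(added)

    threads = [threading.Thread(target=worker) for _ in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert results.count(True) == 1       # exactly one thread wins the enqueue
```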

Deliverables
- A brief architecture description and rationale.
- Core algorithm and key components (pseudo-code or code sketch is fine).
- Clear description of data structures and dedup logic.
- Testing strategy and monitoring plan.