System Design: Web Crawler (Single-threaded and Concurrent)
Context
Design and implement a web crawler that starts from a set of seed URLs and explores the web while respecting operational constraints (robots.txt, scope, depth). Then extend the design to three concurrent implementations and compare them.
Assume you can use Python and reasonable open-source libraries for HTTP and HTML parsing. Focus on correctness, liveness, and performance trade-offs. Show code-level designs (pseudo-code or real code skeletons) and justify your data-structure and synchronization choices.
Requirements
- Single-threaded crawler (a minimal sketch follows the requirements list)
  - Inputs: seed URLs, max depth, optional allowed domain scope.
  - Fetch pages, extract links, deduplicate visits, respect robots.txt, stop by depth and/or scope.
  - Handle failures/retries with backoff.
- Concurrent crawler: compare three approaches (sketches of approaches (a) and (c) follow the requirements list)
  a) Manual multithreading using queues and locks
  b) Fixed-size thread pool
  c) Asyncio/event-loop model
  For each approach, explain and implement how you:
  - Maintain a URL frontier
  - Enforce per-host politeness and rate limits
  - Avoid duplicate fetches
  - Handle failures/retries and backoff
  - Manage back-pressure
  - Perform graceful shutdown
- Discuss
  - Data structures (visited sets, frontier queues, per-host buckets; a per-host bucket sketch follows the list)
  - Synchronization primitives and how you prevent deadlocks/starvation
  - Time/space complexity
  - Correctness, liveness, and performance trade-offs across the three models
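A minimal single-threaded sketch along these lines, assuming the `requests` and `beautifulsoup4` libraries are available; the names `crawl`, `fetch_with_retries`, `allowed_by_robots`, and their parameters are illustrative choices rather than part of the problem statement:

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup


def allowed_by_robots(url, parsers, user_agent="*"):
    # Cache one RobotFileParser per host; a robots.txt we cannot read defaults to "allowed".
    host = urlparse(url).netloc
    if host not in parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{urlparse(url).scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None
        parsers[host] = rp
    rp = parsers[host]
    return True if rp is None else rp.can_fetch(user_agent, url)


def fetch_with_retries(url, retries=3, backoff=1.0):
    # Exponential backoff on transient failures; give up after the last attempt.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                return None
            time.sleep(backoff * (2 ** attempt))


def crawl(seeds, max_depth, allowed_domains=None):
    visited = set()                           # dedup on normalized URLs
    frontier = deque((s, 0) for s in seeds)   # BFS frontier of (url, depth)
    robots_cache = {}

    while frontier:
        url, depth = frontier.popleft()
        url, _ = urldefrag(url)               # normalize: drop fragments
        host = urlparse(url).netloc
        if url in visited or depth > max_depth:
            continue
        if allowed_domains and host not in allowed_domains:
            continue
        if not allowed_by_robots(url, robots_cache):
            continue
        visited.add(url)

        html = fetch_with_retries(url)
        if html is None:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            frontier.append((urljoin(url, a["href"]), depth + 1))

    return visited
```

A call such as `crawl(["https://example.com"], max_depth=2, allowed_domains={"example.com"})` would stay inside one site and stop two link hops from the seeds.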
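For approach (a), one possible sketch: a bounded `queue.Queue` as the shared frontier (its limited capacity doubles as back-pressure), a lock-protected visited set, per-host timestamps for politeness, and one `None` sentinel per worker for graceful shutdown. `ThreadedCrawler`, `num_workers`, and `per_host_delay` are illustrative names, and robots.txt checking is omitted for brevity:

```python
import queue
import threading
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(base_url, html):
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


class ThreadedCrawler:
    def __init__(self, seeds, max_depth, num_workers=8, per_host_delay=1.0):
        self.frontier = queue.Queue(maxsize=10_000)   # bounded queue doubles as back-pressure
        self.visited, self.visited_lock = set(), threading.Lock()
        self.last_fetch, self.host_lock = {}, threading.Lock()
        self.max_depth, self.per_host_delay = max_depth, per_host_delay
        self.num_workers = num_workers
        for s in seeds:
            self.frontier.put((s, 0))

    def _politeness_wait(self, host):
        # Reserve the next fetch slot for this host under the lock; sleep outside it.
        with self.host_lock:
            now = time.monotonic()
            ready = max(now, self.last_fetch.get(host, 0.0) + self.per_host_delay)
            self.last_fetch[host] = ready
        time.sleep(max(0.0, ready - now))

    def _worker(self):
        while True:
            item = self.frontier.get()
            if item is None:                          # sentinel: graceful shutdown
                self.frontier.task_done()
                return
            url, depth = item
            try:
                with self.visited_lock:               # check-and-mark must be atomic
                    if url in self.visited or depth > self.max_depth:
                        continue
                    self.visited.add(url)
                self._politeness_wait(urlparse(url).netloc)
                try:
                    resp = requests.get(url, timeout=10)
                    resp.raise_for_status()
                except requests.RequestException:
                    continue                          # a fuller version re-enqueues with backoff
                for link in extract_links(url, resp.text):
                    try:
                        self.frontier.put_nowait((link, depth + 1))
                    except queue.Full:                # drop instead of blocking: avoids self-deadlock
                        pass
            finally:
                self.frontier.task_done()

    def run(self):
        workers = [threading.Thread(target=self._worker) for _ in range(self.num_workers)]
        for t in workers:
            t.start()
        self.frontier.join()                          # every enqueued URL has been processed
        for _ in workers:
            self.frontier.put(None)                   # one sentinel per worker
        for t in workers:
            t.join()
        return self.visited
```

Approach (b) keeps the same shared state and politeness logic but hands the worker loop to a `concurrent.futures.ThreadPoolExecutor`, so thread lifecycle and shutdown are managed by the pool rather than by hand.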
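For approach (c), a sketch assuming `aiohttp` for non-blocking HTTP. Because every coroutine runs on a single event-loop thread, the visited set needs no lock; per-host politeness becomes a per-host `asyncio.Semaphore`, the bounded `asyncio.Queue` provides back-pressure, and shutdown is `join()` on the queue followed by cancelling the now-idle workers. The names and limits are again illustrative, with robots.txt handling left out:

```python
import asyncio
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup


async def crawl_async(seeds, max_depth, num_workers=32, per_host_limit=2):
    frontier = asyncio.Queue(maxsize=10_000)          # bounded queue doubles as back-pressure
    visited = set()                                   # no lock needed: single event-loop thread
    host_semaphores = {}                              # host -> Semaphore capping concurrent fetches

    for s in seeds:
        frontier.put_nowait((s, 0))

    async def fetch(session, url):
        sem = host_semaphores.setdefault(urlparse(url).netloc, asyncio.Semaphore(per_host_limit))
        async with sem:                               # per-host politeness / rate limit
            for attempt in range(3):
                try:
                    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                        resp.raise_for_status()
                        return await resp.text()
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    await asyncio.sleep(2 ** attempt)  # exponential backoff between retries
        return None

    async def worker(session):
        while True:
            url, depth = await frontier.get()
            try:
                if url in visited or depth > max_depth:
                    continue
                visited.add(url)
                html = await fetch(session, url)
                if html is None:
                    continue
                for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                    try:
                        frontier.put_nowait((urljoin(url, a["href"]), depth + 1))
                    except asyncio.QueueFull:         # drop instead of blocking the worker
                        pass
            finally:
                frontier.task_done()

    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session)) for _ in range(num_workers)]
        await frontier.join()                         # all discovered URLs processed
        for t in workers:                             # graceful shutdown: cancel now-idle workers
            t.cancel()
        await asyncio.gather(*workers, return_exceptions=True)

    return visited


# Illustrative entry point: asyncio.run(crawl_async(["https://example.com"], max_depth=2))
```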
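To make the "per-host buckets" discussion point concrete, here is a single-threaded sketch of a frontier that keeps one FIFO bucket per host plus a min-heap of (next-allowed-time, host), so the next URL is always drawn from the host that is polite to fetch soonest; `PerHostFrontier` and its methods are hypothetical names used only for illustration:

```python
import heapq
import time
from collections import defaultdict, deque


class PerHostFrontier:
    """Frontier with one FIFO bucket per host and a heap ordering hosts by next-allowed fetch time."""

    def __init__(self, per_host_delay=1.0):
        self.per_host_delay = per_host_delay
        self.buckets = defaultdict(deque)      # host -> FIFO of (url, depth)
        self.ready_heap = []                   # (next_allowed_time, host)
        self.scheduled = set()                 # hosts currently present in the heap

    def push(self, host, url, depth):
        self.buckets[host].append((url, depth))
        if host not in self.scheduled:
            heapq.heappush(self.ready_heap, (time.monotonic(), host))
            self.scheduled.add(host)

    def pop(self):
        # Return (url, depth) from the host that is polite to fetch soonest, or None if empty.
        while self.ready_heap:
            next_time, host = heapq.heappop(self.ready_heap)
            self.scheduled.discard(host)
            if not self.buckets[host]:         # defensive: skip hosts whose bucket drained
                continue
            delay = next_time - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            url, depth = self.buckets[host].popleft()
            if self.buckets[host]:             # re-schedule the host after its politeness delay
                heapq.heappush(self.ready_heap, (time.monotonic() + self.per_host_delay, host))
                self.scheduled.add(host)
            return url, depth
        return None
```

In this sketch, push and pop cost O(log H) for H distinct hosts; combined with a hash-set visited structure, duplicate checks are O(1), and space grows with the frontier plus the visited set.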