Concurrent Web Crawler — Bounded Thread Pool, Thread-Safe Frontier, Dedupe, Politeness, and Trade-offs
You are refactoring an existing single-threaded crawler to run concurrently. Design the system and explain key concurrency and reliability concerns.
Requirements
Concurrency & Backpressure
- Run crawling on a bounded thread pool. Explain the worker lifecycle, how work is acquired, and when worker threads terminate (one possible shape is sketched below).
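A minimal sketch of the pool, assuming Python's `threading` and `queue` modules; `NUM_WORKERS`, `stop_event`, and `process` are illustrative names, and `process` (fetch, parse, enqueue discovered links) is assumed to exist elsewhere.

```python
import queue
import threading

NUM_WORKERS = 8                       # illustrative pool size
frontier = queue.Queue(maxsize=1000)  # bounded: put() blocks when full (backpressure)
stop_event = threading.Event()        # cooperative cancellation flag

def crawl_worker():
    """Worker lifecycle: block for work, process it, repeat until stopped."""
    while not stop_event.is_set():
        try:
            # Acquire work; the timeout lets an idle worker re-check stop_event.
            url = frontier.get(timeout=0.5)
        except queue.Empty:
            continue
        try:
            process(url)      # assumed: fetch + parse + enqueue discovered links
        except Exception:
            pass              # log and move on; retry policy lives in the fetch layer
        finally:
            frontier.task_done()  # pairs with frontier.join() for quiescence

workers = [threading.Thread(target=crawl_worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
```

Workers terminate only when `stop_event` is set, so the pool size stays fixed for the life of the crawl. Note one caveat with a bounded frontier: workers both consume and produce, so if the queue fills while every worker is blocked on `put()`, the crawl deadlocks; real designs spill excess URLs to disk or an unbounded overflow structure.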
Thread-Safe Data Structures
- URL frontier (work queue): thread-safe, prevents head-of-line blocking, and supports backpressure.
- Visited set: thread-safe deduplication with atomic test-and-set semantics (see the sketch after this list).
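One way to get atomic test-and-set dedupe, as a sketch; the class and method names are illustrative, and `frontier` is reused from the pool sketch above. Holding a lock across the check-then-insert makes the pair atomic, so a URL races into the frontier at most once even when two workers discover it simultaneously.

```python
import threading

class VisitedSet:
    """Thread-safe dedupe: add_if_absent is an atomic test-and-set."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_absent(self, url: str) -> bool:
        """Mark url as seen; return True only for the first caller."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

visited = VisitedSet()

def maybe_enqueue(url):
    # Test-and-set before put: the loser of a race sees False and never
    # enqueues, so the frontier holds each URL at most once.
    if visited.add_if_absent(url):
        frontier.put(url)  # bounded put: blocks under backpressure
```

A single FIFO frontier does not by itself prevent head-of-line blocking: one slow host's URLs can sit in front of everyone else's. A common refinement is per-host sub-queues drained round-robin, which also dovetails with the politeness controls below.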
Politeness & Reliability
- Per-host rate limiting and a cap on in-flight requests per host.
- Timeouts and retries with backoff for transient failures (one combined sketch follows).
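A sketch combining the three politeness controls: a per-host semaphore caps in-flight requests, a per-host earliest-send time implements the rate limit, and transient failures are retried with exponential backoff plus jitter. Here `fetch` is an assumed HTTP callable, `TransientError` is an assumed exception from that layer, and the constants are illustrative.

```python
import random
import threading
import time
from urllib.parse import urlparse

MAX_INFLIGHT_PER_HOST = 2   # illustrative cap
MIN_DELAY_PER_HOST = 1.0    # seconds between request starts to one host

class TransientError(Exception):
    """Assumed: raised by fetch() for timeouts, 5xx responses, resets."""

_state_lock = threading.Lock()
_host_slots = {}   # host -> Semaphore capping in-flight requests
_next_ok = {}      # host -> earliest monotonic time for the next request

def _host_slot(host):
    with _state_lock:  # created under a lock so each host gets exactly one semaphore
        if host not in _host_slots:
            _host_slots[host] = threading.Semaphore(MAX_INFLIGHT_PER_HOST)
            _next_ok[host] = 0.0
        return _host_slots[host]

def _reserve_delay(host):
    """Atomically claim the host's next send slot; return how long to wait."""
    with _state_lock:
        now = time.monotonic()
        start = max(now, _next_ok[host])
        _next_ok[host] = start + MIN_DELAY_PER_HOST
        return start - now

def polite_fetch(url, fetch, retries=3, timeout=10.0):
    host = urlparse(url).netloc
    with _host_slot(host):                     # cap concurrent requests per host
        for attempt in range(retries):
            time.sleep(_reserve_delay(host))   # per-host rate limit, retries included
            try:
                return fetch(url, timeout=timeout)  # assumed HTTP call with timeout
            except TransientError:
                if attempt == retries - 1:
                    raise                      # out of retries: surface the failure
                # exponential backoff with jitter before the next attempt
                time.sleep(2 ** attempt + random.random())
```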
Control & Shutdown
- Support cancellation and graceful shutdown.
- Define the termination/quiescence condition for deciding when the crawl is "done" (sketched below).
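A sketch of shutdown on top of the pool above, reusing `frontier`, `stop_event`, `workers`, and `maybe_enqueue` from the earlier sketches. Quiescence falls out of `Queue.join()`: because workers enqueue discovered links before calling `task_done()`, the unfinished-task count can reach zero only when the frontier is empty and no worker is mid-page.

```python
def crawl(seed_urls):
    for url in seed_urls:
        maybe_enqueue(url)     # dedupe + bounded put, as sketched earlier

    # Quiescence: join() returns once every put() has a matching task_done().
    # Workers enqueue children *before* marking their item done, so the count
    # cannot hit zero while undiscovered work is still being produced.
    frontier.join()

    stop_event.set()           # cancellation signal: idle workers exit their loop
    for w in workers:
        w.join()               # graceful shutdown: wait for threads to finish

# Early cancellation uses the same flag: another thread sets stop_event and
# drains the queue (calling task_done() per abandoned item) so join() returns.
```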
Design Alternatives & Analysis
- Compare coarse-grained locks, fine-grained locks, lock-free/concurrent data structures, and message-queue-based designs (a lock-striping sketch follows).
- Analyze trade-offs in contention, throughput, fairness, and memory usage.
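To make the lock-granularity comparison concrete, here is a lock-striped variant of the visited set sketched earlier; the stripe count is illustrative. It trades the memory of N locks and N sets for lower contention, since threads hashing to different stripes never serialize on each other.

```python
import threading

class StripedVisitedSet:
    """Fine-grained alternative to one global lock: hash URLs across
    independent lock stripes so unrelated inserts do not contend."""

    def __init__(self, stripes=16):  # illustrative stripe count
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._sets = [set() for _ in range(stripes)]

    def add_if_absent(self, url: str) -> bool:
        i = hash(url) % len(self._locks)   # pick this URL's stripe
        with self._locks[i]:               # contention limited to one stripe
            if url in self._sets[i]:
                return False
            self._sets[i].add(url)
            return True
```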
Assume a single-process, multi-threaded crawler (language/runtime of your choice). Provide concise pseudo-code and justify design choices.