Concurrent Web Crawler — Bounded Thread Pool, Thread-Safe Frontier, Dedupe, Politeness, and Trade-offs
You are refactoring an existing single-threaded crawler to run concurrently. Design the system and explain key concurrency and reliability concerns.
Requirements
Concurrency & Backpressure
- Run crawling on a bounded thread pool. Explain the worker lifecycle, how work is acquired, and when worker threads terminate (one possible shape is sketched below).
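A minimal sketch of the pool, assuming Python's `threading` and `queue` modules; `NUM_WORKERS`, `stop_event`, and `process` are illustrative names, and `process` (fetch, parse, enqueue discovered links) is assumed to exist elsewhere.

```python
import queue
import threading

NUM_WORKERS = 8                       # illustrative pool size
frontier = queue.Queue(maxsize=1000)  # bounded: put() blocks when full (backpressure)
stop_event = threading.Event()        # cooperative cancellation flag

def crawl_worker():
    """Worker lifecycle: block for work, process it, repeat until stopped."""
    while not stop_event.is_set():
        try:
            # Acquire work; the timeout lets an idle worker re-check stop_event.
            url = frontier.get(timeout=0.5)
        except queue.Empty:
            continue
        try:
            process(url)      # assumed: fetch + parse + enqueue discovered links
        except Exception:
            pass              # log and move on; retry policy lives in the fetch layer
        finally:
            frontier.task_done()  # pairs with frontier.join() for quiescence

workers = [threading.Thread(target=crawl_worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
```

Workers terminate only when `stop_event` is set, so the pool size stays fixed for the life of the crawl. Note one caveat with a bounded frontier: workers both consume and produce, so if the queue fills while every worker is blocked on `put()`, the crawl deadlocks; real designs spill excess URLs to disk or an unbounded overflow structure.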
Thread-Safe Data Structures
- URL frontier (work queue): thread-safe, prevents head-of-line blocking, and supports backpressure.
- Visited set: thread-safe deduplication with atomic test-and-set semantics (see the sketch after this list).
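One way to get atomic test-and-set dedupe, as a sketch; the class and method names are illustrative, and `frontier` is reused from the pool sketch above. Holding a lock across the check-then-insert makes the pair atomic, so a URL races into the frontier at most once even when two workers discover it simultaneously.

```python
import threading

class VisitedSet:
    """Thread-safe dedupe: add_if_absent is an atomic test-and-set."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_absent(self, url: str) -> bool:
        """Mark url as seen; return True only for the first caller."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

visited = VisitedSet()

def maybe_enqueue(url):
    # Test-and-set before put: the loser of a race sees False and never
    # enqueues, so the frontier holds each URL at most once.
    if visited.add_if_absent(url):
        frontier.put(url)  # bounded put: blocks under backpressure
```

A single FIFO frontier does not by itself prevent head-of-line blocking: one slow host's URLs can sit in front of everyone else's. A common refinement is per-host sub-queues drained round-robin, which also dovetails with the politeness controls below.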
Politeness & Reliability
- Per-host rate limiting and a cap on in-flight requests per host.
- Timeouts and retries with backoff for transient failures (one combined sketch follows).
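A sketch combining the three politeness controls: a per-host semaphore caps in-flight requests, a per-host earliest-send time implements the rate limit, and transient failures are retried with exponential backoff plus jitter. Here `fetch` is an assumed HTTP callable, `TransientError` is an assumed exception from that layer, and the constants are illustrative.

```python
import random
import threading
import time
from urllib.parse import urlparse

MAX_INFLIGHT_PER_HOST = 2   # illustrative cap
MIN_DELAY_PER_HOST = 1.0    # seconds between request starts to one host

class TransientError(Exception):
    """Assumed: raised by fetch() for timeouts, 5xx responses, resets."""

_state_lock = threading.Lock()
_host_slots = {}   # host -> Semaphore capping in-flight requests
_next_ok = {}      # host -> earliest monotonic time for the next request

def _host_slot(host):
    with _state_lock:  # created under a lock so each host gets exactly one semaphore
        if host not in _host_slots:
            _host_slots[host] = threading.Semaphore(MAX_INFLIGHT_PER_HOST)
            _next_ok[host] = 0.0
        return _host_slots[host]

def _reserve_delay(host):
    """Atomically claim the host's next send slot; return how long to wait."""
    with _state_lock:
        now = time.monotonic()
        start = max(now, _next_ok[host])
        _next_ok[host] = start + MIN_DELAY_PER_HOST
        return start - now

def polite_fetch(url, fetch, retries=3, timeout=10.0):
    host = urlparse(url).netloc
    with _host_slot(host):                     # cap concurrent requests per host
        for attempt in range(retries):
            time.sleep(_reserve_delay(host))   # per-host rate limit, retries included
            try:
                return fetch(url, timeout=timeout)  # assumed HTTP call with timeout
            except TransientError:
                if attempt == retries - 1:
                    raise                      # out of retries: surface the failure
                # exponential backoff with jitter before the next attempt
                time.sleep(2 ** attempt + random.random())
```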
Control & Shutdown
- Support cancellation and graceful shutdown.
- Define the termination/quiescence condition for deciding when the crawl is "done" (sketched below).
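A sketch of shutdown on top of the pool above, reusing `frontier`, `stop_event`, `workers`, and `maybe_enqueue` from the earlier sketches. Quiescence falls out of `Queue.join()`: because workers enqueue discovered links before calling `task_done()`, the unfinished-task count can reach zero only when the frontier is empty and no worker is mid-page.

```python
def crawl(seed_urls):
    for url in seed_urls:
        maybe_enqueue(url)     # dedupe + bounded put, as sketched earlier

    # Quiescence: join() returns once every put() has a matching task_done().
    # Workers enqueue children *before* marking their item done, so the count
    # cannot hit zero while undiscovered work is still being produced.
    frontier.join()

    stop_event.set()           # cancellation signal: idle workers exit their loop
    for w in workers:
        w.join()               # graceful shutdown: wait for threads to finish

# Early cancellation uses the same flag: another thread sets stop_event and
# drains the queue (calling task_done() per abandoned item) so join() returns.
```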
Design Alternatives & Analysis
- Compare coarse-grained locks, fine-grained locks, lock-free/concurrent data structures, and message-queue-based designs (a lock-striping sketch follows).
- Analyze trade-offs in contention, throughput, fairness, and memory usage.
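To make the lock-granularity comparison concrete, here is a lock-striped variant of the visited set sketched earlier; the stripe count is illustrative. It trades the memory of N locks and N sets for lower contention, since threads hashing to different stripes never serialize on each other.

```python
import threading

class StripedVisitedSet:
    """Fine-grained alternative to one global lock: hash URLs across
    independent lock stripes so unrelated inserts do not contend."""

    def __init__(self, stripes=16):  # illustrative stripe count
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._sets = [set() for _ in range(stripes)]

    def add_if_absent(self, url: str) -> bool:
        i = hash(url) % len(self._locks)   # pick this URL's stripe
        with self._locks[i]:               # contention limited to one stripe
            if url in self._sets[i]:
                return False
            self._sets[i].add(url)
            return True
```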
Assume a single-process, multi-threaded crawler (language/runtime of your choice). Provide concise pseudo-code and justify design choices.