System Design: Web Crawler (Single-threaded and Concurrent)
Context
Design and implement a web crawler that starts from a set of seed URLs and explores the web while respecting operational constraints (robots.txt, scope, depth). Then extend the design to three concurrent implementations and compare them.
Assume you can use Python and reasonable open-source libraries for HTTP and HTML parsing. Focus on correctness, liveness, and performance trade-offs. Show code-level designs (pseudo-code or real code skeletons) and justify your data-structure and synchronization choices.
Requirements
- Single-threaded crawler (a minimal sketch follows the requirements list)
  - Inputs: seed URLs, max depth, optional allowed domain scope.
  - Fetch pages, extract links, deduplicate visits, respect robots.txt, stop by depth and/or scope.
  - Handle failures/retries with backoff.
- Concurrent crawler: compare three approaches (sketches of approaches (a) and (c) follow the requirements list)
  a) Manual multithreading using queues and locks
  b) Fixed-size thread pool
  c) Asyncio/event-loop model
  For each approach, explain and implement how you:
  - Maintain a URL frontier
  - Enforce per-host politeness and rate limits
  - Avoid duplicate fetches
  - Handle failures/retries and backoff
  - Manage back-pressure
  - Perform graceful shutdown
- Discuss
  - Data structures (visited sets, frontier queues, per-host buckets; a per-host bucket sketch follows the list)
  - Synchronization primitives and how you prevent deadlocks/starvation
  - Time/space complexity
  - Correctness, liveness, and performance trade-offs across the three models
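A minimal single-threaded sketch along these lines, assuming the `requests` and `beautifulsoup4` libraries are available; the names `crawl`, `fetch_with_retries`, `allowed_by_robots`, and their parameters are illustrative choices rather than part of the problem statement:

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup


def allowed_by_robots(url, parsers, user_agent="*"):
    # Cache one RobotFileParser per host; a robots.txt we cannot read defaults to "allowed".
    host = urlparse(url).netloc
    if host not in parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{urlparse(url).scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None
        parsers[host] = rp
    rp = parsers[host]
    return True if rp is None else rp.can_fetch(user_agent, url)


def fetch_with_retries(url, retries=3, backoff=1.0):
    # Exponential backoff on transient failures; give up after the last attempt.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                return None
            time.sleep(backoff * (2 ** attempt))


def crawl(seeds, max_depth, allowed_domains=None):
    visited = set()                           # dedup on normalized URLs
    frontier = deque((s, 0) for s in seeds)   # BFS frontier of (url, depth)
    robots_cache = {}

    while frontier:
        url, depth = frontier.popleft()
        url, _ = urldefrag(url)               # normalize: drop fragments
        host = urlparse(url).netloc
        if url in visited or depth > max_depth:
            continue
        if allowed_domains and host not in allowed_domains:
            continue
        if not allowed_by_robots(url, robots_cache):
            continue
        visited.add(url)

        html = fetch_with_retries(url)
        if html is None:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            frontier.append((urljoin(url, a["href"]), depth + 1))

    return visited
```

A call such as `crawl(["https://example.com"], max_depth=2, allowed_domains={"example.com"})` would stay inside one site and stop two link hops from the seeds.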
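For approach (a), one possible sketch: a bounded `queue.Queue` as the shared frontier (its limited capacity doubles as back-pressure), a lock-protected visited set, per-host timestamps for politeness, and one `None` sentinel per worker for graceful shutdown. `ThreadedCrawler`, `num_workers`, and `per_host_delay` are illustrative names, and robots.txt checking is omitted for brevity:

```python
import queue
import threading
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(base_url, html):
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


class ThreadedCrawler:
    def __init__(self, seeds, max_depth, num_workers=8, per_host_delay=1.0):
        self.frontier = queue.Queue(maxsize=10_000)   # bounded queue doubles as back-pressure
        self.visited, self.visited_lock = set(), threading.Lock()
        self.last_fetch, self.host_lock = {}, threading.Lock()
        self.max_depth, self.per_host_delay = max_depth, per_host_delay
        self.num_workers = num_workers
        for s in seeds:
            self.frontier.put((s, 0))

    def _politeness_wait(self, host):
        # Reserve the next fetch slot for this host under the lock; sleep outside it.
        with self.host_lock:
            now = time.monotonic()
            ready = max(now, self.last_fetch.get(host, 0.0) + self.per_host_delay)
            self.last_fetch[host] = ready
        time.sleep(max(0.0, ready - now))

    def _worker(self):
        while True:
            item = self.frontier.get()
            if item is None:                          # sentinel: graceful shutdown
                self.frontier.task_done()
                return
            url, depth = item
            try:
                with self.visited_lock:               # check-and-mark must be atomic
                    if url in self.visited or depth > self.max_depth:
                        continue
                    self.visited.add(url)
                self._politeness_wait(urlparse(url).netloc)
                try:
                    resp = requests.get(url, timeout=10)
                    resp.raise_for_status()
                except requests.RequestException:
                    continue                          # a fuller version re-enqueues with backoff
                for link in extract_links(url, resp.text):
                    try:
                        self.frontier.put_nowait((link, depth + 1))
                    except queue.Full:                # drop instead of blocking: avoids self-deadlock
                        pass
            finally:
                self.frontier.task_done()

    def run(self):
        workers = [threading.Thread(target=self._worker) for _ in range(self.num_workers)]
        for t in workers:
            t.start()
        self.frontier.join()                          # every enqueued URL has been processed
        for _ in workers:
            self.frontier.put(None)                   # one sentinel per worker
        for t in workers:
            t.join()
        return self.visited
```

Approach (b) keeps the same shared state and politeness logic but hands the worker loop to a `concurrent.futures.ThreadPoolExecutor`, so thread lifecycle and shutdown are managed by the pool rather than by hand.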
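For approach (c), a sketch assuming `aiohttp` for non-blocking HTTP. Because every coroutine runs on a single event-loop thread, the visited set needs no lock; per-host politeness becomes a per-host `asyncio.Semaphore`, the bounded `asyncio.Queue` provides back-pressure, and shutdown is `join()` on the queue followed by cancelling the now-idle workers. The names and limits are again illustrative, with robots.txt handling left out:

```python
import asyncio
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup


async def crawl_async(seeds, max_depth, num_workers=32, per_host_limit=2):
    frontier = asyncio.Queue(maxsize=10_000)          # bounded queue doubles as back-pressure
    visited = set()                                   # no lock needed: single event-loop thread
    host_semaphores = {}                              # host -> Semaphore capping concurrent fetches

    for s in seeds:
        frontier.put_nowait((s, 0))

    async def fetch(session, url):
        sem = host_semaphores.setdefault(urlparse(url).netloc, asyncio.Semaphore(per_host_limit))
        async with sem:                               # per-host politeness / rate limit
            for attempt in range(3):
                try:
                    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                        resp.raise_for_status()
                        return await resp.text()
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    await asyncio.sleep(2 ** attempt)  # exponential backoff between retries
        return None

    async def worker(session):
        while True:
            url, depth = await frontier.get()
            try:
                if url in visited or depth > max_depth:
                    continue
                visited.add(url)
                html = await fetch(session, url)
                if html is None:
                    continue
                for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                    try:
                        frontier.put_nowait((urljoin(url, a["href"]), depth + 1))
                    except asyncio.QueueFull:         # drop instead of blocking the worker
                        pass
            finally:
                frontier.task_done()

    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session)) for _ in range(num_workers)]
        await frontier.join()                         # all discovered URLs processed
        for t in workers:                             # graceful shutdown: cancel now-idle workers
            t.cancel()
        await asyncio.gather(*workers, return_exceptions=True)

    return visited


# Illustrative entry point: asyncio.run(crawl_async(["https://example.com"], max_depth=2))
```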
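To make the "per-host buckets" discussion point concrete, here is a single-threaded sketch of a frontier that keeps one FIFO bucket per host plus a min-heap of (next-allowed-time, host), so the next URL is always drawn from the host that is polite to fetch soonest; `PerHostFrontier` and its methods are hypothetical names used only for illustration:

```python
import heapq
import time
from collections import defaultdict, deque


class PerHostFrontier:
    """Frontier with one FIFO bucket per host and a heap ordering hosts by next-allowed fetch time."""

    def __init__(self, per_host_delay=1.0):
        self.per_host_delay = per_host_delay
        self.buckets = defaultdict(deque)      # host -> FIFO of (url, depth)
        self.ready_heap = []                   # (next_allowed_time, host)
        self.scheduled = set()                 # hosts currently present in the heap

    def push(self, host, url, depth):
        self.buckets[host].append((url, depth))
        if host not in self.scheduled:
            heapq.heappush(self.ready_heap, (time.monotonic(), host))
            self.scheduled.add(host)

    def pop(self):
        # Return (url, depth) from the host that is polite to fetch soonest, or None if empty.
        while self.ready_heap:
            next_time, host = heapq.heappop(self.ready_heap)
            self.scheduled.discard(host)
            if not self.buckets[host]:         # defensive: skip hosts whose bucket drained
                continue
            delay = next_time - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            url, depth = self.buckets[host].popleft()
            if self.buckets[host]:             # re-schedule the host after its politeness delay
                heapq.heappush(self.ready_heap, (time.monotonic() + self.per_host_delay, host))
                self.scheduled.add(host)
            return url, depth
        return None
```

In this sketch, push and pop cost O(log H) for H distinct hosts; combined with a hash-set visited structure, duplicate checks are O(1), and space grows with the frontier plus the visited set.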