System Design: Concurrent Web Crawler (Threads)
Design and implement a basic web crawler that fetches pages concurrently using a thread executor (e.g. a ThreadPoolExecutor). The crawler should be production-conscious — correct, robust, and observable — while remaining reasonably simple. Where it helps, prefer standard URL parsing utilities (such as Python's urlparse/urljoin) for handling URLs.
You may present your answer as a brief design plus a code or pseudo-code sketch of the core algorithm and key components.
Input
-
Accept
one or more seed URLs
.
-
Support an optional flag to
restrict crawling to the same origin
as the seeds (matching on
scheme, host, and port
).
Crawling behavior
-
Fetch pages
concurrently
via a thread executor with a
configurable max worker count
.
-
Cap crawl depth
from each seed.
-
Extract links
from HTML pages and enqueue newly discovered URLs.
-
Normalize and resolve links robustly
— relative links, fragments, default ports, casing, and similar cases.
-
Avoid revisiting
the same normalized URL (dedup across both in-queue and already-visited URLs).
Compliance and politeness
-
Respect
robots.txt
: apply allow/disallow rules per user-agent, cache the rules per host, and honor
crawl-delay
if present.
-
Enforce
per-host politeness / rate limiting
— e.g. at most one request per host per X seconds (configurable) — and honor
Retry-After
on
429
/
503
responses.
Networking
-
Handle redirects
: update to the final URL and dedup on the normalized final URL.
-
Handle HTTP errors and timeouts gracefully
— do not crash, and back off when appropriate.
-
Filter by content type
(e.g. only
text/html
by default).
Data structures and strategy
-
Describe the data structures for the
frontier
and the
visited set
.
-
Describe the
duplicate-detection strategy
, including how you handle
enqueued vs. fetched URLs
and
redirects
.
Testing and monitoring
-
Explain
how you would test
the crawler — unit, integration, concurrency, and fault-injection tests.
-
Describe
what you would monitor/measure
in a real run — metrics, logs, and alerts.
Deliverables
-
A brief
architecture description and rationale
.
-
The
core algorithm and key components
(pseudo-code or a code sketch is fine).
-
A clear description of the
data structures and dedup logic
.
-
A
testing strategy
and a
monitoring plan
.