This question evaluates understanding of concurrent system design, networking and parsing concerns, URL normalization and deduplication strategies, per-host politeness and rate limiting, error handling and observability for building robust web crawlers.
Design and implement a basic web crawler that fetches pages concurrently using a thread executor. Requirements: accept one or more seed URLs; use robust URL parsing to normalize and resolve links; avoid revisiting the same normalized URL; respect robots.txt and per-host politeness (rate limiting); cap concurrency and depth; optionally restrict to same-origin. Handle redirects, HTTP errors, timeouts, and content-type filtering. Describe data structures for the frontier and visited set, duplicate detection strategy, and how you would test and monitor the crawler.
Quick Answer: This question evaluates understanding of concurrent system design, networking and parsing concerns, URL normalization and deduplication strategies, per-host politeness and rate limiting, error handling and observability for building robust web crawlers.
Design and implement a basic web crawler that fetches pages concurrently using a thread executor (e.g. a ThreadPoolExecutor). The crawler should be production-conscious — correct, robust, and observable — while remaining reasonably simple. Where it helps, prefer standard URL parsing utilities (such as Python's urlparse/urljoin) for handling URLs.
You may present your answer as a brief design plus a code or pseudo-code sketch of the core algorithm and key components.
Constraints & Assumptions
These anchor the discussion; treat the numbers as defaults you may revise after asking clarifying questions, not hard limits the solution must hit.
Workload shape:
crawling is
I/O-bound
(network latency dominates over CPU), which motivates a thread-based executor rather than CPU-bound multiprocessing.
Scale:
plan for up to roughly
106
unique URLs in a single in-process run; beyond that, the dedup structures may need to change.
Concurrency:
a
configurable max worker count
(e.g. on the order of tens of threads); the value is tunable, not fixed.
Politeness defaults:
at most one request per host per configurable interval (e.g. ~1s) unless
robots.txt
specifies a different
crawl-delay
.
Content scope:
by default only follow/parse
text/html
(and equivalent) responses; non-HTML responses are fetched-and-skipped, not parsed.
Single process:
assume one machine / one process unless the candidate chooses to discuss distribution as an extension.
Clarifying Questions to Ask
A strong candidate scopes the problem before designing. Reasonable questions include:
Goal of the crawl
— are we indexing a single site (same-origin) or crawling the open web from arbitrary seeds? Is breadth (coverage) or freshness the priority?
Scale & budget
— roughly how many URLs/pages, and is there a wall-clock, page-count, or byte budget the crawl must respect?
Politeness requirements
— must we strictly honor
robots.txt
and
Crawl-delay
, and what default per-host rate limit is acceptable?
Persistence
— is this a one-shot in-memory run, or must the frontier/visited state survive a process restart?
Output
— what do downstream consumers need (the raw HTML, extracted links, a normalized URL list, page metadata)?
Environment
— single process/machine, or should the design extend to a distributed crawl? Are we free to use a library like
requests
, or is this a from-scratch exercise?
Requirements
Input
Accept
one or more seed URLs
.
Support an optional flag to
restrict crawling to the same origin
as the seeds (matching on
scheme, host, and port
).
Crawling behavior
Fetch pages
concurrently
via a thread executor with a
configurable max worker count
.
Cap crawl depth
from each seed.
Extract links
from HTML pages and enqueue newly discovered URLs.
Normalize and resolve links robustly
— relative links, fragments, default ports, casing, and similar cases.
Avoid revisiting
the same normalized URL (dedup across both in-queue and already-visited URLs).
Compliance and politeness
Respect robots.txt
: apply allow/disallow rules per user-agent, cache the rules per host, and honor
crawl-delay
if present.
Enforce
per-host politeness / rate limiting
— e.g. at most one request per host per X seconds (configurable) — and honor
Retry-After
on
429
/
503
responses.
Networking
Handle redirects
: update to the final URL and dedup on the normalized final URL.
Handle HTTP errors and timeouts gracefully
— do not crash, and back off when appropriate.
Filter by content type
(e.g. only
text/html
by default).
Data structures and strategy
Describe the data structures for the
frontier
and the
visited set
.
Describe the
duplicate-detection strategy
, including how you handle
enqueued vs. fetched URLs
and
redirects
.
Termination
The crawl must stop cleanly when there is genuinely no more work — not the instant the queue momentarily drains while a slow fetch is still in flight.
Testing and monitoring
Explain
how you would test
the crawler — unit, integration, concurrency, and fault-injection tests.
Describe
what you would monitor/measure
in a real run — metrics, logs, and alerts.
Deliverables
A brief
architecture description and rationale
.
The
core algorithm and key components
(pseudo-code or a code sketch is fine).
A clear description of the
data structures and dedup logic
.
A
testing strategy
and a
monitoring plan
.
What a Strong Answer Covers
The interviewer is listening for these dimensions (not these answers — those are for you to produce):
Concurrency model justified
— why a thread pool fits an I/O-bound workload, and how
max_workers
is bounded.
A clean component decomposition
— frontier, fetcher, parser, URL normalizer, robots cache, per-host rate limiter, orchestrator, and how they interact.
A precise dedup story
— what the canonical key is, where dedup happens (enqueue vs. fetch), and how redirects/canonical URLs fold in.
Correct termination
— recognizing that an empty queue alone is not "done," and proposing a concrete quiescence-detection mechanism plus a backstop.
Politeness done right
— robots semantics including failure modes, per-host limiting, and a sound
Retry-After
/backoff story that doesn't drop throttled work.
Robustness
— graceful handling of timeouts, HTTP errors, malformed HTML, oversized bodies, and socket hygiene; a single bad page must never kill a worker.
Bounded resources
— depth, page, byte, and time budgets; awareness of crawler traps / infinite URL spaces.
Testability & observability
— a layered test plan (with injected time/fault injection) and concrete metrics, structured logs, and alert conditions.
Communication
— stating assumptions, calling out the non-obvious risks (dedup, termination, leaks) explicitly.
Follow-up Questions
Be ready for the interviewer to push deeper:
What breaks first as you scale this to many millions of URLs on one box
— the dedup sets, the frontier, the connection pool, or something else? What do you change?
How would you make this distributed
so per-host politeness stays correct across multiple crawler nodes? (Hint to your own thinking: who "owns" a host?)
How would the design change if it had to survive a process restart
mid-crawl without re-fetching everything or violating politeness?
Where would async I/O (asyncio/aiohttp) beat the thread pool
, and what does it cost you in complexity — when would you
not
reach for it?