Design a concurrent web crawler

Q: Design a concurrent web crawler

This question evaluates understanding of concurrent system design, networking and parsing concerns, URL normalization and deduplication strategies, per-host politeness and rate limiting, error handling and observability for building robust web crawlers.

Q: How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

Q: What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Onsite rounds at Anthropic.

Q: What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Question

System Design: Concurrent Web Crawler (Threads)

Design and implement a basic web crawler that fetches pages concurrently using a thread executor (e.g. a ThreadPoolExecutor). The crawler should be production-conscious — correct, robust, and observable — while remaining reasonably simple. Where it helps, prefer standard URL parsing utilities (such as Python's urlparse/urljoin) for handling URLs.

You may present your answer as a brief design plus a code or pseudo-code sketch of the core algorithm and key components.

Constraints & Assumptions

These anchor the discussion; treat the numbers as defaults you may revise after asking clarifying questions, not hard limits the solution must hit.

Workload shape: crawling is I/O-bound (network latency dominates over CPU), which motivates a thread-based executor rather than CPU-bound multiprocessing.
Scale: plan for up to roughly $10^6$ unique URLs in a single in-process run; beyond that, the dedup structures may need to change.
Concurrency: a configurable max worker count (e.g. on the order of tens of threads); the value is tunable, not fixed.
Politeness defaults: at most one request per host per configurable interval (e.g. ~1s) unless robots.txt specifies a different crawl-delay .
Content scope: by default only follow/parse text/html (and equivalent) responses; non-HTML responses are fetched-and-skipped, not parsed.
Single process: assume one machine / one process unless the candidate chooses to discuss distribution as an extension.

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. Reasonable questions include:

Goal of the crawl — are we indexing a single site (same-origin) or crawling the open web from arbitrary seeds? Is breadth (coverage) or freshness the priority?
Scale & budget — roughly how many URLs/pages, and is there a wall-clock, page-count, or byte budget the crawl must respect?
Politeness requirements — must we strictly honor robots.txt and Crawl-delay , and what default per-host rate limit is acceptable?
Persistence — is this a one-shot in-memory run, or must the frontier/visited state survive a process restart?
Output — what do downstream consumers need (the raw HTML, extracted links, a normalized URL list, page metadata)?
Environment — single process/machine, or should the design extend to a distributed crawl? Are we free to use a library like requests , or is this a from-scratch exercise?

Requirements

Input

Accept one or more seed URLs .
Support an optional flag to restrict crawling to the same origin as the seeds (matching on scheme, host, and port ).

Crawling behavior

Fetch pages concurrently via a thread executor with a configurable max worker count .
Cap crawl depth from each seed.
Extract links from HTML pages and enqueue newly discovered URLs.
Normalize and resolve links robustly — relative links, fragments, default ports, casing, and similar cases.
Avoid revisiting the same normalized URL (dedup across both in-queue and already-visited URLs).

Compliance and politeness

Respect robots.txt : apply allow/disallow rules per user-agent, cache the rules per host, and honor crawl-delay if present.
Enforce per-host politeness / rate limiting — e.g. at most one request per host per X seconds (configurable) — and honor Retry-After on 429 / 503 responses.

Networking

Handle redirects : update to the final URL and dedup on the normalized final URL.
Handle HTTP errors and timeouts gracefully — do not crash, and back off when appropriate.
Filter by content type (e.g. only text/html by default).

Data structures and strategy

Describe the data structures for the frontier and the visited set .
Describe the duplicate-detection strategy , including how you handle enqueued vs. fetched URLs and redirects .

Termination

The crawl must stop cleanly when there is genuinely no more work — not the instant the queue momentarily drains while a slow fetch is still in flight.

Testing and monitoring

Explain how you would test the crawler — unit, integration, concurrency, and fault-injection tests.
Describe what you would monitor/measure in a real run — metrics, logs, and alerts.

Deliverables

A brief architecture description and rationale .
The core algorithm and key components (pseudo-code or a code sketch is fine).
A clear description of the data structures and dedup logic .
A testing strategy and a monitoring plan .

What a Strong Answer Covers

The interviewer is listening for these dimensions (not these answers — those are for you to produce):

Concurrency model justified — why a thread pool fits an I/O-bound workload, and how max_workers is bounded.
A clean component decomposition — frontier, fetcher, parser, URL normalizer, robots cache, per-host rate limiter, orchestrator, and how they interact.
A precise dedup story — what the canonical key is, where dedup happens (enqueue vs. fetch), and how redirects/canonical URLs fold in.
Correct termination — recognizing that an empty queue alone is not "done," and proposing a concrete quiescence-detection mechanism plus a backstop.
Politeness done right — robots semantics including failure modes, per-host limiting, and a sound Retry-After /backoff story that doesn't drop throttled work.
Robustness — graceful handling of timeouts, HTTP errors, malformed HTML, oversized bodies, and socket hygiene; a single bad page must never kill a worker.
Bounded resources — depth, page, byte, and time budgets; awareness of crawler traps / infinite URL spaces.
Testability & observability — a layered test plan (with injected time/fault injection) and concrete metrics, structured logs, and alert conditions.
Communication — stating assumptions, calling out the non-obvious risks (dedup, termination, leaks) explicitly.

Follow-up Questions

Be ready for the interviewer to push deeper:

What breaks first as you scale this to many millions of URLs on one box — the dedup sets, the frontier, the connection pool, or something else? What do you change?
How would you make this distributed so per-host politeness stays correct across multiple crawler nodes? (Hint to your own thinking: who "owns" a host?)
How would the design change if it had to survive a process restart mid-crawl without re-fetching everything or violating politeness?
Where would async I/O (asyncio/aiohttp) beat the thread pool , and what does it cost you in complexity — when would you not reach for it?

Design a concurrent web crawler

Quick Overview

Design a concurrent web crawler

System Design: Concurrent Web Crawler (Threads)

Constraints & Assumptions

Clarifying Questions to Ask

Requirements

Input

Crawling behavior

Compliance and politeness

Networking

Data structures and strategy

Termination

Testing and monitoring

Deliverables

What a Strong Answer Covers

Follow-up Questions

Submit Your Answer to Earn 20XP

Design a concurrent web crawler

Quick Overview

Design a concurrent web crawler

System Design: Concurrent Web Crawler (Threads)

Constraints & Assumptions

Clarifying Questions to Ask

Requirements

Input

Crawling behavior

Compliance and politeness

Networking

Data structures and strategy

Termination

Testing and monitoring

Deliverables

What a Strong Answer Covers

Follow-up Questions

Submit Your Answer to Earn 20XP