Web Crawlers, URL Normalization, And Politeness

What's being tested

These interviews probe whether you can design a concurrent, polite, fault-tolerant crawler rather than just “spawn threads and fetch URLs.” The interviewer is looking for practical distributed-systems judgment: URL frontier design, deduplication, synchronization, per-host rate limiting, failure recovery, and observability under network uncertainty. Anthropic cares about this because crawlers exercise the same engineering muscles as many production systems: high fan-out I/O, adversarial inputs, backpressure, fairness, idempotency, and careful resource usage. A strong answer makes tradeoffs explicit: single-machine versus distributed, exact versus approximate deduplication, throughput versus politeness, and freshness versus crawl coverage.

Core knowledge

URL frontier is the central scheduling abstraction: it stores discovered-but-unfetched URLs and decides what to fetch next. A robust design usually separates a global priority queue from per-host queues so you can enforce fairness and host-specific politeness without starving the crawl.
Per-host politeness means never hammering one domain even if many workers are idle. Track nextFetchTime[host]; a URL is eligible only when now >= nextFetchTime[host]. If crawl delay is $d$ seconds and last fetch was $t$ , schedule the next fetch at $t + d$ .
Concurrency model depends on scale. On one machine, a thread pool or async I/O loop with bounded queues is enough; network fetches are I/O-bound, so asyncio, epoll, or nonblocking sockets can outperform thousands of threads. Distributed crawlers need partitioning by host or URL hash.
Host partitioning is safer than pure URL hashing when enforcing politeness. If all URLs for example.com route to the same shard, rate limiting is local and simple. If URLs are randomly distributed, you need distributed locks or shared rate-limit state, which adds latency and failure modes.
URL normalization prevents duplicate work. Normalize scheme and host case, remove default ports, resolve . and .., strip fragments, sort query parameters when order is semantically irrelevant, percent-decode safe characters, and canonicalize trailing slashes carefully. Be conservative: over-normalization can merge distinct pages.
URL deduplication usually combines exact and approximate techniques. Exact sets using URL hashes work up to memory limits; a 64-bit hash for 1B URLs needs roughly 8 GB just for hashes before overhead. Bloom filters reduce memory with false positives: $p \approx \left(1 - e^{-kn/m}\right)^k$ where $m$ is bits, $n$ entries, and $k$ hash functions.
Content deduplication catches different URLs serving the same page. Use cryptographic hashes like SHA-256 for exact duplicates, and SimHash or MinHash for near-duplicates. URL dedup saves fetches; content dedup saves storage and downstream processing.
Robots and crawl directives are part of politeness. Fetch and cache robots.txt per host, honor Disallow, Allow, and Crawl-delay where applicable, and choose a clear user agent. Cache negative results too, but set TTLs because policies can change.
Backpressure is mandatory. Bound the frontier, fetch queue, parser queue, and storage writes; if storage slows down, workers should stop pulling more URLs rather than accumulating unbounded memory. Track queue depth, worker utilization, and fetch latency percentiles such as p95 and p99.
Fault tolerance requires idempotent state transitions. A URL can move through states like DISCOVERED, SCHEDULED, FETCHING, FETCHED, FAILED, and RETRYABLE. If a worker dies mid-fetch, a lease or visibility timeout should return the URL to the frontier without duplicating completed work.
Retry policy should distinguish transient and permanent failures. Retry 429, 500, 502, 503, and timeouts with exponential backoff and jitter; do not blindly retry 404 or 410. For 429, respect Retry-After and lower the host’s effective rate.
Observability should expose both crawler health and ethical behavior. Useful metrics include fetches/sec, bytes/sec, unique URLs discovered, duplicate rate, error rate by status code, per-host request rate, robots-denied count, frontier size, retry count, and storage write latency.

Worked example

For Design a concurrent web crawler, start by clarifying scope: “Is this single-machine or distributed? Should we crawl the whole web or a fixed set of domains? Do we need freshness recrawls, or just one-time discovery? What politeness constraints must we honor?” Then declare assumptions, for example: single region, distributed crawler, billions of URLs, HTML pages only, strict per-host rate limits, and persistent crawl state.

A strong answer can be organized into four pillars: frontier and scheduling, fetch/parse pipeline, deduplication and storage, and fault tolerance/observability. For frontier design, propose per-host queues plus a global min-heap ordered by each host’s next eligible fetch time. Workers pop an eligible host, fetch one URL, parse links, normalize and dedup them, enqueue new URLs, and update that host’s next fetch time.

For deduplication, use an exact URL-hash store if scale permits, or a Bloom filter in front of durable storage to reduce lookups. For persistence, store URL metadata, fetch status, content hash, timestamps, HTTP status, and retry count in a durable database or log-backed store. One explicit tradeoff to flag: partitioning by host simplifies politeness and reduces coordination, but may create hot shards for huge domains like wikipedia.org; you can mitigate with intra-host subqueues while preserving a single rate limiter.

Close by saying that, with more time, you would discuss recrawl scheduling, canonical URL signals, distributed rebalancing, and anti-abuse safeguards such as robots compliance and adaptive throttling after 429 responses.

A second angle

For Implement crawler, dedup, and persistent LRU, the same concepts become more implementation-focused and less about large-scale distributed architecture. The interviewer may expect you to write or sketch code for graph traversal, URL normalization, a visited set, and a durable cache with LRU eviction. Instead of discussing host-sharded frontier services, focus on clean data structures: a queue for BFS, a hash set for visited URLs, a doubly linked list plus hash map for O(1) LRU operations, and serialization for restart. The same tradeoff appears at smaller scale: exact deduplication is simple and correct, while persistent approximate deduplication saves memory but can skip valid URLs. This variant rewards code clarity, edge-case handling, and deterministic behavior under crashes.

Common pitfalls

Pitfall: Designing only a thread pool and a queue.

This is the most common incomplete answer: “Use 100 workers, each pops a URL and fetches it.” That ignores per-host politeness, dedup races, retries, and bounded memory. A better answer starts with the frontier as the control plane and treats workers as stateless executors.

Pitfall: Over-normalizing URLs.

It is tempting to sort all query parameters, drop all tracking-looking parameters, or force trailing-slash conventions globally. Those rules can merge distinct resources, especially for sites where query order or parameters are meaningful. Say you would apply conservative normalization first, then optionally learn site-specific canonicalization from redirects, <link rel="canonical">, and observed duplicates.

Pitfall: Claiming “exactly once” crawling.

In distributed networked systems, exactly-once fetch semantics are usually unrealistic and unnecessary. Aim for at-least-once execution with idempotent writes, deduplication, leases, and content hashing. Interviewers respond better to acknowledging duplicate fetches as possible and showing how you contain their cost.

Connections

Interviewers may pivot from this topic into distributed queues, rate limiting algorithms, consistent hashing, Bloom filters, cache eviction, or web security concerns such as SSRF prevention and malicious HTML parsing. They may also ask for a deeper dive on scheduler fairness, persistent state machines, or debugging a crawler whose throughput suddenly drops.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts