PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Anthropic

Design a concurrent web crawler

Last updated: May 8, 2026

Quick Overview

This question evaluates understanding of concurrent system design, networking and parsing concerns, URL normalization and deduplication strategies, per-host politeness and rate limiting, error handling and observability for building robust web crawlers.

  • hard
  • Anthropic
  • System Design
  • Software Engineer

Design a concurrent web crawler

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design and implement a basic web crawler that fetches pages concurrently using a thread executor. Requirements: accept one or more seed URLs; use robust URL parsing to normalize and resolve links; avoid revisiting the same normalized URL; respect robots.txt and per-host politeness (rate limiting); cap concurrency and depth; optionally restrict to same-origin. Handle redirects, HTTP errors, timeouts, and content-type filtering. Describe data structures for the frontier and visited set, duplicate detection strategy, and how you would test and monitor the crawler.

Quick Answer: This question evaluates understanding of concurrent system design, networking and parsing concerns, URL normalization and deduplication strategies, per-host politeness and rate limiting, error handling and observability for building robust web crawlers.

Related Interview Questions

  • Design a One-on-One Chat Service - Anthropic (medium)
  • Design a prompt playground - Anthropic (hard)
  • Scale Duplicate File Detection - Anthropic (medium)
  • Design a one-to-one chat system - Anthropic (medium)
  • Design One-to-One Chat - Anthropic (medium)
|Home/System Design/Anthropic

Design a concurrent web crawler

Anthropic logo
Anthropic
Sep 6, 2025, 12:00 AM
hardSoftware EngineerOnsiteSystem Design
86
0

System Design: Concurrent Web Crawler (Threads)

Design and implement a basic web crawler that fetches pages concurrently using a thread executor (e.g. a ThreadPoolExecutor). The crawler should be production-conscious — correct, robust, and observable — while remaining reasonably simple. Where it helps, prefer standard URL parsing utilities (such as Python's urlparse/urljoin) for handling URLs.

You may present your answer as a brief design plus a code or pseudo-code sketch of the core algorithm and key components.

Constraints & Assumptions

These anchor the discussion; treat the numbers as defaults you may revise after asking clarifying questions, not hard limits the solution must hit.

  • Workload shape: crawling is I/O-bound (network latency dominates over CPU), which motivates a thread-based executor rather than CPU-bound multiprocessing.
  • Scale: plan for up to roughly 10610^6106 unique URLs in a single in-process run; beyond that, the dedup structures may need to change.
  • Concurrency: a configurable max worker count (e.g. on the order of tens of threads); the value is tunable, not fixed.
  • Politeness defaults: at most one request per host per configurable interval (e.g. ~1s) unless robots.txt specifies a different crawl-delay .
  • Content scope: by default only follow/parse text/html (and equivalent) responses; non-HTML responses are fetched-and-skipped, not parsed.
  • Single process: assume one machine / one process unless the candidate chooses to discuss distribution as an extension.

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. Reasonable questions include:

  1. Goal of the crawl — are we indexing a single site (same-origin) or crawling the open web from arbitrary seeds? Is breadth (coverage) or freshness the priority?
  2. Scale & budget — roughly how many URLs/pages, and is there a wall-clock, page-count, or byte budget the crawl must respect?
  3. Politeness requirements — must we strictly honor robots.txt and Crawl-delay , and what default per-host rate limit is acceptable?
  4. Persistence — is this a one-shot in-memory run, or must the frontier/visited state survive a process restart?
  5. Output — what do downstream consumers need (the raw HTML, extracted links, a normalized URL list, page metadata)?
  6. Environment — single process/machine, or should the design extend to a distributed crawl? Are we free to use a library like requests , or is this a from-scratch exercise?

Requirements

Input

  • Accept one or more seed URLs .
  • Support an optional flag to restrict crawling to the same origin as the seeds (matching on scheme, host, and port ).

Crawling behavior

  • Fetch pages concurrently via a thread executor with a configurable max worker count .
  • Cap crawl depth from each seed.
  • Extract links from HTML pages and enqueue newly discovered URLs.
  • Normalize and resolve links robustly — relative links, fragments, default ports, casing, and similar cases.
  • Avoid revisiting the same normalized URL (dedup across both in-queue and already-visited URLs).

Compliance and politeness

  • Respect robots.txt : apply allow/disallow rules per user-agent, cache the rules per host, and honor crawl-delay if present.
  • Enforce per-host politeness / rate limiting — e.g. at most one request per host per X seconds (configurable) — and honor Retry-After on 429 / 503 responses.

Networking

  • Handle redirects : update to the final URL and dedup on the normalized final URL.
  • Handle HTTP errors and timeouts gracefully — do not crash, and back off when appropriate.
  • Filter by content type (e.g. only text/html by default).

Data structures and strategy

  • Describe the data structures for the frontier and the visited set .
  • Describe the duplicate-detection strategy , including how you handle enqueued vs. fetched URLs and redirects .

Termination

The crawl must stop cleanly when there is genuinely no more work — not the instant the queue momentarily drains while a slow fetch is still in flight.

Testing and monitoring

  • Explain how you would test the crawler — unit, integration, concurrency, and fault-injection tests.
  • Describe what you would monitor/measure in a real run — metrics, logs, and alerts.

Deliverables

  • A brief architecture description and rationale .
  • The core algorithm and key components (pseudo-code or a code sketch is fine).
  • A clear description of the data structures and dedup logic .
  • A testing strategy and a monitoring plan .

What a Strong Answer Covers

The interviewer is listening for these dimensions (not these answers — those are for you to produce):

  • Concurrency model justified — why a thread pool fits an I/O-bound workload, and how max_workers is bounded.
  • A clean component decomposition — frontier, fetcher, parser, URL normalizer, robots cache, per-host rate limiter, orchestrator, and how they interact.
  • A precise dedup story — what the canonical key is, where dedup happens (enqueue vs. fetch), and how redirects/canonical URLs fold in.
  • Correct termination — recognizing that an empty queue alone is not "done," and proposing a concrete quiescence-detection mechanism plus a backstop.
  • Politeness done right — robots semantics including failure modes, per-host limiting, and a sound Retry-After /backoff story that doesn't drop throttled work.
  • Robustness — graceful handling of timeouts, HTTP errors, malformed HTML, oversized bodies, and socket hygiene; a single bad page must never kill a worker.
  • Bounded resources — depth, page, byte, and time budgets; awareness of crawler traps / infinite URL spaces.
  • Testability & observability — a layered test plan (with injected time/fault injection) and concrete metrics, structured logs, and alert conditions.
  • Communication — stating assumptions, calling out the non-obvious risks (dedup, termination, leaks) explicitly.

Follow-up Questions

Be ready for the interviewer to push deeper:

  1. What breaks first as you scale this to many millions of URLs on one box — the dedup sets, the frontier, the connection pool, or something else? What do you change?
  2. How would you make this distributed so per-host politeness stays correct across multiple crawler nodes? (Hint to your own thinking: who "owns" a host?)
  3. How would the design change if it had to survive a process restart mid-crawl without re-fetching everything or violating politeness?
  4. Where would async I/O (asyncio/aiohttp) beat the thread pool , and what does it cost you in complexity — when would you not reach for it?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Anthropic•More Software Engineer•Anthropic Software Engineer•Anthropic System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.