PracHub

Design a concurrent web crawler

Last updated: May 8, 2026

Quick Overview

This question evaluates your understanding of concurrent system design for building robust web crawlers: networking and parsing concerns, URL normalization and deduplication, per-host politeness and rate limiting, error handling, and observability.

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design and implement a basic web crawler that fetches pages concurrently using a thread executor. Requirements: accept one or more seed URLs; use robust URL parsing to normalize and resolve links; avoid revisiting the same normalized URL; respect robots.txt and per-host politeness (rate limiting); cap concurrency and depth; optionally restrict to same-origin. Handle redirects, HTTP errors, timeouts, and content-type filtering. Describe data structures for the frontier and visited set, duplicate detection strategy, and how you would test and monitor the crawler.

Sep 6, 2025, 12:00 AM

System Design: Concurrent Web Crawler (Threads)

Design and implement a basic web crawler that fetches pages concurrently using a thread executor (e.g. a ThreadPoolExecutor). The crawler should be production-conscious — correct, robust, and observable — while remaining reasonably simple. Where it helps, prefer standard URL parsing utilities (such as Python's urlparse/urljoin) for handling URLs.

You may present your answer as a brief design plus a code or pseudo-code sketch of the core algorithm and key components.

Input

  • Accept one or more seed URLs.
  • Support an optional flag to restrict crawling to the same origin as the seeds (matching on scheme, host, and port).
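
A minimal sketch of the same-origin check, assuming only http and https are crawled. The helper names `origin` and `same_origin` are hypothetical, not part of the question; the sketch fills in the scheme's default port so that `https://example.com/` and `https://example.com:443/` compare equal.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# Assumption: only these two schemes are in scope for the crawler.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return (scheme, host, port), substituting the scheme's default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_origin(url: str, seeds: List[str]) -> bool:
    """True if url shares scheme, host, and port with at least one seed."""
    allowed = {origin(seed) for seed in seeds}
    return origin(url) in allowed
```

Under this definition a subdomain or a scheme change (http vs. https) counts as a different origin, which is the strict reading of "scheme, host, and port".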

Crawling behavior

  • Fetch pages concurrently via a thread executor with a configurable max worker count.
  • Cap crawl depth from each seed.
  • Extract links from HTML pages and enqueue newly discovered URLs.
  • Normalize and resolve links robustly — relative links, fragments, default ports, casing, and similar cases.
  • Avoid revisiting the same normalized URL (dedup across both in-queue and already-visited URLs).
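
One possible normalization routine built on `urljoin` and `urlsplit`, shown as a sketch rather than a complete canonicalizer. The function name `normalize` and the specific policy (drop fragments and userinfo, keep query strings, treat an empty path as `/`) are assumptions chosen for illustration.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: only http and https links are crawlable.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(base: str, href: str) -> Optional[str]:
    """Resolve href against base and return a canonical URL, or None to skip it."""
    absolute = urljoin(base, href.strip())
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        return None  # skips mailto:, javascript:, and links without a host
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return None
    host = parts.hostname  # urlsplit already lowercases the hostname
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    # Omit the port when it is the scheme's default; userinfo is dropped.
    netloc = host if port in (None, DEFAULT_PORTS[scheme]) else f"{host}:{port}"
    path = parts.path or "/"
    # Keep the query (it can select a different page); drop the fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

The normalized string is what goes into the visited set, so two spellings of the same page collapse to one key. Further steps such as sorting query parameters or stripping tracking parameters are policy decisions left out of this sketch.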

Compliance and politeness

  • Respect robots.txt: apply allow/disallow rules per user-agent, cache the rules per host, and honor crawl-delay if present.
  • Enforce per-host politeness / rate limiting — e.g. at most one request per host per X seconds (configurable) — and honor Retry-After on 429 / 503 responses.
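
The two requirements above could be sketched as follows, using the standard library's `urllib.robotparser` for rule evaluation. `parse_robots`, `HostLimiter`, `reserve`, and `penalize` are hypothetical names; the limiter takes an injectable clock so it can be tested without sleeping, and it assumes the caller sleeps for the returned number of seconds before issuing the request.

```python
import threading
import time
from typing import Callable, Dict, Optional
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that was already fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class HostLimiter:
    """Hands out per-host time slots so requests to one host are spaced apart."""

    def __init__(self, default_delay: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._default = default_delay
        self._clock = clock
        self._next_ok: Dict[str, float] = {}
        self._lock = threading.Lock()

    def reserve(self, host: str, delay: Optional[float] = None) -> float:
        """Reserve the next slot for host; return how long the caller should sleep."""
        # Use the larger of the configured delay and the host's crawl-delay.
        gap = self._default if delay is None else max(delay, self._default)
        with self._lock:
            now = self._clock()
            start = max(now, self._next_ok.get(host, now))
            self._next_ok[host] = start + gap
            return start - now

    def penalize(self, host: str, seconds: float) -> None:
        """Push the host's next slot out, e.g. after a Retry-After on 429/503."""
        with self._lock:
            until = self._clock() + seconds
            self._next_ok[host] = max(self._next_ok.get(host, 0.0), until)
```

A worker would call `parser.can_fetch(user_agent, url)` before fetching, pass `parser.crawl_delay(user_agent)` as `delay`, and cache one parser per host (for example in a dict keyed by host) so robots.txt is downloaded once.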

Networking

  • Handle redirects: update to the final URL and dedup on the normalized final URL.
  • Handle HTTP errors and timeouts gracefully — do not crash, and back off when appropriate.
  • Filter by content type (e.g. only text/html by default).
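
Two small helpers illustrate the header handling involved: a content-type filter and a `Retry-After` parser. This is a sketch; the names `media_type`, `is_html`, and `retry_after_seconds` are hypothetical. `Retry-After` may carry either a number of seconds or an HTTP date, so the parser handles both and returns `None` when the value is unusable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple


def media_type(content_type: Optional[str]) -> str:
    """Return the lowercased media type without parameters, e.g. 'text/html'."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def is_html(content_type: Optional[str],
            allowed: Tuple[str, ...] = ("text/html",)) -> bool:
    """True if the Content-Type header names an allowed media type."""
    return media_type(content_type) in allowed


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse Retry-After (delta-seconds or HTTP-date) into seconds to wait."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # exception type depends on Python version
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

A missing or unparsable `Content-Type` is treated as non-HTML here; a more lenient crawler might sniff the body instead.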

Data structures and strategy

  • Describe the data structures for the frontier and the visited set.
  • Describe the duplicate-detection strategy, including how you handle enqueued vs. fetched URLs and redirects.
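
One way to answer both points is a single structure that pairs a FIFO queue with a "seen" set, where a URL is marked seen at enqueue time rather than at fetch time. The class below is a sketch under that assumption; `Frontier`, `add`, `mark_seen`, and `pop` are hypothetical names, and URLs are expected to be normalized before they reach it.

```python
import threading
from collections import deque
from typing import Deque, Optional, Set, Tuple


class Frontier:
    """FIFO of (url, depth) plus a seen set covering queued and fetched URLs."""

    def __init__(self, max_depth: int) -> None:
        self._queue: Deque[Tuple[str, int]] = deque()
        self._seen: Set[str] = set()
        self._lock = threading.Lock()
        self._max_depth = max_depth

    def add(self, url: str, depth: int) -> bool:
        """Enqueue url unless it is too deep or already seen; True if enqueued."""
        if depth > self._max_depth:
            return False
        with self._lock:
            if url in self._seen:
                return False
            # Marking at enqueue time stops two workers queuing the same URL.
            self._seen.add(url)
            self._queue.append((url, depth))
            return True

    def mark_seen(self, url: str) -> bool:
        """Record a URL without queuing it, e.g. a redirect's final URL."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

    def pop(self) -> Optional[Tuple[str, int]]:
        """Return the next (url, depth), or None if the queue is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None
```

After a redirect, a worker can call `mark_seen(final_url)`: a `False` result means the target was already queued or fetched, so its content can be skipped. For crawls too large for an in-memory set, a Bloom filter or an external key-value store is a common substitute, at the cost of false positives or extra latency.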

Testing and monitoring

  • Explain how you would test the crawler — unit, integration, concurrency, and fault-injection tests.
  • Describe what you would monitor/measure in a real run — metrics, logs, and alerts.
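
As a small illustration of both points, the sketch below defines a thread-safe counter registry and a helper that records one fetch outcome. The metric names (`fetch_total`, `status_2xx`, and so on) and the `Metrics` and `record_fetch` names are assumptions for this example, not a standard.

```python
import threading
from collections import Counter
from typing import Dict


class Metrics:
    """Thread-safe named counters that workers can update concurrently."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._lock = threading.Lock()

    def incr(self, name: str, n: int = 1) -> None:
        """Add n to the named counter."""
        with self._lock:
            self._counts[name] += n

    def snapshot(self) -> Dict[str, int]:
        """Return a copy of all counters for logging or export."""
        with self._lock:
            return dict(self._counts)


def record_fetch(metrics: Metrics, status: int) -> None:
    """Count one fetch and bucket it by HTTP status class (2xx, 4xx, ...)."""
    metrics.incr("fetch_total")
    metrics.incr(f"status_{status // 100}xx")
```

A concurrency test can then drive `record_fetch` from a `ThreadPoolExecutor` with a known mix of status codes and assert that `snapshot()` matches the expected totals exactly; lost updates would show up as a shortfall. The same pattern extends to fault injection by swapping the real fetcher for a fake that raises timeouts or returns 503 on chosen URLs.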

Deliverables

  • A brief architecture description and rationale.
  • The core algorithm and key components (pseudo-code or a code sketch is fine).
  • A clear description of the data structures and dedup logic.
  • A testing strategy and a monitoring plan.
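
For the core algorithm, one possible shape is a single coordinator thread that owns the dedup state and hands fetches to a `ThreadPoolExecutor`. The sketch below assumes a caller-supplied `fetch(url)` that returns the final URL after redirects plus a list of already-normalized links, and raises on errors or timeouts; robots.txt checks, politeness delays, and content-type filtering are left out to keep the loop visible. The name `crawl` and its parameters are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Assumed fetch contract: (final URL after redirects, normalized outgoing links).
FetchResult = Tuple[str, List[str]]


def crawl(seeds: Iterable[str], fetch: Callable[[str], FetchResult],
          max_workers: int = 8, max_depth: int = 2) -> Set[str]:
    """Crawl from seeds up to max_depth; return the final URLs fetched."""
    seen: Set[str] = set(seeds)   # queued or fetched; touched only by this thread
    fetched: Set[str] = set()     # final URLs whose content was processed
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending: Dict[Future, Tuple[str, int]] = {
            pool.submit(fetch, url): (url, 0) for url in seen
        }
        while pending:
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                url, depth = pending.pop(future)
                try:
                    final_url, links = future.result()
                except Exception:
                    continue  # a real crawler would log, count, and maybe retry
                seen.add(final_url)
                if final_url in fetched:
                    continue  # another URL already redirected to this page
                fetched.add(final_url)
                if depth >= max_depth:
                    continue  # depth cap reached: do not expand links
                for link in links:
                    if link not in seen:
                        seen.add(link)  # mark at enqueue time to avoid duplicates
                        pending[pool.submit(fetch, link)] = (link, depth + 1)
    return fetched
```

Because only the coordinator mutates `seen` and `fetched`, no lock is needed around them; the worker count caps concurrency while the executor's internal queue holds the rest of the frontier. A production version would bound that queue and insert the per-host delay before each submit.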
