PracHub

Design a concurrent web crawler

Last updated: May 8, 2026

Quick Overview

This question tests your understanding of concurrent system design as it applies to web crawlers: networking and parsing concerns, URL normalization and deduplication, per-host politeness and rate limiting, and the error handling and observability needed to keep a crawler robust.

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design and implement a basic web crawler that fetches pages concurrently using a thread executor. Requirements: accept one or more seed URLs; use robust URL parsing to normalize and resolve links; avoid revisiting the same normalized URL; respect robots.txt and per-host politeness (rate limiting); cap concurrency and depth; optionally restrict to same-origin. Handle redirects, HTTP errors, timeouts, and content-type filtering. Describe data structures for the frontier and visited set, duplicate detection strategy, and how you would test and monitor the crawler.
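
As one possible illustration of the robots.txt and per-host politeness requirement, the sketch below combines Python's standard `urllib.robotparser` with a small throttle that reserves time slots per host. The names `parse_robots`, `HostThrottle`, and `reserve`, the `demo-bot` user agent, and the `a.test` host are hypothetical choices for this example. Fetching and caching robots.txt per host and honoring Retry-After are left out to keep it short.

```python
import threading
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Build a robots.txt parser from already-downloaded text (no network I/O)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class HostThrottle:
    """Hands out per-host start times so requests to one host stay spaced apart.

    Hypothetical helper: callers sleep for the returned number of seconds
    before sending the request. The clock is injectable to simplify testing.
    """

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self._min_delay = min_delay
        self._clock = clock
        self._lock = threading.Lock()
        self._next_slot = {}  # host -> earliest time the next request may start

    def reserve(self, url, delay=None):
        """Reserve the next slot for url's host; return how long to wait."""
        host = urlsplit(url).netloc.lower()
        # A robots.txt crawl-delay may lengthen the gap but never shorten it.
        gap = self._min_delay if delay is None else max(delay, self._min_delay)
        with self._lock:
            now = self._clock()
            start = max(now, self._next_slot.get(host, now))
            self._next_slot[host] = start + gap
            return start - now


if __name__ == "__main__":
    robots = parse_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
    throttle = HostThrottle(min_delay=1.0)
    url = "http://a.test/public"
    if robots.can_fetch("demo-bot", url):
        wait_s = throttle.reserve(url, delay=robots.crawl_delay("demo-bot"))
        time.sleep(wait_s)  # a real worker would issue the request after this
```

Reserving a slot under the lock and sleeping outside it keeps worker threads from serializing on one another while still spacing out requests to the same host.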

Sep 6, 2025, 12:00 AM

System Design: Concurrent Web Crawler (Threads)

You are asked to design and implement a basic web crawler that fetches pages concurrently using a thread executor. The crawler should be production-conscious (correctness, robustness, and observability) while remaining reasonably simple.

Requirements

  • Input
    1. Accept one or more seed URLs.
    2. Optional flag to restrict crawling to the same origin as the seeds (scheme, host, port).
  • Crawling behavior
    1. Fetch pages concurrently using a thread executor with a configurable max worker count.
    2. Cap crawl depth from each seed.
    3. Extract links from HTML pages and enqueue newly discovered URLs.
    4. Normalize and resolve links robustly (relative links, fragments, default ports, casing, etc.).
    5. Avoid revisiting the same normalized URL (deduplicate across both queued and already-visited URLs).
  • Compliance and politeness
    1. Respect robots.txt (allow/disallow rules per user-agent; cache per host; honor crawl-delay if present).
    2. Per-host politeness/rate limiting (e.g., at most one request per host every X seconds, configurable; honor Retry-After on 429/503).
  • Networking
    1. Handle redirects (update to final URL; dedup on the normalized final URL).
    2. Handle HTTP errors and timeouts gracefully (do not crash; back off when appropriate).
    3. Filter by content type (e.g., only text/html by default).
  • Data structures and strategy
    1. Describe the frontier and visited set data structures.
    2. Describe the duplicate detection strategy (including enqueued vs. fetched URLs and redirects).
  • Testing and monitoring
    1. Explain how you would test the crawler (unit, integration, concurrency, and fault-injection tests).
    2. Describe what you would monitor/measure in a real run (metrics, logs, alerts).
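
The normalization and dedup requirements above could be approached roughly as follows. This is a minimal sketch using only `urllib.parse`; `normalize_url`, `SeenSet`, and `add_if_new` are hypothetical names, and the chosen canonical form (lowercased scheme and host, default port dropped, fragment removed, empty path mapped to `/`, userinfo discarded) is one reasonable assumption rather than a complete canonicalization.

```python
import threading
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(base: str, href: str) -> Optional[str]:
    """Resolve href against base and return a canonical URL, or None if not crawlable."""
    absolute = urljoin(base, href.strip())  # handles relative paths and ../ segments
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        return None  # skips mailto:, javascript:, and similar links
    try:
        port = parts.port
    except ValueError:
        return None  # malformed port
    host = parts.hostname.lower()
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    # Drop the port when it is the scheme default; userinfo is discarded.
    netloc = host if port in (None, DEFAULT_PORTS[scheme]) else f"{host}:{port}"
    path = parts.path or "/"
    # Keep the query string, drop the fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenSet:
    """Thread-safe set covering both queued and fetched URLs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Atomically record url; return True only the first time it is seen."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True
```

Marking a URL as seen at enqueue time, rather than at fetch time, means two workers that discover the same link concurrently cannot both queue it. After a redirect, the normalized final URL would be passed through `add_if_new` as well.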

Deliverables

  • A brief architecture description and rationale.
  • Core algorithm and key components (pseudo-code or code sketch is fine).
  • Clear description of data structures and dedup logic.
  • Testing strategy and monitoring plan.
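
For the core algorithm deliverable, a breadth-first coordinator loop over a thread pool might look like the sketch below. It assumes an injected `fetch(url)` callable returning `(final_url, content_type, body)` and raising on failure; that signature, along with `crawl` and `LinkParser`, is hypothetical. Robots.txt checks, per-host rate limiting, and full URL normalization are deliberately omitted, and links are parsed on the coordinator thread for simplicity.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl(seeds, fetch, max_workers=4, max_depth=2):
    """Crawl breadth-first; return ({final_url: depth}, {url: error})."""
    frontier = deque((url, 0) for url in seeds)  # FIFO queue of (url, depth)
    seen = set(seeds)  # covers queued and fetched URLs
    results, errors, in_flight = {}, {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier or in_flight:
            # Keep at most max_workers requests outstanding.
            while frontier and len(in_flight) < max_workers:
                url, depth = frontier.popleft()
                in_flight[pool.submit(fetch, url)] = (url, depth)
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                url, depth = in_flight.pop(future)
                try:
                    final_url, content_type, body = future.result()
                except Exception as exc:  # timeouts, HTTP errors, etc.
                    errors[url] = repr(exc)
                    continue
                seen.add(final_url)
                if final_url in results:
                    continue  # redirect landed on a page already fetched
                results[final_url] = depth
                if depth >= max_depth or not content_type.startswith("text/html"):
                    continue
                parser = LinkParser()
                parser.feed(body)
                for href in parser.links:
                    link = urldefrag(urljoin(final_url, href)).url
                    if link.startswith(("http://", "https://")) and link not in seen:
                        seen.add(link)
                        frontier.append((link, depth + 1))
    return results, errors
```

Because only the coordinator thread touches the frontier and the seen set, no locks are needed in this sketch. Injecting `fetch` also lets tests run against an in-memory site without any network access.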
