PracHub

Design a scalable web crawler

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of scalable web crawler architecture, distributed systems concepts, URL and content deduplication, scheduling and prioritization, storage and metadata design, and operational concerns such as politeness, rate limiting, DNS/connection management, and fault tolerance.

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design a scalable web crawler that discovers and downloads web pages across the public internet. Specify the architecture (URL frontier, fetchers, parsers, storage), how you respect robots.txt and crawl-delay, how you deduplicate URLs and content, and how you prioritize and schedule crawling. Follow-up: extend the crawler to use multithreading and/or multiple machines. Explain concurrency controls, per-host rate limiting, back-pressure, fault tolerance, and whether processing is exactly-once or at-least-once.

Jul 26, 2025, 12:00 AM

System Design: Scalable Web Crawler

Context

Design a production-ready web crawler that discovers and downloads publicly accessible web pages at internet scale. Your design should support continual discovery, politeness (respect for publishers), and high throughput while avoiding duplicates and crawler traps.

Assume we start with a list of seed URLs and aim to crawl and recrawl billions of pages over time. The crawler should be modular so it can run on a single machine for small jobs and scale out to a distributed cluster.
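
To make the politeness goal above concrete, here is a minimal sketch of a robots.txt check using Python's standard `urllib.robotparser`. The robots.txt body, the `ExampleCrawler` user agent, and the helper names `load_rules`, `is_allowed`, and `delay_seconds` are illustrative assumptions, not part of the question. A real crawler would download and cache the file per host rather than hard-code it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch and cache this per host.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "ExampleCrawler"  # assumed crawler name, not taken from the question


def load_rules(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """True if the rules permit our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def delay_seconds(parser: RobotFileParser, default: float = 1.0) -> float:
    """Crawl-delay for our user agent, or an assumed default when none is set."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = load_rules(ROBOTS_TXT)
print(is_allowed(rules, "https://example.com/public/page"))
print(is_allowed(rules, "https://example.com/private/a"))
print(delay_seconds(rules))
```

The one-second fallback delay is an arbitrary choice for the sketch; a production crawler would likely tune it per host based on observed latency and error rates.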

Requirements

  1. Architecture
    • Define core components: URL frontier, fetchers, parsers, storage, metadata/indexing, and coordination.
    • Include DNS resolution, connection management, and content-type handling.
  2. Robots and Politeness
    • How to fetch and cache robots.txt; obey user-agent rules and crawl-delay directives.
    • Per-host/per-domain rate limiting and connection concurrency.
  3. Deduplication
    • URL deduplication via canonicalization and a global "seen" structure.
    • Content deduplication (exact and near-duplicate pages).
  4. Prioritization and Scheduling
    • How to prioritize which URLs to crawl next (e.g., depth, quality, freshness, domain budgets).
    • Recrawl scheduling for freshness.
  5. Storage and Metadata
    • Where to store raw content (blobs) and structured metadata (fetch status, fingerprints, link graph, robots cache).
  6. Scale and Throughput Targets
    • Make reasonable assumptions (e.g., initial 100M URLs, ~10k fetches/sec target) and reflect them in your choices.
  7. Follow-up: Concurrency and Distribution
    • Extend to multithreaded and multi-machine operation.
    • Explain: concurrency controls, per-host rate limiting, back-pressure, fault tolerance, and processing semantics (exactly-once vs at-least-once).
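
As one possible illustration of requirement 3, the sketch below canonicalizes URLs, tracks a "seen" set of URL fingerprints, and computes an exact-content fingerprint. The normalization rules, the `TRACKING_PARAMS` list, and the names `canonicalize`, `SeenUrls`, and `content_fingerprint` are assumptions made for this example. The in-memory set stands in for whatever sharded store or Bloom filter a real deployment would use, and near-duplicate detection (for example SimHash or MinHash) is not shown.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}  # assumed list


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path or "/"
    # Sort query parameters and drop the assumed tracking parameters.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    ))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment removed


class SeenUrls:
    """In-memory stand-in for a global 'seen' structure."""

    def __init__(self) -> None:
        self._fingerprints: set[bytes] = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only the first time it is seen."""
        digest = hashlib.sha256(canonicalize(url).encode("utf-8")).digest()
        fingerprint = digest[:8]  # truncated to save space; collisions are possible
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate fingerprint of a fetched page body."""
    return hashlib.sha256(body).hexdigest()


seen = SeenUrls()
print(seen.add_if_new("HTTP://Example.com:80/a?b=2&a=1#top"))
print(seen.add_if_new("http://example.com/a?a=1&b=2"))
```

Sorting query parameters and stripping tracking parameters can merge URLs that a site treats as distinct, so these rules are a trade-off rather than a safe default for every domain.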

Deliverables: A clear architecture, key data structures and algorithms, scheduling logic, and operational considerations.
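
For the scheduling logic, one common shape is a frontier with a priority queue per host plus a heap of hosts ordered by when each may next be contacted. The single-process sketch below assumes a fixed per-host delay and passes the current time in explicitly so the behavior is deterministic. The class name `PoliteFrontier` and its methods are hypothetical, and priorities here only order URLs within a host, not across hosts.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host priority queues gated by a per-host minimum delay."""

    def __init__(self, default_delay: float = 1.0) -> None:
        self.default_delay = default_delay
        self._queues = defaultdict(list)  # host -> heap of (priority, seq, url)
        self._ready = []                  # heap of (next_allowed_time, host)
        self._scheduled = set()           # hosts that currently sit in _ready
        self._next_allowed = {}           # host -> earliest time of next fetch
        self._seq = 0                     # tie-breaker to keep insertion order

    def add(self, url: str, priority: int = 0, now: float = 0.0) -> None:
        """Queue a URL; a lower priority number is served first within its host."""
        host = urlsplit(url).hostname or ""
        heapq.heappush(self._queues[host], (priority, self._seq, url))
        self._seq += 1
        if host not in self._scheduled:
            ready_at = max(now, self._next_allowed.get(host, now))
            heapq.heappush(self._ready, (ready_at, host))
            self._scheduled.add(host)

    def next_url(self, now: float):
        """Return a URL from a host whose delay has elapsed, or None."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, host = heapq.heappop(self._ready)
        _, _, url = heapq.heappop(self._queues[host])
        self._next_allowed[host] = now + self.default_delay
        if self._queues[host]:
            # More work for this host: make it eligible again after the delay.
            heapq.heappush(self._ready, (self._next_allowed[host], host))
        else:
            del self._queues[host]
            self._scheduled.discard(host)
        return url


frontier = PoliteFrontier(default_delay=2.0)
frontier.add("https://a.example/1", priority=5)
frontier.add("https://a.example/2", priority=1)
frontier.add("https://b.example/1", priority=3)
print(frontier.next_url(now=0.0))
print(frontier.next_url(now=0.0))
print(frontier.next_url(now=1.0))
```

In a distributed setting, one way to keep this logic unchanged is to partition hosts across workers (for example by hashing the hostname) so that each host's queue and delay state live on exactly one machine.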
