The question evaluates a candidate's ability to design large-scale distributed systems and web-crawling infrastructure. It tests competencies such as URL frontier partitioning and deduplication, politeness and rate limiting, prioritization, retry and idempotency strategies, coordination and backpressure, storage schemas, monitoring, capacity planning, safety controls, and API/data-model design. Commonly asked in System Design interviews, it probes architectural thinking and trade-offs around scalability, heterogeneity, reliability, and operational controls; it primarily tests practical application and system-architecture skills while requiring a conceptual understanding of distributed-systems principles.
You are asked to design a production-grade web crawler that begins from a single seed URL and scales across 1,000 heterogeneous devices acting as distributed crawl workers. Devices vary in CPU, memory, network quality, and reliability. The system must be safe, polite, and resilient.
Assume: managed devices under your control; cross-Internet crawling; a long-running "campaign" that can be paused/resumed; and a need for near-real-time visibility into progress.
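The pause/resume and near-real-time visibility requirements above imply a small amount of coordinator-side campaign state. Below is a minimal sketch of what that state could look like, assuming Python and hypothetical names (Campaign, CampaignState, ProgressSnapshot) that are not part of the original prompt.

```python
from dataclasses import dataclass, field
from enum import Enum
from time import time

# Hypothetical campaign model for pause/resume and progress visibility.
# All names here are illustrative assumptions, not given in the prompt.

class CampaignState(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"

@dataclass
class ProgressSnapshot:
    """Point-in-time counters a coordinator could expose for near-real-time visibility."""
    urls_discovered: int = 0
    urls_fetched: int = 0
    urls_failed: int = 0
    active_workers: int = 0
    captured_at: float = field(default_factory=time)

@dataclass
class Campaign:
    campaign_id: str
    seed_url: str
    state: CampaignState = CampaignState.RUNNING
    progress: ProgressSnapshot = field(default_factory=ProgressSnapshot)

    def pause(self) -> None:
        # Pausing only flips coordinator state; workers would drain
        # in-flight fetches and stop pulling new URLs from the frontier.
        self.state = CampaignState.PAUSED

    def resume(self) -> None:
        self.state = CampaignState.RUNNING
```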
Design the system covering the following areas: URL frontier partitioning and deduplication; politeness and rate limiting; prioritization; retry and idempotency; coordination and backpressure; storage schemas; monitoring and capacity planning; safety controls; and API/data-model design.
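As one illustration of the first two areas, the sketch below partitions the URL frontier by hashing each URL's host, so that all URLs for a given host land on the same worker and per-host politeness state stays local to that worker. The names, the SHA-1 hash, the worker count constant, and the 1-second default delay are assumptions for illustration, not requirements from the prompt.

```python
import hashlib
import time
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative constant; the prompt specifies 1,000 heterogeneous devices.
NUM_WORKERS = 1000

def worker_for_url(url: str) -> int:
    """Route every URL for the same host to the same worker partition."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

class PolitenessGate:
    """Tracks the last fetch time per host and enforces a minimum delay between fetches."""

    def __init__(self, min_delay_seconds: float = 1.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = defaultdict(float)  # host -> last fetch timestamp

    def wait_time(self, url: str) -> float:
        """Seconds the caller should wait before fetching this URL's host again."""
        host = urlparse(url).netloc.lower()
        elapsed = time.time() - self.last_fetch[host]
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, url: str) -> None:
        self.last_fetch[urlparse(url).netloc.lower()] = time.time()
```

Hashing by host keeps politeness enforcement simple, but very large hosts can become hot partitions, which is one of the heterogeneity and load-balancing trade-offs the question is designed to surface.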