System Design: Distributed Web Crawler (1,000 Heterogeneous Workers)
Context
You are asked to design a production-grade web crawler that begins from a single seed URL and scales across 1,000 heterogeneous devices acting as distributed crawl workers. Devices vary in CPU, memory, network quality, and reliability. The system must be safe, polite, and resilient.
Assume: managed devices under your control; cross-Internet crawling; a long-running "campaign" that can be paused/resumed; and a need for near-real-time visibility into progress.
Requirements
Design the system covering the following:
- URL frontier partitioning and deduplication
  - URL canonicalization; per-host/domain sharding; preventing duplicate enqueues; revisit policies (sketch 1 below).
- Politeness policies
  - robots.txt parsing and caching; honoring crawl-delay and disallow rules; per-host/domain rate limiting and concurrency (sketch 2 below).
- Prioritization
  - How URLs are ranked (depth, freshness, host/domain fairness, heuristics); see sketch 3 below.
- Retry strategy and content deduplication
  - Transient vs. permanent errors; backoff; redirect handling; exact and near-duplicate content detection (sketch 4 below).
- Failure handling and idempotency
  - Worker crashes; leases; replay; idempotent writes; dead-letter queues (sketch 5 below).
- Coordination, work assignment, backpressure, and fetch semantics
  - Consistent hashing and/or queues; worker capability awareness; backpressure signaling; exactly-once vs. at-least-once effects (sketch 6 below).
- Storage schemas
  - Fetched pages and blobs; metadata and link graph; robots cache; frontier persistence (sketch 7 below).
- Monitoring, alerting, and debugging
  - Metrics, logs, traces; per-host dashboards; sampling and replay (sketch 8 below).
- Capacity planning
  - Estimating throughput, bandwidth, and storage; formulas and a concrete example (sketch 9 below).
- Safety controls
  - Global and per-host limits; allow/block lists; legal/regulatory controls; kill-switches (sketch 10 below).
- External APIs and data models
  - Enqueueing new URLs; querying status; fetching results (sketch 11 below).
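Sketch 1: one possible shape for frontier keys, in Python. Canonicalizing before enqueue is what makes deduplication meaningful; sharding by host keeps all politeness state for a host on one owner. `NUM_SHARDS` and the SHA-1 choice are illustrative assumptions.

```python
# Sketch 1: URL canonicalization and per-host frontier sharding (illustrative).
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

NUM_SHARDS = 1024  # assumed shard count; tune to fleet size

def canonicalize(url: str) -> str:
    """Normalize a URL so syntactic variants dedupe to one frontier key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower().rstrip(".")
    # Drop default ports and fragments; sort query params for a stable form.
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))

def shard_for(url: str) -> int:
    """Shard by host so one worker owns all politeness state for that host."""
    host = urlsplit(canonicalize(url)).hostname or ""
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

print(canonicalize("HTTP://Example.com:80/a?b=2&a=1#frag"))
# -> http://example.com/a?a=1&b=2
print(shard_for("http://example.com/a"))
```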
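Sketch 2: a per-host politeness gate, assuming robots.txt has already been fetched and parsed elsewhere. `DEFAULT_DELAY_S` and one-connection-per-host are assumed defaults, not prescribed values.

```python
# Sketch 2: per-host crawl-delay and concurrency gate (illustrative defaults).
import time
from dataclasses import dataclass

DEFAULT_DELAY_S = 1.0  # assumed floor when robots.txt gives no crawl-delay

@dataclass
class HostPoliteness:
    crawl_delay_s: float = DEFAULT_DELAY_S
    next_allowed_at: float = 0.0   # monotonic timestamp of next allowed fetch
    in_flight: int = 0
    max_concurrency: int = 1       # one connection per host by default

    def try_acquire(self, now: float) -> bool:
        """Return True if a fetch to this host may start right now."""
        if self.in_flight >= self.max_concurrency or now < self.next_allowed_at:
            return False
        self.in_flight += 1
        return True

    def release(self, now: float) -> None:
        """Record fetch completion and schedule the next allowed slot."""
        self.in_flight -= 1
        self.next_allowed_at = now + self.crawl_delay_s

hosts: dict[str, HostPoliteness] = {}
h = hosts.setdefault("example.com", HostPoliteness(crawl_delay_s=2.0))
now = time.monotonic()
if h.try_acquire(now):
    # ... perform the fetch, then release to start the delay clock ...
    h.release(time.monotonic())
```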
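Sketch 3: one way to combine the ranking signals into a single score. The features and weights here are assumptions chosen to illustrate the trade-offs, not a prescribed formula.

```python
# Sketch 3: weighted priority score; higher scores are fetched sooner.
def priority(depth: int, hours_since_seen: float | None,
             host_backlog: int, hint: float = 0.0) -> float:
    """Combine depth decay, freshness pressure, host fairness (penalize
    hosts with deep backlogs), and a heuristic hint (e.g., sitemap bonus)."""
    depth_term = 1.0 / (1 + depth)            # shallow pages first
    fresh_term = 0.0 if hours_since_seen is None else min(hours_since_seen / 24, 1.0)
    fairness_penalty = min(host_backlog / 1000, 1.0)
    return 2.0 * depth_term + 1.0 * fresh_term - 1.5 * fairness_penalty + hint

# A shallow URL on a quiet host beats a deep, stale URL on a congested host:
print(priority(depth=1, hours_since_seen=None, host_backlog=10))    # ~0.99
print(priority(depth=6, hours_since_seen=48.0, host_backlog=5000))  # ~-0.21
```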
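Sketch 4: transient-vs-permanent classification with capped exponential backoff and full jitter. The status groupings, base/cap values, and attempt limit are assumptions; content dedup (exact SHA-256 plus a near-dup signature such as simhash) would sit alongside this and is omitted here.

```python
# Sketch 4: retry decision for a failed fetch (illustrative status groupings).
import random

TRANSIENT = {429, 500, 502, 503, 504}   # retry with backoff
PERMANENT = {400, 401, 403, 404, 410}   # record and drop (no retry)
BASE_S, CAP_S, MAX_ATTEMPTS = 2.0, 300.0, 5

def next_action(status: int | None, attempt: int) -> tuple[str, float]:
    """Return ("retry", delay_s), ("give_up", 0.0), or ("drop", 0.0).
    A None status stands for a network-level error (timeout, reset)."""
    if status in PERMANENT:
        return ("drop", 0.0)
    if status is None or status in TRANSIENT:
        if attempt >= MAX_ATTEMPTS:
            return ("give_up", 0.0)     # route to dead-letter queue
        delay = random.uniform(0, min(CAP_S, BASE_S * 2 ** attempt))
        return ("retry", delay)
    return ("drop", 0.0)                # anything unrecognized: treat as permanent

print(next_action(503, attempt=2))      # e.g. ("retry", somewhere in 0-8s)
```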
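Sketch 5: lease-based claims with idempotent completion, using in-memory dicts as a stand-in for what would be a transactional store. `LEASE_TTL_S` and all names are illustrative; the point is that a crashed worker's lease simply expires and the URL becomes claimable again.

```python
# Sketch 5: leases + idempotent completion (in-memory stand-in).
import time
import uuid

LEASE_TTL_S = 120.0   # assumed lease length; crashed workers expire silently

leases: dict[str, tuple[str, float]] = {}   # url -> (worker_id, expires_at)
done: set[str] = set()                      # completed fetch keys

def claim(url: str, worker_id: str, now: float) -> bool:
    """Claim a URL unless it is completed or leased by a live worker."""
    if url in done:
        return False
    holder = leases.get(url)
    if holder and holder[1] > now:
        return False                        # someone else holds a live lease
    leases[url] = (worker_id, now + LEASE_TTL_S)
    return True

def complete(url: str, worker_id: str) -> None:
    """Idempotent: a replayed or duplicate ack is a harmless no-op."""
    holder = leases.get(url)
    if holder is None or holder[0] == worker_id:
        done.add(url)
        leases.pop(url, None)

w = str(uuid.uuid4())
now = time.monotonic()
assert claim("http://example.com/", w, now)
assert not claim("http://example.com/", "other-worker", now)  # lease held
complete("http://example.com/", w)
complete("http://example.com/", w)          # replayed ack is a no-op
```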
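Sketch 6: a minimal consistent-hash ring with virtual nodes, so host ownership moves minimally when workers join or leave. The vnode count and SHA-1 hashing are assumptions; capability-aware assignment could weight vnodes per worker.

```python
# Sketch 6: consistent hashing for host -> worker assignment (illustrative).
import bisect
import hashlib

class Ring:
    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._keys: list[int] = []          # sorted vnode hashes
        self._owner: dict[int, str] = {}    # vnode hash -> worker id

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def add(self, worker: str) -> None:
        for i in range(self.vnodes):
            k = self._h(f"{worker}#{i}")
            bisect.insort(self._keys, k)
            self._owner[k] = worker

    def remove(self, worker: str) -> None:
        for i in range(self.vnodes):
            k = self._h(f"{worker}#{i}")
            self._keys.remove(k)
            del self._owner[k]

    def owner_of(self, host: str) -> str:
        """Walk clockwise to the first vnode at or after the host's hash."""
        idx = bisect.bisect(self._keys, self._h(host)) % len(self._keys)
        return self._owner[self._keys[idx]]

ring = Ring()
for w in ("worker-1", "worker-2", "worker-3"):
    ring.add(w)
print(ring.owner_of("example.com"))   # stable until membership changes
```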
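Sketch 7: illustrative record shapes for the four stores, shown as Python dicts (the JSON forms would match). Every field name is an assumption; the design point is content-addressed blobs in object storage with metadata rows alongside.

```python
# Sketch 7: example rows for page metadata, link graph, robots cache, frontier.
page_meta = {
    "url": "http://example.com/a?a=1&b=2",    # canonical form, primary key
    "fetched_at": "2024-01-01T00:00:00Z",
    "status": 200,
    "content_sha256": "9f86d0...",            # exact-dup key (truncated example)
    "simhash64": "0x8c3f...",                 # near-dup signature (truncated example)
    "blob_ref": "s3://crawl-bucket/9f/86/9f86d0...",  # content-addressed blob
    "depth": 2,
    "redirects": ["http://example.com/old"],
}
link_edge = {"src": "http://example.com/a", "dst": "http://example.org/b",
             "anchor": "docs", "seen_at": "2024-01-01T00:00:00Z"}
robots_cache = {"host": "example.com", "fetched_at": "2024-01-01T00:00:00Z",
                "ttl_s": 86400, "crawl_delay_s": 2.0,
                "disallow": ["/private/", "/tmp/"]}
frontier_row = {"url": "http://example.org/b", "shard": 417, "priority": 0.93,
                "state": "queued",   # queued | leased | done | dead
                "attempts": 0, "lease_expires_at": None}
```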
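Sketch 8: one structured event per fetch is enough to drive per-host dashboards, sampling, and replay of a single URL's history. Field names here are assumptions.

```python
# Sketch 8: an illustrative structured fetch event (one JSON line per fetch).
import json
import time

def fetch_event(url: str, host: str, status: int, bytes_read: int,
                latency_ms: int, worker_id: str, trace_id: str) -> dict:
    return {
        "ts": time.time(), "event": "fetch",
        "url": url, "host": host,          # host is the per-host aggregation key
        "status": status, "bytes": bytes_read,
        "latency_ms": latency_ms,
        "worker": worker_id, "trace_id": trace_id,  # joins logs across hops
    }

print(json.dumps(fetch_event("http://example.com/a", "example.com",
                             200, 15_360, 182, "worker-7", "tr-42")))
```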
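Sketch 9: the kind of back-of-the-envelope arithmetic the capacity item asks for. Every input number below is an assumption chosen to make the formulas concrete.

```python
# Sketch 9: throughput, bandwidth, and storage estimates (assumed inputs).
workers = 1_000
fetches_per_worker_s = 2      # assumed effective sustained rate per device
avg_page_kb = 100             # assumed average fetched size
campaign_days = 30
dedup_keep_ratio = 0.7        # assumed fraction of fetches kept after dedup

pages_per_s = workers * fetches_per_worker_s              # 2,000 pages/s
bandwidth_mb_s = pages_per_s * avg_page_kb / 1024         # ~195 MB/s fleetwide
pages_total = pages_per_s * 86_400 * campaign_days        # ~5.2 B fetches
storage_tb = pages_total * dedup_keep_ratio * avg_page_kb / 1024**3  # ~338 TB

print(f"{pages_per_s:,} pages/s, {bandwidth_mb_s:.0f} MB/s, "
      f"{pages_total:,} fetches, ~{storage_tb:.0f} TB stored")
```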
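Sketch 10: a pre-fetch guard combining allow/block lists with a global kill-switch. The flag source (a plain set and bool here) is an assumption; in practice it would be a fast, centrally updated control-plane value every worker polls.

```python
# Sketch 10: allow/block lists plus a global kill-switch (illustrative).
BLOCKED_HOSTS = {"do-not-crawl.example"}
ALLOWED_SUFFIXES = (".com", ".org")   # an empty tuple would mean "allow all"
kill_switch = False                   # flip to halt all new fetches at once

def may_fetch(host: str) -> bool:
    if kill_switch or host in BLOCKED_HOSTS:
        return False
    return not ALLOWED_SUFFIXES or host.endswith(ALLOWED_SUFFIXES)

print(may_fetch("example.com"))           # True
print(may_fetch("do-not-crawl.example"))  # False
```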
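Sketch 11: illustrative API shapes for the three operations, shown as Python dicts (the JSON bodies would match). Endpoint paths and field names are assumptions.

```python
# Sketch 11: enqueue, status, and result shapes (hypothetical endpoints).
enqueue_request = {            # POST /v1/urls
    "urls": ["http://example.com/a"],
    "priority_hint": 0.5,
    "campaign_id": "camp-2024-01",
}
enqueue_response = {"accepted": 1, "duplicates": 0, "rejected": []}

status_response = {            # GET /v1/urls/status?url=...
    "url": "http://example.com/a",
    "state": "done",           # queued | leased | done | dead
    "attempts": 1,
    "last_status": 200,
    "fetched_at": "2024-01-01T00:00:00Z",
}

result_response = {            # GET /v1/urls/result?url=...
    "url": "http://example.com/a",
    "content_sha256": "9f86d0...",
    "blob_url": "https://crawl-store.example/blobs/9f86d0...",  # signed URL
    "links_found": 42,
}
```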
Deliverables
- High-level architecture with core components and data flows.
- Algorithms/policies for the items above.
- API shapes and data models (concise JSON examples are fine).
- Back-of-the-envelope capacity estimates with assumptions.
- Explicit assumptions and guardrails.