
Design a distributed web crawler

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to design large-scale distributed systems, specifically web-crawling infrastructure. It tests competencies such as URL-frontier partitioning and deduplication, politeness and rate limiting, prioritization, retry and idempotency strategies, coordination and backpressure, storage schemas, monitoring, capacity planning, safety controls, and API/data-model design. Commonly asked in System Design interviews, it probes architectural thinking and trade-offs around scalability, heterogeneity, reliability, and operational controls, emphasizing practical system-architecture skills grounded in distributed-systems principles.


Company: Lyft

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design a web crawler that starts from a single seed URL and scales across 1,000 heterogeneous devices acting as distributed crawl workers. Cover: URL frontier partitioning and deduplication, politeness policies (robots.txt, per-host rate limiting), prioritization, retry strategy, and content deduplication; failure handling and idempotency; coordination and work assignment (e.g., consistent hashing, queues), backpressure, and exactly-once/at-least-once fetch semantics; storage schemas for fetched pages and metadata; monitoring, alerting, and debugging; estimating throughput, bandwidth, and storage; safety controls to avoid overload or legal violations. Provide APIs and data models for enqueueing, status, and results.


Related Interview Questions

  • Design a Donation Platform - Lyft (hard)
  • Design a charity donation platform - Lyft (medium)
  • Design a scalable real-time chat system - Lyft (hard)
  • Design web crawler for 1000 devices - Lyft (hard)
  • Design a scalable news feed system - Lyft (hard)
Sep 6, 2025

System Design: Distributed Web Crawler (1,000 Heterogeneous Workers)

Context

You are asked to design a production-grade web crawler that begins from a single seed URL and scales across 1,000 heterogeneous devices acting as distributed crawl workers. Devices vary in CPU, memory, network quality, and reliability. The system must be safe, polite, and resilient.

Assume: managed devices under your control; cross-Internet crawling; a long-running "campaign" that can be paused/resumed; and a need for near-real-time visibility into progress.

Requirements

Design the system covering the following:

  1. URL frontier partitioning and deduplication
  • URL canonicalization; per-host/domain sharding; preventing duplicate enqueues; revisit policies.
  2. Politeness policies
  • robots.txt parsing and caching; honoring crawl-delay and disallow rules; per-host/domain rate limiting and concurrency.
  3. Prioritization
  • How URLs are ranked (depth, freshness, host/domain fairness, heuristics).
  4. Retry strategy and content deduplication
  • Transient vs permanent errors; backoff; redirect handling; exact and near-duplicate content detection.
  5. Failure handling and idempotency
  • Worker crashes; leases; replay; idempotent writes; dead-letter queues.
  6. Coordination, work assignment, backpressure, and fetch semantics
  • Consistent hashing and/or queues; worker capability awareness; backpressure signaling; exactly-once vs at-least-once effects.
  7. Storage schemas
  • Fetched pages and blobs; metadata and link graph; robots cache; frontier persistence.
  8. Monitoring, alerting, and debugging
  • Metrics, logs, traces; per-host dashboards; sampling and replay.
  9. Capacity planning
  • Estimating throughput, bandwidth, and storage; formulas and a concrete example.
  10. Safety controls
  • Global and per-host limits; allow/block lists; legal/regulatory controls; kill-switches.
  11. External APIs and data models
  • Enqueueing new URLs; querying status; fetching results.
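For the frontier-deduplication item, a candidate might sketch canonicalization and a fingerprint-based "seen" set. This is a minimal illustration, not a complete solution: the function names (`canonicalize`, `url_fingerprint`, `try_enqueue`), the in-memory `set`, and the choice of SHA-1 truncated to 128 bits are all assumptions; a production crawler would persist the seen set (e.g. in a sharded key-value store or Bloom filter) and handle edge cases like IPv6 hosts and query-parameter ordering.

```python
from urllib.parse import urlsplit, urlunsplit
import hashlib

def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent forms dedupe to one frontier entry."""
    scheme, netloc, path, query, _frag = urlsplit(url.strip())
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = path or "/"
    # Drop the fragment entirely: it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))

def url_fingerprint(url: str) -> str:
    """Stable 128-bit fingerprint used as the 'seen' set key."""
    return hashlib.sha1(canonicalize(url).encode()).hexdigest()[:32]

seen: set[str] = set()

def try_enqueue(url: str, frontier: list[str]) -> bool:
    """Add url to the frontier unless an equivalent form was already seen."""
    fp = url_fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    frontier.append(canonicalize(url))
    return True
```

With this scheme, `HTTP://Example.com:80/a#frag` and `http://example.com/a` collapse to one frontier entry, which is the core property the interviewer is probing for.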
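For the politeness item, per-host rate limiting can be sketched as a "next allowed fetch time" table keyed by host. The class name and the 1-second default delay below are assumptions for illustration; in a real design the delay would come from robots.txt crawl-delay where present, and the table would live with whichever worker owns the host.

```python
import time
from collections import defaultdict

class HostPoliteness:
    """Per-host rate limiter: at most one request per `delay` seconds per host."""

    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay
        self.next_ok: dict[str, float] = defaultdict(float)  # host -> earliest fetch time
        self.delays: dict[str, float] = {}                   # host -> crawl-delay override

    def set_crawl_delay(self, host: str, delay: float) -> None:
        """Record a crawl-delay learned from the host's robots.txt."""
        self.delays[host] = delay

    def acquire(self, host: str, now: float | None = None) -> float:
        """Reserve the next fetch slot for `host`; return seconds to wait
        before actually fetching (0.0 means fetch immediately)."""
        now = time.monotonic() if now is None else now
        wait = max(0.0, self.next_ok[host] - now)
        delay = self.delays.get(host, self.default_delay)
        self.next_ok[host] = max(now, self.next_ok[host]) + delay
        return wait
```

Because slots are reserved at `acquire` time, multiple queued URLs for the same host serialize naturally, while different hosts proceed in parallel.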
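For the retry item, a common sketch is to classify HTTP statuses into transient vs permanent and apply exponential backoff with full jitter. The status sets, base delay, and cap below are illustrative assumptions, not a prescribed policy.

```python
import random

# Statuses treated as transient (worth retrying with backoff).
TRANSIENT = {429, 500, 502, 503, 504}

def classify(status: int) -> str:
    """Map an HTTP status to a crawl outcome."""
    if 200 <= status < 300:
        return "ok"
    if status in TRANSIENT:
        return "retry"
    return "permanent"   # e.g. 404/410: record and move on, don't retry

def next_retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter; `attempt` counts from 1."""
    return random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
```

Full jitter (uniform over the backoff window) avoids retry storms where many workers hammer a recovering host at the same instant; permanently failing URLs would go to a dead-letter queue after a bounded attempt count.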
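For the coordination item, consistent hashing of hosts onto workers gives each host a single owner (so its politeness state lives in one place) and limits reshuffling when workers join or leave. The ring below, with MD5-based placement and 100 virtual nodes per worker, is one possible sketch; the vnode count and hash choice are assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring mapping each host to one worker.
    Virtual nodes smooth out load across heterogeneous workers."""

    def __init__(self, workers, vnodes: int = 100):
        self.ring = sorted(
            (self._h(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        # First 8 bytes of MD5 as an integer: cheap, uniform placement.
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def worker_for(self, host: str) -> str:
        """Clockwise successor of the host's point on the ring."""
        i = bisect.bisect(self.keys, self._h(host)) % len(self.keys)
        return self.ring[i][1]
```

The key property: removing one worker only reassigns the hosts that worker owned; every other host keeps its owner, so politeness state and in-flight leases elsewhere are undisturbed.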

Deliverables

  • High-level architecture with core components and data flows.
  • Algorithms/policies for the items above.
  • API shapes and data models (concise JSON examples are fine).
  • Back-of-the-envelope capacity estimates with assumptions.
  • Explicit assumptions and guardrails.
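For the back-of-the-envelope deliverable, one possible worked estimate follows. Every input below (fetch rate per worker, average page size, stored fraction after dedup and compression) is an assumed number chosen for illustration; the point is the arithmetic, not the specific values.

```python
# Back-of-the-envelope capacity estimate; all inputs are assumptions.
workers = 1_000
fetches_per_worker_per_sec = 2        # modest rate on heterogeneous devices
avg_page_bytes = 100 * 1024           # ~100 KB of HTML per page
stored_fraction = 0.5                 # kept after dedup + compression

fetch_rate = workers * fetches_per_worker_per_sec            # pages/sec
pages_per_day = fetch_rate * 86_400
bandwidth_bps = fetch_rate * avg_page_bytes * 8              # bits/sec ingest
storage_per_day_bytes = pages_per_day * avg_page_bytes * stored_fraction

print(f"{fetch_rate:,} pages/s ≈ {pages_per_day:,} pages/day")
print(f"≈ {bandwidth_bps / 1e9:.1f} Gbit/s aggregate ingest")
print(f"≈ {storage_per_day_bytes / 1e12:.1f} TB/day stored")
```

Under these assumptions the fleet sustains about 2,000 pages/s (~173M pages/day), roughly 1.6 Gbit/s of aggregate ingest, and on the order of 9 TB/day of stored content; stating the assumptions and showing the formula is what the interviewer is looking for.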
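For the API-shapes deliverable, concise JSON examples are expected. The shapes below are hypothetical: every endpoint field name (`urls`, `priority`, `campaign_id`, `state`, and so on) is an assumption invented for illustration, not a documented API.

```python
import json

# Illustrative request/response shapes for enqueue and status lookup.
enqueue_request = {
    "urls": ["https://example.com/"],
    "priority": 5,            # 0 (lowest) .. 9 (highest), an assumed scale
    "campaign_id": "camp-001",
    "dedupe": True,           # reject URLs already in the seen set
}

enqueue_response = {
    "accepted": 1,
    "duplicates": 0,
    "ids": ["url-7f3a"],      # stable IDs for later status queries
}

status_response = {
    "id": "url-7f3a",
    "state": "fetched",       # queued | leased | fetched | failed | dead
    "http_status": 200,
    "attempts": 1,
    "content_sha256": "9f86d081884c7d65...",   # truncated for display
    "fetched_at": "2025-09-06T00:00:00Z",
}

print(json.dumps(enqueue_request))
```

A results endpoint would then return the blob by `content_sha256`, keeping page bodies out of the metadata store.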


