Design a concurrent web crawler
Company: Anthropic
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Technical Screen
Design and implement a web crawler. First, build a single‑threaded version that, given a set of seed URLs, fetches pages, extracts links, deduplicates visits, respects robots.txt, stops at a configurable depth, and stays within a configurable domain scope. Next, extend it to a concurrent version and compare three approaches:
(a) manual multithreading using queues and locks,
(b) a fixed‑size thread pool, and
(c) an asyncio/event‑loop model. For each approach, explain and implement how you: maintain a URL frontier, enforce per‑host politeness and rate limits, avoid duplicate fetches, handle failures/retries and backoff, manage back‑pressure, and perform graceful shutdown. Discuss data structures (e.g., visited sets, frontier queues, per‑host buckets), synchronization primitives, and mechanisms to prevent deadlocks/starvation. Analyze time/space complexity and evaluate correctness, liveness, and performance trade‑offs across the three concurrency models.
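A minimal single‑threaded baseline might look like the following sketch. Here `fetch(url)` is a caller‑supplied stand‑in that returns a page's outgoing links; a real crawler would download and parse the page and consult robots.txt before fetching (both omitted here for brevity):

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, fetch, max_depth, allowed_hosts=None):
    """Breadth-first crawl to a depth limit, with dedup and domain scope.

    `fetch(url)` returns the list of outgoing links for `url`.
    """
    visited = set()
    frontier = deque((url, 0) for url in seeds)  # (url, depth) pairs
    order = []
    while frontier:
        url, depth = frontier.popleft()
        if url in visited or depth > max_depth:
            continue  # deduplicate visits / enforce depth limit
        host = urlparse(url).netloc
        if allowed_hosts is not None and host not in allowed_hosts:
            continue  # enforce domain scope
        visited.add(url)
        order.append(url)
        for link in fetch(url):
            if link not in visited:
                frontier.append((link, depth + 1))
    return order
```

A `deque` frontier gives BFS order, which bounds how deep the crawl wanders before exhausting a level; the visited set makes dedup O(1) per URL, so the whole crawl is O(V + E) over the link graph.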
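For approach (a), one possible sketch uses a shared `queue.Queue` as the frontier, a lock‑guarded visited set, and poison pills for graceful shutdown. `fetch` is again a caller‑supplied stand‑in; depth limits, politeness, and real retry/backoff are omitted to keep the synchronization visible:

```python
import queue
import threading

def threaded_crawl(seeds, fetch, num_workers=4):
    """Manual worker threads sharing a frontier queue.

    queue.Queue provides blocking hand-off and task accounting;
    the lock guards the check-then-add on the visited set.
    """
    frontier = queue.Queue()
    visited, lock = set(), threading.Lock()
    results = []
    for url in seeds:
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:              # poison pill: shut down
                frontier.task_done()
                return
            with lock:                   # atomic check-then-add
                seen = url in visited
                if not seen:
                    visited.add(url)
            if not seen:
                try:
                    links = fetch(url)
                    with lock:
                        results.append(url)
                    for link in links:   # enqueue before task_done so
                        frontier.put(link)  # join() can't return early
                except Exception:
                    pass                 # real code: retry with backoff
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()                      # all queued work finished
    for _ in threads:
        frontier.put(None)               # one pill per worker
    for t in threads:
        t.join()
    return results
```

The ordering discipline matters for liveness: children are enqueued before `task_done()`, so `frontier.join()` cannot return while discovered work is still pending, and the pills are only injected once the frontier has truly drained.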
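For approach (c), a sketch on asyncio: cooperative workers share the frontier and visited set without locks, because control only transfers at `await` points. The `next_ok` dict is an assumed per‑host "earliest next fetch" scheme standing in for a fuller token bucket, and `fetch` is a caller‑supplied coroutine:

```python
import asyncio
import time
from urllib.parse import urlparse

async def async_crawl(seeds, fetch, num_workers=4, per_host_delay=0.0):
    """Event-loop crawl with simple per-host politeness delays."""
    frontier = asyncio.Queue()
    visited = set()
    next_ok = {}                         # host -> earliest allowed fetch time
    results = []
    for url in seeds:
        frontier.put_nowait(url)

    async def worker():
        while True:
            url = await frontier.get()
            try:
                if url in visited:       # safe: no await between check & add
                    continue
                visited.add(url)
                host = urlparse(url).netloc
                now = time.monotonic()
                delay = next_ok.get(host, 0.0) - now
                next_ok[host] = max(next_ok.get(host, 0.0), now) + per_host_delay
                if delay > 0:
                    await asyncio.sleep(delay)  # politeness gap for this host
                results.append(url)
                for link in await fetch(url):
                    frontier.put_nowait(link)
            finally:
                frontier.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await frontier.join()                # every discovered URL processed
    for w in workers:
        w.cancel()                       # graceful shutdown of idle workers
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Compared with the threaded versions, this model makes the visited check race‑free by construction and scales to many more in‑flight fetches per process, at the cost of requiring every I/O call (DNS, HTTP, parsing handoff) to be non‑blocking.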
Quick Answer: This question evaluates a candidate's ability to design a concurrent system: choosing and applying concurrency primitives, managing a URL frontier, enforcing per-host politeness and rate limits, building in fault tolerance (retries, backoff, graceful shutdown), and analyzing correctness, liveness, and performance trade-offs across manual threading, thread-pool, and event-loop models of a web crawler.