Scale crawler with thread pool
Company: Anthropic
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Refactor the crawler to run concurrently using a bounded thread pool. Design a thread-safe URL frontier (work queue) and a thread-safe visited set to prevent duplicate fetches. Explain the worker lifecycle, how workers acquire tasks, and the conditions under which the pool terminates. Describe per-host rate limiting, timeouts, retries for transient failures, and how to support cancellation and graceful shutdown. Compare four approaches (coarse-grained locks, fine-grained locks, lock-free or concurrent data structures, and message-queue-based designs) and analyze their trade-offs in contention, throughput, fairness, and memory usage.
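One way to make the frontier, visited set, and termination condition concrete is the minimal Python sketch below. It is an illustration under stated assumptions, not a reference solution: `crawl`, `fetch_links`, `try_enqueue`, `num_workers`, and `max_pages` are hypothetical names, and `fetch_links(url)` stands in for whatever fetch-and-parse step the existing crawler has. The frontier is a `queue.Queue`, the visited set is a plain `set` behind a single coarse-grained lock, and the pool terminates when `Queue.join()` reports that every queued URL has been processed.

```python
import queue
import threading


def crawl(seeds, fetch_links, num_workers=4, max_pages=100, stop=None):
    """Crawl from `seeds` using a fixed number of worker threads.

    `fetch_links(url)` is a hypothetical callback that returns the URLs
    found on a page. `stop` is an optional threading.Event for cancellation.
    """
    frontier = queue.Queue()            # thread-safe FIFO work queue
    visited = set()                     # guarded by visited_lock
    visited_lock = threading.Lock()
    stop = stop or threading.Event()

    def try_enqueue(url):
        # Check-and-add is done under one lock so a URL is queued at most once.
        with visited_lock:
            if url in visited or len(visited) >= max_pages:
                return
            visited.add(url)
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()        # blocks until work or a sentinel arrives
            try:
                if url is None:         # sentinel: the pool is shutting down
                    return
                if stop.is_set():       # cancelled: drain without fetching
                    continue
                try:
                    links = fetch_links(url)
                except Exception:
                    links = ()          # a real crawler would log or retry here
                for link in links:
                    try_enqueue(link)
            finally:
                # Children are enqueued before the parent is marked done,
                # so the unfinished-task count cannot reach zero early.
                frontier.task_done()

    for seed in seeds:
        try_enqueue(seed)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()                     # every queued URL has been processed
    for _ in threads:
        frontier.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return visited
```

The queue is left unbounded on purpose. Because workers are also producers, a bounded queue could deadlock once every worker blocks in `put`. In this sketch memory is capped by `max_pages` instead. A single lock around the visited set is the coarse-grained option from the comparison; sharding the set by URL hash would be one fine-grained alternative.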
Quick Answer: This question evaluates system design and concurrent programming competencies, focusing on bounded thread-pool architecture, thread-safe frontier and deduplication structures, per-host politeness and in-flight limits, reliability mechanisms (timeouts, retries, backoff), and graceful shutdown and termination semantics.
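Per-host politeness and retries with backoff could be sketched roughly as follows, in Python. Every name here (`HostRateLimiter`, `TransientError`, `fetch_with_retries`, `min_interval`, `base_delay`) is a hypothetical illustration rather than part of any particular crawler, and the `fetch` callback is assumed to apply its own network timeout and to raise `TransientError` for retryable failures such as timeouts or 5xx responses.

```python
import random
import threading
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}         # host -> earliest allowed start time
        self._lock = threading.Lock()

    def acquire(self, url):
        """Reserve the next slot for the URL's host; return seconds waited."""
        host = urlsplit(url).hostname or ""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(host, now))
            self._next_allowed[host] = start + self._min_interval
        wait = start - now
        if wait > 0:
            self._sleep(wait)           # sleep outside the lock
        return wait


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx)."""


def fetch_with_retries(fetch, url, limiter, max_attempts=3,
                       base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        limiter.acquire(url)            # every attempt respects the host limit
        try:
            return fetch(url)
        except TransientError:
            if attempt == max_attempts - 1:
                raise                   # out of attempts: surface the error
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Reserving the slot under the lock and sleeping outside it keeps workers bound for different hosts from blocking one another. The trade-off is that a worker waiting on a slow host still occupies a pool thread, so a design with many hosts might prefer per-host queues that hand workers only URLs that are ready to fetch.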