This question evaluates competency in concurrent systems, scalable data ingestion, and algorithmic design for a high-throughput web crawler, covering URL normalization, deduplication, per-host rate limiting, fault-tolerant checkpointing, and bounded-memory queueing.
Design and code (pseudocode is acceptable) a multi-threaded web crawler that favors breadth-first discovery while continuously running analysis tasks on fetched pages.

Constraints:
1) Respect robots.txt and per-host rate limits.
2) Dedupe URLs (including normalization and canonical redirects) with at-most-once processing.
3) Keep memory bounded: spill queues to disk when needed.
4) Apply back-pressure so analysis cannot starve crawling and vice versa.
5) Support graceful shutdown, with checkpointing that gives exactly-once recovery on restart.

In your answer:
a) Define the core data structures (frontier, seen-set, per-host token buckets) and their big-O behavior (illustrative sketches follow the question).
b) Show how you avoid deadlocks and priority inversion (e.g., work-stealing, fine-grained locks, or lock-free queues) and how you detect and handle thread-safety bugs.
c) Explain how strict BFS ordering degrades with many hosts and how you would approximate it.
d) Specify the metrics you would track to verify a 50%+ throughput gain, and describe the controlled benchmark you would run to prove it.
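
As a hedged illustration of part (a), the sketch below shows one way the dedup seen-set and per-host rate limiter could look in Python. The names (normalize, SeenSet, TokenBucket) and the specific normalization rules are assumptions made for this example, not requirements of the question.

```python
import threading
import time
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Canonicalize a URL so equivalent forms map to one dedup key:
    lowercase scheme and host, drop default ports, fragments, empty paths."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

class SeenSet:
    """Thread-safe at-most-once claim over normalized URLs.
    claim() returns True only for the first caller of a given URL;
    O(1) expected time per call, O(n) memory in distinct URLs."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._seen: set[str] = set()

    def claim(self, url: str) -> bool:
        key = normalize(url)
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True

class TokenBucket:
    """Per-host limiter: `rate` requests/second with bursts up to `capacity`.
    try_acquire() is O(1) and non-blocking, so a fetcher thread that is
    refused can move on to another host instead of waiting."""
    def __init__(self, rate: float, capacity: float) -> None:
        self._rate = rate
        self._capacity = capacity
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False
```

An in-memory set() keeps the sketch short; to satisfy the bounded-memory constraint, a production seen-set would typically be backed by a Bloom filter plus an on-disk store.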
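
For part (c) and constraint 3, one common approximation (assumed here, not mandated by the question) is a frontier of per-host FIFO queues served round-robin: strict global BFS order is already perturbed by politeness delays, so fair host rotation preserves the breadth-first flavor while keeping per-host order. The sketch below elides robots.txt checks and the disk-spill path.

```python
import collections
import threading
from typing import Callable, Optional, Tuple

class Frontier:
    """Approximate-BFS frontier: one FIFO per host, hosts rotated fairly.
    push() and pop() are O(1) amortized; pop() skips hosts whose rate
    limiter currently refuses a token. Disk spill is marked but elided."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._queues: dict[str, collections.deque] = {}
        self._hosts: collections.deque = collections.deque()

    def push(self, host: str, url: str) -> None:
        with self._lock:
            if host not in self._queues:
                self._queues[host] = collections.deque()
                self._hosts.append(host)
            # When the in-memory budget is exceeded, the tail of this deque
            # would be spilled to a per-host file instead of appended here.
            self._queues[host].append(url)

    def pop(self, may_fetch: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
        """may_fetch(host) -> bool, e.g. the host's TokenBucket.try_acquire.
        Returns (host, url), or None if no host is fetchable right now."""
        with self._lock:
            for _ in range(len(self._hosts)):
                host = self._hosts[0]
                self._hosts.rotate(-1)  # move this host to the back: fair rotation
                if self._queues[host] and may_fetch(host):
                    return host, self._queues[host].popleft()
            return None
```

Calling may_fetch under the frontier lock is tolerable only because it is non-blocking; a blocking politeness check here would reintroduce the priority-inversion risk that part (b) asks about.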