This question evaluates understanding of distributed systems and large-scale web crawling, including coordination and state management, URL deduplication and scheduling, politeness and per-host rate limiting, fault tolerance, load balancing, and storage for fetched content.
Design a production-ready web crawler that starts from a single seed URL and scales crawling across 1,000 heterogeneous devices. The crawler should respect robots.txt and per-host politeness constraints, deduplicate URLs/content, and persist pages and metadata.
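A minimal, single-process sketch of the core crawl loop may help anchor the design discussion. It shows robots.txt handling, per-host politeness, and URL deduplication in one place; the in-memory frontier, `seen_urls` set, and `POLITENESS_DELAY` value are stand-ins (assumptions for illustration) for what would be a distributed queue, a shared dedup store such as Redis or a Bloom filter, and a per-host crawl-delay policy in the real system.

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                      # third-party; HTTP fetching
from bs4 import BeautifulSoup        # third-party; link extraction

POLITENESS_DELAY = 1.0               # seconds between requests to the same host (assumption)

seen_urls = set()                    # stand-in for a distributed dedup store
last_fetch_by_host = {}              # host -> timestamp of last request
robots_cache = {}                    # host -> parsed robots.txt rules


def allowed_by_robots(url: str, agent: str = "MyCrawler") -> bool:
    """Check robots.txt for the URL's host, caching the parsed rules per host."""
    parts = urlparse(url)
    rp = robots_cache.get(parts.netloc)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass                     # unreachable robots.txt: fail open or closed per policy
        robots_cache[parts.netloc] = rp
    return rp.can_fetch(agent, url)


def crawl(seed: str, max_pages: int = 100) -> None:
    frontier = deque([seed])
    while frontier and len(seen_urls) < max_pages:
        url = frontier.popleft()
        if url in seen_urls or not allowed_by_robots(url):
            continue
        host = urlparse(url).netloc
        # Per-host politeness: wait until the delay has elapsed for this host.
        wait = POLITENESS_DELAY - (time.time() - last_fetch_by_host.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "MyCrawler"})
        except requests.RequestException:
            continue                 # real system: retry queue / dead-letter handling
        last_fetch_by_host[host] = time.time()
        seen_urls.add(url)
        # Persisting resp.content and crawl metadata would happen here.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen_urls:
                frontier.append(link)


if __name__ == "__main__":
    crawl("https://example.com")
```

One common way to distribute this across 1,000 workers is to partition the frontier by hashing the URL's host, so every host is owned by exactly one worker; politeness and robots.txt caching then stay local to that worker and need no cross-node coordination, while deduplication and persistence move to shared services.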