This question evaluates the ability to design scalable, fault-tolerant distributed systems for web crawling, covering competencies such as URL deduplication and politeness (rate limiting and respecting robots.txt).
Design a web crawler that starts from one or more seed URLs and continuously discovers and fetches pages.
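To make the core requirements concrete, here is a minimal single-machine sketch of a URL frontier that handles deduplication (a seen-set) and politeness (a per-host minimum delay between fetches). All names are illustrative assumptions, not part of the question; robots.txt checking and actual fetching are omitted for brevity.

```python
import time
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """Illustrative URL frontier: dedup via a seen-set, politeness via
    per-host queues with a minimum delay between fetches to one host."""

    def __init__(self, delay_s=1.0):
        self.delay_s = delay_s      # minimum seconds between hits to one host
        self.seen = set()           # URLs already enqueued (deduplication)
        self.host_queues = {}       # host -> deque of pending URLs
        self.next_allowed = {}      # host -> earliest allowed fetch time

    def add(self, url):
        """Enqueue a URL unless it was seen before; returns True if enqueued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        self.host_queues.setdefault(host, deque()).append(url)
        return True

    def pop(self, now=None):
        """Return a URL whose host may be fetched now, or None if all hosts
        are either empty or still inside their politeness delay."""
        now = time.monotonic() if now is None else now
        for host, q in self.host_queues.items():
            if q and self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay_s
                return q.popleft()
        return None
```

In a real design the seen-set would be a Bloom filter or key-value store (it cannot fit in memory at web scale), and the frontier would also apply per-host robots.txt rules, e.g. via `urllib.robotparser`, before enqueueing.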
How would you redesign or optimize the crawler when you have multiple servers (many crawler workers)? You don't need to implement code; describe the architecture and the key data structures and services.
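One common answer to the multi-server question is to shard the URL space by host, so that every URL for a given host is routed to the same worker; that worker can then keep its politeness state and robots.txt cache local. A minimal sketch of such a routing function (the function name and worker count are assumptions for illustration; production systems often prefer consistent or rendezvous hashing so that adding a worker reshuffles only a fraction of hosts):

```python
import hashlib
from urllib.parse import urlsplit

def worker_for_url(url, num_workers):
    """Route a URL to a worker by hashing its host, so all URLs of one
    host land on the same worker and politeness stays per-worker-local."""
    host = urlsplit(url).netloc
    # Stable hash across processes (Python's hash() is salted per run).
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

With this partitioning, the remaining shared pieces are the global seen-set (e.g. a distributed key-value store or partitioned Bloom filters) and a coordination service for worker membership and failure detection.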