This question evaluates a candidate's understanding of concurrent system design, web crawling fundamentals, URL frontier organization, duplicate detection, concurrency control, network I/O models, fault tolerance, and politeness mechanisms such as throttling.
Design a crawler that starts from one seed URL and explores all reachable pages in the same domain efficiently.
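Assume the workload is primarily fetching web pages over the network, so it is I/O-bound rather than CPU-bound.
Discuss how you would organize the URL frontier, detect duplicate URLs, control concurrency, choose a network I/O model, tolerate failures, and throttle requests politely.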
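One possible shape for an answer is sketched below in Python. It is a minimal illustration under stated assumptions, not a reference solution: the names `crawl`, `normalize`, `fetch_links`, `max_concurrency`, and `delay` are hypothetical, and page fetching is injected as an async callable so the sketch stays independent of any particular HTTP library. It uses a shared queue as the frontier, a `seen` set for duplicate detection, a fixed pool of workers to bound concurrency, and an optional per-worker pause as a crude form of throttling.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Set
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical fetch signature: takes a URL, returns the raw hrefs found on that page.
FetchLinks = Callable[[str], Awaitable[Iterable[str]]]


def normalize(base: str, href: str) -> str:
    """Resolve href against base and drop the #fragment so duplicates collapse."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    return absolute


async def crawl(seed: str, fetch_links: FetchLinks,
                max_concurrency: int = 8, delay: float = 0.0) -> Set[str]:
    """Return every same-host URL discovered from seed (including pages that failed)."""
    host = urlsplit(seed).hostname
    seen = {seed}  # every URL ever enqueued; checked before enqueueing
    frontier: asyncio.Queue = asyncio.Queue()
    frontier.put_nowait(seed)

    async def worker() -> None:
        while True:
            url = await frontier.get()
            try:
                for href in await fetch_links(url):
                    nxt = normalize(url, href)
                    # No await between the check and the add, so on a single
                    # event loop this needs no lock.
                    if urlsplit(nxt).hostname == host and nxt not in seen:
                        seen.add(nxt)
                        frontier.put_nowait(nxt)
            except Exception:
                pass  # sketch only: a failed page is skipped, not retried
            finally:
                frontier.task_done()
            if delay:
                await asyncio.sleep(delay)  # crude politeness pause per worker

    workers = [asyncio.create_task(worker()) for _ in range(max_concurrency)]
    await frontier.join()  # returns once every enqueued URL has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen
```

Because the work is dominated by waiting on the network, a single-threaded event loop with many in-flight requests is one reasonable choice; a thread pool with a lock around the `seen` set would be an equally defensible answer. A fuller design would also retry transient failures with backoff, honor `robots.txt`, and rate-limit per host rather than per worker.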