This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.
You are implementing a simple web crawler. You are given a starting URL, startUrl, and a method

List<String> getUrls(String url)

which returns all URLs (as strings) found on the page at url.

Return all unique URLs that are reachable from startUrl by repeatedly calling getUrls, subject to these constraints:

- Crawl only URLs whose hostname is the same as the hostname of startUrl.
- Do not fetch the same URL more than once.
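One way to satisfy these constraints is a breadth-first traversal with a visited set for duplicate detection and a hostname check as the filter. The sketch below is a minimal single-threaded version; getUrls is stubbed with a small in-memory link graph (the a.com and b.com pages are hypothetical, used only so the example runs standalone):

```java
import java.util.*;

public class Crawler {
    // Hypothetical stub for the provided getUrls(url) method: a tiny
    // in-memory link graph standing in for real HTTP fetches.
    static final Map<String, List<String>> PAGES = Map.of(
        "http://a.com/",  List.of("http://a.com/x", "http://b.com/"),
        "http://a.com/x", List.of("http://a.com/", "http://a.com/y"),
        "http://a.com/y", List.of(),
        "http://b.com/",  List.of("http://b.com/z"));

    static List<String> getUrls(String url) {
        return PAGES.getOrDefault(url, List.of());
    }

    // Extract the hostname, e.g. "http://a.com/x" -> "a.com".
    static String hostname(String url) {
        String rest = url.substring(url.indexOf("//") + 2);
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public static List<String> crawl(String startUrl) {
        String host = hostname(startUrl);
        Set<String> visited = new HashSet<>();    // duplicate detection
        Deque<String> queue = new ArrayDeque<>(); // BFS frontier
        visited.add(startUrl);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : getUrls(url)) {
                // Same-hostname filter plus visited check; add() returns
                // false when the URL has already been seen.
                if (hostname(next).equals(host) && visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        List<String> result = crawl("http://a.com/");
        Collections.sort(result);
        System.out.println(result); // the three reachable a.com pages
    }
}
```

A depth-first traversal works equally well here; BFS is shown because the explicit queue maps directly onto the work queue used in the multithreaded variant.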
How would you modify your crawler to use multiple threads to improve throughput while still ensuring that:

- each URL is fetched at most once, and
- the set of visited URLs is updated in a thread-safe manner?
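One possible design (a sketch, not the only correct answer): a fixed thread pool fetches pages concurrently, a ConcurrentHashMap-backed set provides an atomic claim-before-fetch check so each URL is fetched at most once, and a Phaser counts outstanding tasks so the main thread knows when the crawl has finished. The page graph is again a hypothetical in-memory stub:

```java
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentCrawler {
    // Hypothetical stub for getUrls(url); a real crawler would fetch over HTTP.
    static final Map<String, List<String>> PAGES = Map.of(
        "http://a.com/",  List.of("http://a.com/x", "http://b.com/"),
        "http://a.com/x", List.of("http://a.com/y"),
        "http://a.com/y", List.of("http://a.com/"));

    static List<String> getUrls(String url) {
        return PAGES.getOrDefault(url, List.of());
    }

    static String hostname(String url) {
        String rest = url.substring(url.indexOf("//") + 2);
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public static List<String> crawl(String startUrl) {
        String host = hostname(startUrl);
        // Thread-safe set: add() atomically claims a URL, returning false
        // if another thread already claimed it, so no page is fetched twice.
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Phaser party count = outstanding tasks + 1 for the main thread.
        Phaser phaser = new Phaser(1);
        visited.add(startUrl);
        submit(startUrl, host, visited, pool, phaser);
        phaser.arriveAndAwaitAdvance(); // block until every task has arrived
        pool.shutdown();
        return new ArrayList<>(visited);
    }

    static void submit(String url, String host, Set<String> visited,
                       ExecutorService pool, Phaser phaser) {
        // Register before submitting so the task is counted even if the
        // parent finishes first.
        phaser.register();
        pool.submit(() -> {
            try {
                for (String next : getUrls(url)) {
                    if (hostname(next).equals(host) && visited.add(next)) {
                        submit(next, host, visited, pool, phaser);
                    }
                }
            } finally {
                phaser.arriveAndDeregister(); // this task is done
            }
        });
    }

    public static void main(String[] args) {
        List<String> result = crawl("http://a.com/");
        Collections.sort(result);
        System.out.println(result);
    }
}
```

The key correctness point is that visited.add(next) is the single atomic decision point: whichever thread wins the add() is the one that submits the fetch task, so duplicate detection and work scheduling cannot race. A Phaser is used rather than a CountDownLatch because the number of tasks is not known up front.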