This question evaluates skills in web crawling, URL parsing and fragment sanitization, graph traversal for deduplication, and concurrent programming for thread-safe crawling; it falls under the Coding & Algorithms domain.

You are given:
- startUrl (e.g., "http://news.example.com/a/index.html").
- HtmlParser, with a method getUrls(url) that returns all URLs found on the page at url.
Your task is to crawl web pages starting from startUrl and return all unique pages that are reachable by following links, subject to the rules below.
- Crawl only links that are under the same hostname as startUrl.
- The hostname is the part between the protocol (http:// or https://) and the next /.
- URLs may contain a fragment introduced by # (e.g., http://a.com/x#section2).
- Sanitize each URL by removing its fragment (the # and everything after it) before deduplication: http://a.com/x#p1 and http://a.com/x#p2 should be treated as the same page: http://a.com/x.
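As a concrete illustration of the two URL rules above, here is a minimal Python sketch; the helper names get_hostname and sanitize are hypothetical, not part of the problem's API.

```python
def get_hostname(url):
    # The hostname is the text between the protocol ("http://" or
    # "https://") and the next "/".
    return url.split("://", 1)[1].split("/", 1)[0]

def sanitize(url):
    # Strip the fragment: drop "#" and everything after it.
    return url.split("#", 1)[0]

assert get_hostname("http://news.example.com/a/index.html") == "news.example.com"
assert sanitize("http://a.com/x#p1") == sanitize("http://a.com/x#p2") == "http://a.com/x"
```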
First, implement a single-threaded crawler that returns the set (or list) of visited URLs (after stripping fragments), restricted to the same hostname.
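One natural approach is an iterative BFS over the link graph, deduplicating on sanitized URLs. The sketch below assumes the get_hostname and sanitize helpers above and an html_parser object exposing the problem's getUrls(url) method:

```python
from collections import deque

def crawl(start_url, html_parser):
    # BFS from start_url, visiting each sanitized URL at most once and
    # staying on the same hostname as start_url.
    host = get_hostname(start_url)
    start = sanitize(start_url)
    visited = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in html_parser.getUrls(url):
            clean = sanitize(link)
            if get_hostname(clean) == host and clean not in visited:
                visited.add(clean)
                queue.append(clean)
    return list(visited)
```

A DFS with an explicit stack (or recursion) works equally well; only the deduplication on sanitized URLs matters for correctness.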
Then, implement a multi-threaded crawler to speed up crawling; calls to HtmlParser.getUrls may run concurrently.
Return all unique sanitized URLs that are reachable from startUrl and satisfy the hostname constraint.
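One common design, sketched below under the same assumptions as before, hands the slow getUrls calls to a thread pool while the main thread alone dispatches work and updates the visited set; crawl_concurrent and the workers parameter are hypothetical names, not part of the problem statement.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def crawl_concurrent(start_url, html_parser, workers=8):
    # Only the (slow, IO-like) getUrls calls run on worker threads; the
    # main thread owns the visited set, so this design needs no lock.
    host = get_hostname(start_url)
    start = sanitize(start_url)
    visited = {start}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(html_parser.getUrls, start)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for link in future.result():
                    clean = sanitize(link)
                    if get_hostname(clean) == host and clean not in visited:
                        visited.add(clean)
                        pending.add(pool.submit(html_parser.getUrls, clean))
    return list(visited)
```

An alternative is to let worker threads discover and enqueue links themselves; that variant must guard the shared visited set with a mutex (e.g., threading.Lock), which is the thread-safety concern this part of the question probes.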
HtmlParser.getUrls(url) is a black box and may be slow (network/IO-like). All URLs returned by getUrls are absolute URLs.