Implement a web crawler that starts from a given URL and visits every reachable page on the same host.
You are given:
- `start_url`: a string representing the first page to crawl.
- `fetch_links(url) -> list[str]`: a blocking function that returns all URLs found on the page.
Requirements:
- Crawl pages concurrently using multiple threads.
- Only visit URLs whose host is exactly the same as the host of `start_url`.
- Visit each eligible URL at most once, even if multiple pages link to it.
- Return all visited URLs in any order.
- Your implementation must be thread-safe.
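The same-host requirement can be checked by comparing the `netloc` component of each URL. A minimal sketch, assuming the standard library `urllib.parse` is acceptable (the helper name `same_host` is illustrative, not part of the problem):

```python
from urllib.parse import urlsplit

def same_host(url: str, start_url: str) -> bool:
    # netloc compares host and port exactly, so "sub.example.com"
    # and "example.com" are treated as different hosts.
    return urlsplit(url).netloc == urlsplit(start_url).netloc
```

Note that this is an exact string comparison: subdomains and differing ports do not match.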
You may assume:
- `fetch_links` is thread-safe.
- URL strings are valid.
- The crawl graph is finite.
Discuss the main synchronization challenges, such as protecting the shared visited set and coordinating worker threads efficiently.
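One way to address these challenges is to guard the visited set with a lock and claim each URL inside the critical section before it is scheduled, so no URL is ever submitted twice. The sketch below is one possible approach using a thread pool; the function name `crawl` and the worker structure are illustrative assumptions, not a prescribed solution:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

def crawl(start_url, fetch_links, num_workers=8):
    host = urlsplit(start_url).netloc
    visited = {start_url}          # shared state, guarded by `lock`
    lock = threading.Lock()

    def worker(url):
        # fetch_links is assumed thread-safe, so it runs outside the lock.
        new_urls = []
        for link in fetch_links(url):
            if urlsplit(link).netloc != host:
                continue           # skip URLs on a different host
            with lock:
                if link not in visited:
                    # Claim the URL before scheduling it, so concurrent
                    # workers cannot both decide to crawl the same link.
                    visited.add(link)
                    new_urls.append(link)
        return new_urls

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = {pool.submit(worker, start_url)}
        while pending:
            future = pending.pop()
            # result() blocks for this future, but other workers in the
            # pool keep fetching concurrently in the meantime.
            for link in future.result():
                pending.add(pool.submit(worker, link))
    return list(visited)
```

The key design choice is that membership testing and insertion into `visited` happen atomically under the lock, while the slow `fetch_links` call happens outside it; this keeps the critical section short and lets fetches overlap.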