Implement a web crawler for a single website.
You are given:

- A starting URL `startUrl`.
- A function `fetch(url)` that returns the HTML content of a page or raises an error.
- A function `extractLinks(html)` that returns all links found on the page.
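The two given helpers can be assumed to have signatures along these lines (the type hints and the exact error behavior are assumptions; the problem statement does not specify them):

```python
from typing import List

def fetch(url: str) -> str:
    """Assumed interface: returns the page's HTML as a string,
    or raises an error when the fetch fails."""
    raise NotImplementedError  # stub; provided by the environment

def extractLinks(html: str) -> List[str]:
    """Assumed interface: returns every link found in the HTML.
    Links may be relative or carry fragments, so the crawler
    must normalize them before use."""
    raise NotImplementedError  # stub; provided by the environment
```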
Write a crawler that:

- Traverses pages in breadth-first order starting from `startUrl`.
- Only visits pages in the same domain as `startUrl`.
- Normalizes URLs by resolving relative paths, removing fragments, and treating equivalent URLs as the same page.
- Retries transient fetch failures up to 3 times before giving up.
- Skips permanently failed pages.
- Never visits the same normalized URL more than once.

Return the list of visited URLs in BFS order.
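The normalization requirement can be sketched with the standard library's `urllib.parse`. This is one reasonable interpretation of "equivalent URLs" (lowercase scheme and host, no fragment, no trailing slash); the helper name `normalize` and the exact equivalence rules are assumptions, not part of the problem:

```python
from urllib.parse import urljoin, urldefrag, urlparse, urlunparse

def normalize(base_url: str, link: str) -> str:
    """Resolve a possibly-relative link against base_url, drop the
    fragment, lowercase the scheme and host, and strip a trailing
    slash so equivalent URLs compare equal as plain strings."""
    absolute, _fragment = urldefrag(urljoin(base_url, link))
    parts = urlparse(absolute)
    path = parts.path.rstrip("/") or "/"  # treat /a/ and /a as the same page
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.params, parts.query, ""))
```

Normalizing at the moment a link is discovered, before the duplicate check, is what makes the visited set reliable.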
In your implementation, be prepared to explain your queue and visited-set design, how you detect duplicate URLs, and how retry logic interacts with the crawl loop.
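One way the pieces above fit together is sketched below: a `deque` as the BFS queue, a `seen` set that records URLs at enqueue time (so a URL can never be queued twice), and a bounded retry loop around `fetch` inside the crawl loop. The `fetch` and `extractLinks` parameters mirror the problem's givens; the helper name `normalize`, the `max_retries` parameter, and the reading of "retries up to 3 times" as 3 attempts total are assumptions:

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse, urlunparse

def crawl(startUrl, fetch, extractLinks, max_retries=3):
    """BFS-crawl pages on startUrl's domain; return successfully
    visited normalized URLs in BFS order."""
    def normalize(base, link):
        # Resolve relative links, drop fragments, canonicalize case/slash.
        absolute, _ = urldefrag(urljoin(base, link))
        p = urlparse(absolute)
        return urlunparse((p.scheme.lower(), p.netloc.lower(),
                           p.path.rstrip("/") or "/", p.params, p.query, ""))

    start = normalize(startUrl, startUrl)
    domain = urlparse(start).netloc
    queue = deque([start])
    seen = {start}   # enqueued-or-visited: each URL is fetched at most once
    order = []       # successfully visited URLs in BFS order

    while queue:
        url = queue.popleft()
        html = None
        for _attempt in range(max_retries):  # retry transient failures
            try:
                html = fetch(url)
                break
            except Exception:  # assumption: any raise is potentially transient
                continue
        if html is None:
            continue  # permanently failed: skip it, but it stays in `seen`
        order.append(url)
        for link in extractLinks(html):
            norm = normalize(url, link)
            if urlparse(norm).netloc == domain and norm not in seen:
                seen.add(norm)     # mark at enqueue time, not visit time,
                queue.append(norm) # so duplicates in one page are caught
    return order
```

Adding URLs to `seen` when they are enqueued rather than when they are dequeued is the key design choice: it prevents the same normalized URL from entering the queue twice, and it keeps retry logic orthogonal to deduplication, since a failed page is never re-enqueued.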