Examples
Input: ('https://Example.com:443/a/index.html#top', {'https://example.com/a/index.html': ['../about', '/contact#team', 'https://other.com/home', '../about#bio', '/a/./products'], 'https://example.com/about': ['team', '/contact', 'https://example.com:443/a/index.html'], 'https://example.com/contact': ['/missing', '/about#again'], 'https://example.com/a/products': [], 'https://example.com/team': []}, {'https://example.com/contact': 1, 'https://example.com/missing': -1})
Expected Output: ['https://example.com/a/index.html', 'https://example.com/about', 'https://example.com/contact', 'https://example.com/a/products', 'https://example.com/team']
Explanation: The crawler normalizes the start URL, ignores the external domain, deduplicates repeated links to /about, retries /contact once before succeeding, and skips /missing because it permanently fails.
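The normalization behavior described above can be sketched in Python, assuming the canonical form is lowercase host, no fragment, no default port, and resolved `.`/`..` path segments (the function names here are hypothetical, not part of the problem statement):

```python
from urllib.parse import urlsplit, urlunsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url: str) -> str:
    """Return a canonical form of `url`: lowercase host, fragment removed,
    default port dropped, and dot segments in the path resolved."""
    p = urlsplit(url)
    netloc = (p.hostname or "").lower()
    if p.port is not None and p.port != _DEFAULT_PORTS.get(p.scheme):
        netloc = f"{netloc}:{p.port}"
    # Resolve '.' and '..' segments while preserving a trailing slash.
    out = []
    for seg in p.path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:
                out.pop()
        else:
            out.append(seg)
    path = "/".join(out) or "/"
    return urlunsplit((p.scheme, netloc, path, p.query, ""))
```

For instance, `normalize('https://Example.com:443/a/index.html#top')` yields `'https://example.com/a/index.html'`, matching the start URL of this example.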
Input: ('http://site.com', {'http://site.com/': ['/slow', '/ok', '/ok#frag'], 'http://site.com/slow': [], 'http://site.com/ok': ['/child'], 'http://site.com/child': []}, {'http://site.com/slow': 4})
Expected Output: ['http://site.com/', 'http://site.com/ok', 'http://site.com/child']
Explanation: The root page is visited first. /slow is discovered before /ok, but it requires 4 transient failures before it would succeed, which exceeds the retry budget of 3, so it is skipped. /ok and then /child are visited.
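The retry rule implied by this example can be sketched as follows, assuming the budget means one initial attempt plus up to 3 retries, so a page succeeding after `n` transient failures is reachable exactly when `n <= 3` (the helper name and dict shape are assumptions mirroring the failure map above):

```python
def fetch_ok(url: str, failures: dict, max_retries: int = 3) -> bool:
    """Return True if `url` can be fetched within the retry budget.
    `failures[url]` is the number of transient failures before success;
    -1 marks a permanent failure that no amount of retrying fixes."""
    n = failures.get(url, 0)
    if n == -1:
        return False        # permanent failure: skip without retrying
    return n <= max_retries  # success fits in 1 attempt + max_retries retries
```

Under this reading, /contact with 1 transient failure succeeds on its first retry, while /slow with 4 transient failures exhausts the budget and is skipped.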
Input: ('https://a.com/start', {'https://a.com/start': ['/x'], 'https://a.com/x': []}, {'https://a.com/start': -1})
Expected Output: []
Explanation: The start page permanently fails, so nothing is successfully visited.
Input: ('https://docs.site.com/guide/', {'https://docs.site.com/guide/': ['./intro', '../guide/./faq#top', 'https://docs.site.com:443/guide/intro', 'https://blog.site.com/post', '../guide/intro/../api'], 'https://docs.site.com/guide/intro': [], 'https://docs.site.com/guide/faq': [], 'https://docs.site.com/guide/api': []}, {})
Expected Output: ['https://docs.site.com/guide/', 'https://docs.site.com/guide/intro', 'https://docs.site.com/guide/faq', 'https://docs.site.com/guide/api']
Explanation: The crawler preserves BFS order, resolves relative links from a directory URL, removes fragments and default ports, ignores the different hostname blog.site.com, and visits each normalized URL once.
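Putting the rules from all four examples together, here is a self-contained BFS sketch. The normalization inlined below, the retry budget of 3, and the same-host check are assumptions drawn from the explanations above, not a definitive implementation:

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}

def _normalize(url):
    # Lowercase host, drop fragment and default port, resolve dot segments.
    p = urlsplit(url)
    netloc = (p.hostname or "").lower()
    if p.port is not None and p.port != _DEFAULT_PORTS.get(p.scheme):
        netloc = f"{netloc}:{p.port}"
    out = []
    for seg in p.path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:
                out.pop()
        else:
            out.append(seg)
    return urlunsplit((p.scheme, netloc, "/".join(out) or "/", p.query, ""))

def crawl(start, graph, failures, max_retries=3):
    """BFS over the in-memory link `graph`, restricted to the start URL's
    host. Returns the normalized URLs in the order they are visited."""
    start = _normalize(start)
    host = urlsplit(start).netloc
    seen, order, queue = {start}, [], deque([start])
    while queue:
        url = queue.popleft()
        n = failures.get(url, 0)
        if n == -1 or n > max_retries:
            continue  # permanent failure or over the retry budget: skip
        order.append(url)
        for link in graph.get(url, []):
            target = _normalize(urljoin(url, link))
            if urlsplit(target).netloc != host or target in seen:
                continue  # external domain or already discovered
            seen.add(target)
            queue.append(target)
    return order
```

Running this sketch on the four inputs above reproduces the expected outputs; in particular, a permanently failing start page yields an empty list because its links are never read.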