Examples
Input: ('http://news.example.com/a/index.html#top', {'http://news.example.com/a/index.html': ['http://news.example.com/b#section', 'http://other.example.com/x', 'http://news.example.com/c', 'http://news.example.com/b#other'], 'http://news.example.com/b': ['http://news.example.com/c#frag', 'http://news.example.com/a/index.html#again'], 'http://news.example.com/c': ['http://news.example.com/c#self']})
Expected Output: ['http://news.example.com/a/index.html', 'http://news.example.com/b', 'http://news.example.com/c']
Explanation: The crawler starts at the fragment-free URL `http://news.example.com/a/index.html`. It ignores the link to the external host `other.example.com`, strips fragments before deduplication (so `b#section` and `b#other` collapse into one page), and visits pages a, b, and c.
Input: ('https://a.com/home', {'https://a.com/home': ['https://a.com/about#team', 'https://b.com/out', 'https://a.com/about#company'], 'https://a.com/about': ['https://a.com/home#top', 'https://a.com/about#self']})
Expected Output: ['https://a.com/about', 'https://a.com/home']
Explanation: Only URLs on hostname `a.com` are crawled. The two fragment variants of the about page collapse into one visited URL.
Input: ('http://a.com#start', {'http://a.com': ['http://a.com/#top', 'http://a.com:80/#frag', 'https://a.com/path#x', 'http://b.com/page'], 'http://a.com/': ['http://a.com#again'], 'http://a.com:80/': []})
Expected Output: ['http://a.com', 'http://a.com/', 'http://a.com:80/', 'https://a.com/path']
Explanation: All returned URLs share hostname `a.com`, yet they remain distinct visited URLs because only fragments are removed: differing schemes, trailing slashes, and explicit ports are preserved. The page `https://a.com/path`, absent from `web`, is still visited and treated as having no outgoing links.
Input: ('https://solo.example.org/page#frag', {})
Expected Output: ['https://solo.example.org/page']
Explanation: Even when the start page is missing from `web`, it is still visited: crawling begins there after fragment removal, and the missing entry simply means it has no outgoing links.
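The behavior shown in these examples can be sketched as a breadth-first crawl that strips fragments with `urldefrag` and filters links by hostname via `urlparse` (note that `.hostname` excludes the port, which is why `http://a.com:80/` passes the same-host check). The function name `crawl` and the `(start_url, web)` signature are assumptions matching the example inputs, not a prescribed interface:

```python
from urllib.parse import urldefrag, urlparse

def crawl(start_url, web):
    """Hypothetical sketch: visit every same-hostname page reachable
    from start_url, removing '#fragment' parts before deduplication."""
    start, _ = urldefrag(start_url)       # drop the fragment, if any
    host = urlparse(start).hostname       # hostname only; port excluded
    visited = {start}
    queue = [start]
    while queue:
        url = queue.pop()
        for link in web.get(url, []):     # missing pages yield no links
            clean, _ = urldefrag(link)
            if urlparse(clean).hostname == host and clean not in visited:
                visited.add(clean)
                queue.append(clean)
    return sorted(visited)
```

Because deduplication happens on the fragment-free URL, `https://a.com/about#team` and `https://a.com/about#company` in the second example count as one page; the sorted return order matches the expected outputs above.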