Crawl Same-Domain Links
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of web crawling and link traversal algorithms, including URL normalization, same-domain filtering, deduplication, and cycle-safe graph traversal.
Constraints
- 0 <= len(web) <= 10^4
- The total number of hyperlinks across all pages is at most 2 * 10^5
- All URLs are absolute HTTP/HTTPS-style URLs
- Same-domain means exact hostname match after normalization
Examples
Input: ('https://example.com', {'https://example.com': ['https://example.com/about', 'https://other.com/', 'https://example.com/blog'], 'https://example.com/about': ['https://example.com', 'https://example.com/contact'], 'https://example.com/blog': ['https://example.com/about', 'https://blog.example.com/post'], 'https://example.com/contact': []})
Expected Output: ['https://example.com/', 'https://example.com/about', 'https://example.com/blog', 'https://example.com/contact']
Explanation: The crawler stays on hostname `example.com`, ignores `other.com` and `blog.example.com`, and safely handles the cycle back to the home page.
Input: ('HTTPS://Example.com#top', {'https://example.com/': ['https://example.com/docs#section1', 'https://example.com/docs', 'https://example.com/faq'], 'https://example.com/docs': ['https://example.com/#footer'], 'https://example.com/faq': ['https://example.com/docs#section2']})
Expected Output: ['https://example.com/', 'https://example.com/docs', 'https://example.com/faq']
Explanation: The start URL is normalized to `https://example.com/`. Fragments like `#top` and `#section1` are removed, so duplicate visits are avoided.
Input: ('https://site.com/home', {'https://site.com/home': ['https://site.com/a', 'https://news.site.com/a', 'https://site.com/b'], 'https://site.com/a': ['https://site.com/missing', 'https://site.com/home'], 'https://site.com/b': []})
Expected Output: ['https://site.com/a', 'https://site.com/b', 'https://site.com/home', 'https://site.com/missing']
Explanation: The crawler ignores the subdomain `news.site.com`. It still includes `https://site.com/missing` because it is reachable, even though it has no entry in `web`.
Input: ('https://alone.com', {})
Expected Output: ['https://alone.com/']
Explanation: Even when the seed page is not present in `web`, it is still a reachable page and should be returned after normalization.
Input: ('https://dup.com', {'https://dup.com': ['https://dup.com/page#one'], 'https://dup.com#frag': ['https://dup.com/page'], 'https://dup.com/page': ['https://dup.com/#top', 'https://other.com/page']})
Expected Output: ['https://dup.com/', 'https://dup.com/page']
Explanation: Both `https://dup.com` and `https://dup.com#frag` normalize to the same page, and fragment-only differences do not cause duplicate visits.
Hints
- Model the website as a graph: pages are nodes and hyperlinks are directed edges. A stack or queue plus a visited set is enough.
- Use URL parsing to extract the hostname and to normalize each URL before storing it, so fragments and case differences do not create duplicate visits.