Crawl Same-Domain Links
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Implement a Python function that crawls a website starting from a seed URL and returns all unique pages reachable within the same domain.
Assume a helper function `get_links(url) -> list[str]` is available and returns all hyperlinks found on that page.
Requirements:
- Start from a single input URL.
- Only follow links whose hostname matches the hostname of the seed URL.
- Do not visit the same URL more than once.
- Handle cycles safely.
- Continue until there are no more unseen same-domain links.
You may choose either BFS or DFS traversal. Be prepared to discuss how you would normalize URLs to reduce duplicate visits.
Quick Answer: This question evaluates understanding of web crawling and link traversal algorithms, including URL normalization, same-domain filtering, deduplication, and cycle-safe graph traversal.
In a real crawler, you might be given a helper function `get_links(url)` that returns all hyperlinks found on a page. For this coding problem, that helper is simulated by a dictionary `web`, where each key is a URL and each value is the list of hyperlinks on that page.
Implement `solution(start_url, web)` to crawl the site starting from `start_url` and return every unique page reachable within the same domain.
Rules:
- Start from exactly one seed URL.
- Only follow links whose hostname exactly matches the hostname of the seed URL.
- Do not visit the same page more than once.
- Handle cycles safely.
- Continue until there are no more unseen same-domain links.
- If a reachable URL is not present as a key in `web`, treat it as a page with no outgoing links.
To make the result deterministic and to reduce duplicate visits, normalize every URL before comparing or storing it:
- lowercase the scheme and hostname
- remove the fragment part after `#`
- if the path is empty, treat it as `/`
Return the visited normalized URLs in lexicographic order.
Note: subdomains do not count as the same domain. For example, `example.com` and `blog.example.com` are different hostnames.
Constraints
- 0 <= len(web) <= 10^4
- The total number of hyperlinks across all pages is at most 2 * 10^5
- All URLs are absolute HTTP/HTTPS-style URLs
- Same-domain means exact hostname match after normalization
Examples
Input: ('https://example.com', {'https://example.com': ['https://example.com/about', 'https://other.com/', 'https://example.com/blog'], 'https://example.com/about': ['https://example.com', 'https://example.com/contact'], 'https://example.com/blog': ['https://example.com/about', 'https://blog.example.com/post'], 'https://example.com/contact': []})