Build a concurrent web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of concurrent programming, synchronization, thread safety, and graph traversal as applied to a multithreaded web crawler.
Constraints
- `start_url` is a non-empty string and represents a valid URL beginning with `http://`.
- The number of pages in `links` can be up to 10^4.
- The total number of links across all pages can be up to 10^5.
- There may be cycles, duplicate links, and URLs in link lists that are not keys in `links`.
Examples
Input: ("http://news.com", {"http://news.com": ["http://news.com/world", "http://sports.com", "http://news.com/about"], "http://news.com/world": ["http://news.com", "http://news.com/world/europe"], "http://news.com/about": [], "http://news.com/world/europe": []})
Expected Output: ["http://news.com", "http://news.com/about", "http://news.com/world", "http://news.com/world/europe"]
Explanation: Starting from news.com, only URLs on the hostname news.com are crawled. The sports.com link is ignored.
Input: ("http://a.com/x", {"http://a.com/x": ["http://a.com/y", "http://a.com/y", "http://b.com/z"], "http://a.com/y": ["http://a.com/x", "http://a.com/z"], "http://a.com/z": ["http://a.com/z"]})
Expected Output: ["http://a.com/x", "http://a.com/y", "http://a.com/z"]
Explanation: Duplicate links and cycles should not cause repeated crawling. The b.com URL is a different hostname and is skipped.
Input: ("http://solo.com/page", {})
Expected Output: ["http://solo.com/page"]
Explanation: Even if the page is not present in the mapping, the crawler still visits the start page itself.
Input: ("http://x.com", {"http://x.com": ["http://x.com/a", "http://x.com/b"], "http://x.com/b": ["http://x.com/c"]})
Expected Output: ["http://x.com", "http://x.com/a", "http://x.com/b", "http://x.com/c"]
Explanation: Discovered same-host pages are included even if some of them are not keys in the dictionary; they are treated as pages with no outgoing links.
Hints
- Model the website as a graph and use DFS or BFS starting from `start_url`.
- Write a helper that extracts the hostname: the part after `http://` and before the next `/` (or the end of the string).