Build a concurrent web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Implement a web crawler that starts from a given URL and visits every reachable page on the same host.
You are given:
- `start_url`: a string representing the first page to crawl.
- `fetch_links(url) -> list[str]`: a blocking function that returns all URLs found on the page.
Requirements:
1. Crawl pages concurrently using multiple threads.
2. Only visit URLs whose host is exactly the same as the host of `start_url`.
3. Visit each eligible URL at most once, even if multiple pages link to it.
4. Return all visited URLs in any order.
5. Your implementation must be thread-safe.
You may assume:
- `fetch_links` is thread-safe.
- URL strings are valid.
- The crawl graph is finite.
Discuss the main synchronization challenges, such as protecting the shared visited set and coordinating worker threads efficiently.
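One workable approach is a fixed pool of worker threads sharing a `Queue` of URLs to fetch and a lock-protected `visited` set. The sketch below is one possible shape, not the only correct answer; it assumes Python, the `fetch_links` signature given above, and an illustrative `num_workers` parameter. `Queue.join()` plus `task_done()` detects quiescence, and `None` sentinels shut the workers down.

```python
import threading
from queue import Queue
from urllib.parse import urlsplit

def crawl(start_url, fetch_links, num_workers=8):
    host = urlsplit(start_url).netloc
    visited = {start_url}          # shared state, guarded by `lock`
    lock = threading.Lock()
    work = Queue()
    work.put(start_url)

    def worker():
        while True:
            url = work.get()
            if url is None:        # sentinel: time to shut down
                work.task_done()
                return
            for link in fetch_links(url):
                if urlsplit(link).netloc != host:
                    continue       # requirement 2: same host only
                with lock:         # check-then-add must be atomic
                    if link in visited:
                        continue
                    visited.add(link)
                work.put(link)     # only the thread that claimed it enqueues it
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                    # every enqueued URL has been processed
    for _ in threads:
        work.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return list(visited)
```

The key synchronization point is the check-then-add on `visited`: done under one lock acquisition, it guarantees each URL is claimed by exactly one thread, so no page is fetched twice (requirement 3) even when two workers discover the same link simultaneously.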
Quick Answer: This question tests concurrent-programming fundamentals — thread safety, synchronization of shared state, and worker coordination — applied to a multithreaded graph traversal in the form of a web crawler.
You are given a starting URL and a directed graph of web pages. The graph is represented by a dictionary where each key is a URL and its value is the list of URLs found on that page.
Implement the core logic of a concurrent web crawler: starting from `start_url`, crawl every reachable page that belongs to the same hostname as `start_url`. Ignore links to other hostnames. Pages may contain duplicate links, the graph may contain cycles, and some discovered URLs may not appear as keys in the dictionary (treat those pages as having no outgoing links).
Because a real concurrent crawler can visit pages in any order, your function must return the final set of crawled URLs sorted in lexicographical order.
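The hostname check can be done with the standard library rather than string slicing; `urllib.parse.urlsplit` exposes the host as `netloc`. A small helper (the name `same_host` is illustrative, not part of the problem statement):

```python
from urllib.parse import urlsplit

def same_host(url: str, start_url: str) -> bool:
    """True when `url` lives on exactly the same host as `start_url`."""
    return urlsplit(url).netloc == urlsplit(start_url).netloc
```

Note that `same_host("http://news.com/world", "http://news.com")` is true while `same_host("http://sports.com", "http://news.com")` is false, which is exactly the filter the example below relies on.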
Constraints
- `start_url` is a non-empty string and represents a valid URL beginning with `http://`.
- The number of pages in `links` can be up to 10^4.
- The total number of links across all pages can be up to 10^5.
- There may be cycles, duplicate links, and URLs in link lists that are not keys in `links`.
Examples
Input:
```python
start_url = "http://news.com"
links = {
    "http://news.com": ["http://news.com/world", "http://sports.com", "http://news.com/about"],
    "http://news.com/world": ["http://news.com", "http://news.com/world/europe"],
    "http://news.com/about": [],
    "http://news.com/world/europe": [],
}
```
Expected Output: ["http://news.com", "http://news.com/about", "http://news.com/world", "http://news.com/world/europe"]
Explanation: Starting from news.com, only URLs on the hostname news.com are crawled. The sports.com link is ignored.
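One way to solve this deterministic variant while keeping real concurrency is breadth-first "waves": fetch the whole frontier in parallel with a thread pool, but merge results on the main thread, so the `visited` set is only ever mutated from one thread and needs no lock. This is a sketch under the assumptions above (Python, `links` as the dictionary, an illustrative `max_workers`):

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

def crawl(start_url, links, max_workers=8):
    host = urlsplit(start_url).netloc
    visited = {start_url}

    def fetch(url):
        # URLs missing from `links` are pages with no outgoing links
        return links.get(url, [])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # Fetch the entire frontier concurrently (one BFS wave).
            wave_results = pool.map(fetch, frontier)
            next_frontier = []
            # Merging happens on the main thread only, so the
            # check-then-add on `visited` is race-free without a lock.
            for found in wave_results:
                for link in found:
                    if urlsplit(link).netloc != host:
                        continue           # skip other hostnames
                    if link in visited:
                        continue           # duplicates and cycles
                    visited.add(link)
                    next_frontier.append(link)
            frontier = next_frontier
    return sorted(visited)                 # deterministic output order
```

On the example above, `crawl("http://news.com", links)` returns the four `news.com` URLs in lexicographical order and never enqueues `http://sports.com`. A worthwhile discussion point: this wave design trades a little parallelism (workers idle between waves) for much simpler reasoning about thread safety, whereas the shared-queue design keeps workers busy but requires a lock around the visited set.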