Implement a web crawler using a provided API
Company: Anthropic
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.
Part 1: Single-Threaded Same-Host Web Crawler
Constraints
- 1 <= len(start_url) <= 200
- 0 <= sum(len(links) for links in web.values()) <= 200000
- Each URL is a valid string starting with `http://` or `https://`.
- If a reachable URL is missing from `web`, it has no outgoing links.
- Preserve the given order of links when exploring neighbors during BFS.
Examples
Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://other.com/x', 'http://news.com/b'], 'http://news.com/a': ['http://news.com/b', 'http://news.com/c'], 'http://news.com/b': ['http://news.com'], 'http://news.com/c': [], 'http://other.com/x': ['http://news.com/c']})
Expected Output: ['http://news.com', 'http://news.com/a', 'http://news.com/b', 'http://news.com/c']
Explanation: The crawler ignores `other.com`, visits same-host pages once, and follows BFS order.
Input: ('https://site.org/home', {'https://site.org/home': ['http://site.org/a', 'https://site.org/b', 'https://othersite.org/z'], 'http://site.org/a': ['https://site.org/home', 'https://site.org/c'], 'https://site.org/b': [], 'https://site.org/c': []})
Expected Output: ['https://site.org/home', 'http://site.org/a', 'https://site.org/b', 'https://site.org/c']
Explanation: Both `http://site.org/...` and `https://site.org/...` share the same hostname, so both are crawlable.
Input: ('http://solo.com', {})
Expected Output: ['http://solo.com']
Explanation: Edge case: the start page is not in the graph, so it is the only visited URL.
Input: ('http://a.com/x', {'http://a.com/x': ['http://a.com/x', 'http://a.com/y', 'http://a.com/y'], 'http://a.com/y': ['http://b.com/q']})
Expected Output: ['http://a.com/x', 'http://a.com/y']
Explanation: Self-loops and duplicate links should not cause repeated visits, and `b.com` is filtered out.
Hints
- Use a queue for BFS and a set to remember which URLs have already been seen.
- Mark a URL as visited when you enqueue it, not when you dequeue it, so duplicates are avoided early.
Part 2: Simulate a Multi-Worker Web Crawler
Constraints
- 1 <= num_workers <= 100000
- 1 <= len(start_url) <= 200
- 0 <= sum(len(links) for links in web.values()) <= 200000
- Each URL is a valid string starting with `http://` or `https://`.
- If a reachable URL is missing from `web`, it has no outgoing links.
Examples
Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://other.com/x', 'http://news.com/b'], 'http://news.com/a': ['http://news.com/b', 'http://news.com/c'], 'http://news.com/b': ['http://news.com'], 'http://news.com/c': [], 'http://other.com/x': ['http://news.com/c']}, 2)
Expected Output: 3
Explanation: Round 1 fetches the start page, round 2 fetches `a` and `b`, and round 3 fetches `c`.
Input: ('http://a.com', {'http://a.com': ['http://a.com/p1', 'http://a.com/p2'], 'http://a.com/p1': ['http://a.com/c', 'http://a.com/d'], 'http://a.com/p2': ['http://a.com/c', 'http://a.com/e', 'http://b.com/x'], 'http://a.com/c': [], 'http://a.com/d': [], 'http://a.com/e': []}, 2)
Expected Output: 4
Explanation: Both workers in round 2 discover `http://a.com/c`, but it must be enqueued only once.
Input: ('https://solo.org', {}, 5)
Expected Output: 1
Explanation: Edge case: only the start page exists, so the crawl finishes in one round.
Input: ('https://site.org/home', {'https://site.org/home': ['https://site.org/a', 'https://site.org/b', 'https://other.org/x'], 'https://site.org/a': ['https://site.org/c'], 'https://site.org/b': ['https://site.org/d'], 'https://site.org/c': [], 'https://site.org/d': []}, 1)
Expected Output: 5
Explanation: With one worker, the process becomes a serial BFS-style crawl: home, a, b, c, d.
Hints
- Use one queue for pending URLs and process it in batches of size at most `num_workers`.
- To guarantee at-most-once fetch across workers, add a URL to `visited` as soon as it is enqueued, not later.