Implement a same-host web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's competency in graph traversal and state management, URL parsing and validation, duplicate detection and cycle handling, and concurrent coordination with per-host rate limiting.
Part 1: Crawl All Reachable URLs on the Same Hostname
Constraints
- 0 <= len(link_map) <= 10^4
- Each adjacency list may contain duplicate URLs, cycles, malformed URLs, and cross-host links.
- A valid URL must contain both a scheme and a hostname.
- Only exact hostname matches count; compare by parsed hostname, not by string prefix.
- You do not need to normalize paths, fragments, or query parameters.
Examples
Input: ('http://news.com', {'http://news.com': ['http://news.com/about', 'http://news.com/about', 'http://other.com/x'], 'http://news.com/about': ['http://news.com', 'http://news.com/contact'], 'http://news.com/contact': []})
Expected Output: ['http://news.com', 'http://news.com/about', 'http://news.com/contact']
Explanation: The crawler stays on news.com, ignores the duplicate about link, ignores other.com, and handles the cycle back to the start page.
Input: ('https://site.com', {'https://site.com': ['not a url', '/relative', 'https://site.com/page'], 'https://site.com/page': ['https://site.com', 'mailto:test@example.com', 'https://site.com/help'], 'https://site.com/help': []})
Expected Output: ['https://site.com', 'https://site.com/help', 'https://site.com/page']
Explanation: Malformed and relative links are ignored. Only valid same-host URLs are returned.
Input: ('not a url', {'not a url': ['http://a.com']})
Expected Output: []
Explanation: Edge case: the start URL itself is malformed, so nothing can be crawled.
Input: ('http://solo.com', {})
Expected Output: ['http://solo.com']
Explanation: Edge case: the start URL is valid but has no outgoing links and is missing from the map.
Solution
def solution(start_url, link_map):
from urllib.parse import urlparse
def host_of(url):
try:
parsed = urlparse(url)
except Exception:
return None
if not parsed.scheme or not parsed.hostname:
return None
return parsed.hostname
start_host = host_of(start_url)
if start_host is None:
return []
seen = {start_url}
stack = [start_url]
while stack:
url = stack.pop()
for nxt in link_map.get(url, []):
if nxt in seen:
continue
if host_of(nxt) != start_host:
continue
seen.add(nxt)
stack.append(nxt)
return sorted(seen)
Time complexity: O(V + E + V log V), where V is the number of reachable same-host valid URLs and E is the number of outgoing links examined.. Space complexity: O(V).
Hints
- Use a visited set so cycles and duplicate links do not cause repeated work.
- Parse the hostname from each URL; checking whether a string starts with the same prefix is not reliable.
Part 2: Simulate a Concurrent Web Crawler with Per-Host Rate Limits
Constraints
- 1 <= workers <= 50
- 0 <= rate_limit <= 10^4
- 0 <= total number of URLs discovered/fetched <= 10^4
- The sum of all link-list lengths can be up to 2 * 10^5.
- A valid URL must contain both a scheme and a hostname.
- Treat two URLs as the same page only if their full strings are exactly equal.
Examples
Input: (['http://a.com', 'http://b.com'], {'http://a.com': ['http://a.com/1', 'http://b.com/1'], 'http://b.com': ['http://b.com/2', 'http://a.com/1'], 'http://a.com/1': [], 'http://b.com/1': [], 'http://b.com/2': []}, 2, 2)
Expected Output: (['http://a.com', 'http://a.com/1', 'http://b.com', 'http://b.com/1', 'http://b.com/2'], [(0, 0, 'http://a.com'), (0, 1, 'http://b.com'), (2, 0, 'http://a.com/1'), (2, 1, 'http://b.com/1'), (4, 0, 'http://b.com/2')])
Explanation: The two seed URLs start in parallel at time 0. Each host must wait 2 time units before another fetch on the same host can start.
Input: (['http://x.com'], {'http://x.com': ['http://x.com/a', 'http://x.com/b', 'http://x.com/a'], 'http://x.com/a': [], 'http://x.com/b': []}, 2, 0)
Expected Output: (['http://x.com', 'http://x.com/a', 'http://x.com/b'], [(0, 0, 'http://x.com'), (1, 0, 'http://x.com/a'), (1, 1, 'http://x.com/b')])
Explanation: With rate_limit = 0, the same host can start multiple fetches at the same time once the links are discovered. Duplicate discovery of /a is ignored.
Input: (['not a url', 'https://good.com'], {'https://good.com': ['/relative', 'https://good.com/p'], 'https://good.com/p': ['bad://']}, 1, 1)
Expected Output: (['https://good.com', 'https://good.com/p'], [(0, 0, 'https://good.com'), (1, 0, 'https://good.com/p')])
Explanation: Malformed seeds and malformed discovered links are ignored. The valid page on good.com is fetched, then its child is fetched one time unit later.
Input: ([], {}, 3, 5)
Expected Output: ([], [])
Explanation: Edge case: no seed URLs means there is nothing to crawl.
Solution
def solution(start_urls, link_map, workers, rate_limit):
from collections import defaultdict
from urllib.parse import urlparse
import heapq
def host_of(url):
try:
parsed = urlparse(url)
except Exception:
return None
if not parsed.scheme or not parsed.hostname:
return None
return parsed.hostname
if workers <= 0 or rate_limit < 0:
return ([], [])
waiting = defaultdict(list) # host -> min-heap of discovered, not-yet-started URLs
seen = set() # URLs already discovered
next_allowed = defaultdict(int) # host -> earliest next start time
advert_version = defaultdict(int) # invalidates stale ready-heap entries
ready = [] # (url, host, version)
cooldown = [] # (time, host)
running = [] # (finish_time, worker_id, url)
free_workers = list(range(workers))
heapq.heapify(free_workers)
fetched = []
schedule = []
current_time = 0
def discover(url):
nonlocal current_time
host = host_of(url)
if host is None or url in seen:
return
seen.add(url)
before_top = waiting[host][0] if waiting[host] else None
heapq.heappush(waiting[host], url)
after_top = waiting[host][0]
if next_allowed[host] <= current_time:
if before_top is None or after_top != before_top:
advert_version[host] += 1
heapq.heappush(ready, (after_top, host, advert_version[host]))
else:
heapq.heappush(cooldown, (next_allowed[host], host))
for url in start_urls:
discover(url)
while True:
while running and running[0][0] <= current_time:
_, worker_id, url = heapq.heappop(running)
heapq.heappush(free_workers, worker_id)
for nxt in link_map.get(url, []):
discover(nxt)
while cooldown and cooldown[0][0] <= current_time:
_, host = heapq.heappop(cooldown)
if waiting[host] and next_allowed[host] <= current_time:
advert_version[host] += 1
heapq.heappush(ready, (waiting[host][0], host, advert_version[host]))
while free_workers:
candidate = None
while ready:
url, host, version = heapq.heappop(ready)
if version != advert_version[host]:
continue
if not waiting[host] or waiting[host][0] != url:
continue
if next_allowed[host] > current_time:
heapq.heappush(cooldown, (next_allowed[host], host))
continue
candidate = (url, host)
break
if candidate is None:
break
url, host = candidate
worker_id = heapq.heappop(free_workers)
heapq.heappop(waiting[host])
schedule.append((current_time, worker_id, url))
fetched.append(url)
heapq.heappush(running, (current_time + 1, worker_id, url))
next_allowed[host] = current_time + rate_limit
advert_version[host] += 1
if waiting[host]:
if next_allowed[host] <= current_time:
advert_version[host] += 1
heapq.heappush(ready, (waiting[host][0], host, advert_version[host]))
else:
heapq.heappush(cooldown, (next_allowed[host], host))
if not running and not cooldown:
break
next_time = float('inf')
if running:
next_time = min(next_time, running[0][0])
if cooldown:
next_time = min(next_time, cooldown[0][0])
if next_time == float('inf'):
break
current_time = next_time
return (sorted(fetched), schedule)
Time complexity: O((V + E) log V) in the usual heap-based model, where V is the number of valid unique URLs discovered and E is the total number of links processed.. Space complexity: O(V).
Hints
- Think in terms of events: fetch completions and host cooldown expirations both change what can run next.
- Use heaps to manage the next free worker, the next completion time, and the lexicographically smallest eligible URL.