Implement hostname-restricted web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates skills in graph traversal and cycle detection, hostname-based filtering, URL parsing, and basic network error handling. Commonly asked in Coding & Algorithms interviews, it assesses algorithmic reasoning about traversal strategy and complexity (time and space) analysis while balancing conceptual understanding with practical implementation details.
Constraints
- 0 <= len(web) <= 10^4
- The total number of hyperlinks across all lists in web is at most 2 * 10^5
- All URLs in test cases are absolute URL strings, but some discovered URLs may be missing from web to simulate request failures
- The crawler must be single-threaded and must not fetch any URL whose hostname differs from the hostname of startUrl
Examples
Input: ('http://a.com', {'http://a.com': ['http://a.com/about', 'http://b.com/out'], 'http://a.com/about': [], 'http://b.com/out': ['http://a.com/hidden'], 'http://a.com/hidden': []})
Expected Output: ['http://a.com', 'http://a.com/about']
Explanation: The crawler visits the start page and http://a.com/about. It must not fetch http://b.com/out because the hostname differs, so http://a.com/hidden is never discovered.
Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://news.com/b', 'http://news.com/a'], 'http://news.com/a': ['http://news.com', 'http://news.com/b'], 'http://news.com/b': ['http://news.com/c'], 'http://news.com/c': ['http://news.com/a']})
Expected Output: ['http://news.com', 'http://news.com/a', 'http://news.com/b', 'http://news.com/c']
Explanation: There are duplicate links and a cycle, but each same-host URL is fetched at most once because of the visited set.
Input: ('http://site.com', {'http://site.com': ['http://site.com/about', 'http://site.com/missing', 'http://other.com'], 'http://site.com/about': ['http://site.com/contact'], 'http://site.com/contact': []})
Expected Output: ['http://site.com', 'http://site.com/about', 'http://site.com/contact', 'http://site.com/missing']
Explanation: http://site.com/missing is discovered on the same hostname, so it belongs in the result. Its fetch fails because it is not in web, so it is not expanded. The off-host URL is ignored completely.
Input: ('http://broken.com/home', {})
Expected Output: ['http://broken.com/home']
Explanation: This is an edge case where the first fetch fails immediately. The start URL is still included because it is the initial reachable page, but no further URLs can be discovered.
Input: ('https://shop.com:8080', {'https://shop.com:8080': ['http://shop.com/a', 'https://shop.com:9090/b', 'https://shop.com.au/x'], 'http://shop.com/a': [], 'https://shop.com:9090/b': []})
Expected Output: ['http://shop.com/a', 'https://shop.com:8080', 'https://shop.com:9090/b']
Explanation: The hostname is shop.com for the start page and for both same-host links, even though their schemes and ports differ. shop.com.au is a different hostname and must be ignored.
Solution
def solution(startUrl, web):
from collections import deque
from urllib.parse import urlparse
# BFS is used here because an iterative queue keeps the crawler single-threaded
# and avoids recursion-depth issues that a recursive DFS could hit on deep link chains.
def get_hostname(url):
try:
return urlparse(url).hostname
except Exception:
return None
start_host = get_hostname(startUrl)
if not start_host:
return []
visited = {startUrl}
queue = deque([startUrl])
while queue:
current = queue.popleft()
# Basic error handling for failed or malformed fetch results:
# if the page cannot be fetched (missing key) or does not return a list,
# keep the URL in the result but do not expand it.
try:
links = web[current]
except Exception:
continue
if not isinstance(links, list):
continue
for next_url in links:
if get_hostname(next_url) != start_host:
continue
if next_url in visited:
continue
visited.add(next_url)
queue.append(next_url)
return sorted(visited)
Time complexity: O(V + E), where V is the number of unique same-host URLs discovered and E is the total number of links examined from successfully fetched same-host pages. Space complexity: O(V).
Hints
- Use a queue and a visited set. The queue gives you an iterative BFS, and the visited set prevents duplicate fetches and infinite loops.
- Compare only the parsed hostname component, not the whole URL string. Two URLs with different schemes or ports can still belong to the same hostname.