Design a single- and multi-threaded web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates skills in web crawling, URL parsing and fragment sanitization, graph traversal for deduplication, and concurrent programming for thread-safe crawling, and it falls under the Coding & Algorithms domain.
Part 1: Single-Threaded Web Crawler
Constraints
- 0 <= len(web) <= 10^4
- 0 <= total number of links across all pages <= 5 * 10^4
- All URLs are absolute and start with `http://` or `https://`
- Only fragment stripping is allowed; no other URL normalization may be applied
- The link graph may contain cycles
Examples
Input: ('http://news.example.com/a/index.html#top', {'http://news.example.com/a/index.html': ['http://news.example.com/b#section', 'http://other.example.com/x', 'http://news.example.com/c', 'http://news.example.com/b#other'], 'http://news.example.com/b': ['http://news.example.com/c#frag', 'http://news.example.com/a/index.html#again'], 'http://news.example.com/c': ['http://news.example.com/c#self']})
Expected Output: ['http://news.example.com/a/index.html', 'http://news.example.com/b', 'http://news.example.com/c']
Explanation: The crawler starts at the fragment-free URL `http://news.example.com/a/index.html`, ignores the external hostname, removes fragments before deduplication, and reaches pages a, b, and c.
Input: ('https://a.com/home', {'https://a.com/home': ['https://a.com/about#team', 'https://b.com/out', 'https://a.com/about#company'], 'https://a.com/about': ['https://a.com/home#top', 'https://a.com/about#self']})
Expected Output: ['https://a.com/about', 'https://a.com/home']
Explanation: Only URLs on hostname `a.com` are crawled. The two fragment variants of the about page collapse into one visited URL.
Input: ('http://a.com#start', {'http://a.com': ['http://a.com/#top', 'http://a.com:80/#frag', 'https://a.com/path#x', 'http://b.com/page'], 'http://a.com/': ['http://a.com#again'], 'http://a.com:80/': []})
Expected Output: ['http://a.com', 'http://a.com/', 'http://a.com:80/', 'https://a.com/path']
Explanation: All returned URLs share hostname `a.com`, but they remain distinct because only fragments are removed. The missing page `https://a.com/path` is still visited and treated as having no outgoing links.
Input: ('https://solo.example.org/page#frag', {})
Expected Output: ['https://solo.example.org/page']
Explanation: Even if the start page is missing from `web`, it is still reachable from itself after fragment removal.
Solution
def solution(start_url, web):
from urllib.parse import urlsplit
def strip_fragment(url):
return url.split('#', 1)[0]
def hostname(url):
try:
return urlsplit(url).hostname
except Exception:
return None
start_clean = strip_fragment(start_url)
start_host = hostname(start_clean)
if start_host is None:
return []
visited = set()
stack = [start_clean]
while stack:
url = strip_fragment(stack.pop())
if url in visited:
continue
if hostname(url) != start_host:
continue
visited.add(url)
for next_url in web.get(url, []):
clean = strip_fragment(next_url)
if hostname(clean) == start_host and clean not in visited:
stack.append(clean)
return sorted(visited)Time complexity: O(V + E + V log V). Space complexity: O(V).
Hints
- Strip the fragment from every URL as soon as you read it, before deduplication.
- A DFS or BFS with a `visited` set is enough once you can extract the hostname.
Part 2: Multi-Threaded Web Crawler
Constraints
- 0 <= len(web) <= 10^4
- 0 <= total number of links across all pages <= 5 * 10^4
- All URLs are absolute and start with `http://` or `https://`
- Only fragment stripping is allowed; no other URL normalization may be applied
- The link graph may contain cycles
- The solution must be thread-safe
Solution
def solution(start_url, web):
from queue import Queue
from threading import Lock, Thread
from urllib.parse import urlsplit
def sanitize(url):
return url.split('#', 1)[0]
def get_host(url):
return urlsplit(url).netloc
start = sanitize(start_url)
target_host = get_host(start)
seen = {start}
seen_lock = Lock()
tasks = Queue()
tasks.put(start)
def worker():
while True:
current = tasks.get()
if current is None:
tasks.task_done()
return
for nxt in web.get(current, []):
nxt = sanitize(nxt)
if get_host(nxt) != target_host:
continue
should_enqueue = False
with seen_lock:
if nxt not in seen:
seen.add(nxt)
should_enqueue = True
if should_enqueue:
tasks.put(nxt)
tasks.task_done()
worker_count = 4
threads = [Thread(target=worker) for _ in range(worker_count)]
for t in threads:
t.start()
tasks.join()
for _ in threads:
tasks.put(None)
tasks.join()
for t in threads:
t.join()
return sorted(seen)Time complexity: O(V + E) total work, where V is the number of reachable same-host pages after fragment stripping and E is the number of examined links. Wall-clock time can improve with multiple threads when link fetching is slow.. Space complexity: O(V + W), where W is the number of worker threads; dominated by the visited set and task queue..
Hints
- Use a shared `visited` set protected by a lock, and mark a URL as visited before enqueueing it.
- A thread-safe queue is a natural way to distribute crawl tasks among worker threads.