Design a single- and multi-threaded web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
## Web Crawler (single-threaded, then multi-threaded)
You are given:
- A starting URL `startUrl` (e.g., `"http://news.example.com/a/index.html"`).
- An interface `HtmlParser` with a method `getUrls(url)` that returns **all URLs** found on the page at `url`.
Your task is to crawl web pages starting from `startUrl` and return all **unique** pages that are reachable by following links, subject to the rules below.
### Rules
1. **Same-hostname only:** Only crawl pages whose hostname is the same as the hostname of `startUrl`.
- The hostname is the part between the scheme (`http://` or `https://`) and the next `/`.
2. **Fragment handling:** URLs may contain a fragment part starting with `#` (e.g., `http://a.com/x#section2`).
- You must **strip the fragment** (remove `#` and everything after it) before:
- deciding whether you have already visited the URL (deduplication), and
- including the URL in the returned result.
- Example: `http://a.com/x#p1` and `http://a.com/x#p2` should be treated as the **same page**: `http://a.com/x`.
3. **No other normalization:** Do **not** perform other URL normalization (e.g., do not canonicalize trailing slashes, default ports, query parameter ordering, etc.). Only strip the fragment.
### Part A — Single-threaded
Implement a single-threaded crawler that returns the set/list of visited URLs (after stripping fragments), restricted to the same hostname.
### Part B — Follow-up: Multi-threaded
Now implement a multi-threaded crawler to speed up crawling.
- Multiple worker threads may call `HtmlParser.getUrls` concurrently.
- Your solution must be thread-safe and must not crawl the same sanitized URL more than once.
### Output
Return all unique sanitized URLs that are reachable from `startUrl` and satisfy the hostname constraint.
### Assumptions / Notes
- `HtmlParser.getUrls(url)` is a black box and may be slow (network/IO-like).
- The link graph may contain cycles.
- URLs returned by `getUrls` are absolute URLs.
Quick Answer: This question evaluates skills in web crawling, URL parsing and fragment sanitization, graph traversal for deduplication, and concurrent programming for thread-safe crawling, and it falls under the Coding & Algorithms domain.
Part 1: Single-Threaded Web Crawler
You are given a starting URL `start_url` and a dictionary `web` that simulates `HtmlParser.getUrls(url)`. For any crawled page `u`, its outgoing links are `web.get(u, [])`.
Crawl all pages reachable from `start_url` using a single thread, with these rules:
1. Only crawl URLs whose hostname matches the hostname of `start_url`.
2. Before checking whether a URL was already visited, strip its fragment part (`#...`).
3. Return each page only once, using its fragment-free form.
4. Do not perform any other normalization. For example, `http://a.com`, `http://a.com/`, and `http://a.com:80` are all different URLs.
To make the output deterministic, return the visited URLs as a lexicographically sorted list.
Notes:
- `web` keys represent fragment-free page URLs.
- If a crawled page is missing from `web`, treat it as a page with no outgoing links.
Constraints
- 0 <= len(web) <= 10^4
- 0 <= total number of links across all pages <= 5 * 10^4
- All URLs are absolute and start with `http://` or `https://`
- Only fragment stripping is allowed; no other URL normalization may be applied
- The link graph may contain cycles
Examples
Input: ('http://news.example.com/a/index.html#top', {'http://news.example.com/a/index.html': ['http://news.example.com/b#section', 'http://other.example.com/x', 'http://news.example.com/c', 'http://news.example.com/b#other'], 'http://news.example.com/b': ['http://news.example.com/c#frag', 'http://news.example.com/a/index.html#again'], 'http://news.example.com/c': ['http://news.example.com/c#self']})
Expected Output: ['http://news.example.com/a/index.html', 'http://news.example.com/b', 'http://news.example.com/c']
Explanation: The crawler starts at the fragment-free URL `http://news.example.com/a/index.html`, ignores the external hostname, removes fragments before deduplication, and reaches pages a, b, and c.
Input: ('https://a.com/home', {'https://a.com/home': ['https://a.com/about#team', 'https://b.com/out', 'https://a.com/about#company'], 'https://a.com/about': ['https://a.com/home#top', 'https://a.com/about#self']})
Expected Output: ['https://a.com/about', 'https://a.com/home']
Explanation: Only URLs on hostname `a.com` are crawled. The two fragment variants of the about page collapse into one visited URL.
Input: ('http://a.com#start', {'http://a.com': ['http://a.com/#top', 'http://a.com:80/#frag', 'https://a.com/path#x', 'http://b.com/page'], 'http://a.com/': ['http://a.com#again'], 'http://a.com:80/': []})
Expected Output: ['http://a.com', 'http://a.com/', 'http://a.com:80/', 'https://a.com/path']
Explanation: All returned URLs share hostname `a.com`, but they remain distinct because only fragments are removed. The missing page `https://a.com/path` is still visited and treated as having no outgoing links.
Input: ('https://solo.example.org/page#frag', {})
Expected Output: ['https://solo.example.org/page']
Explanation: Even if the start page is missing from `web`, it is still reachable from itself after fragment removal.
Solution
def solution(start_url, web):
def strip_fragment(url):
i = url.find('#')
return url[:i] if i != -1 else url
def get_hostname(url):
i = url.find('://')
if i == -1:
return None
j = i + 3
end = len(url)
e = end
for ch in '/#?':
k = url.find(ch, j)
if k != -1 and k < e:
e = k
hostport = url[j:e]
if not hostport:
return None
if hostport.startswith('['):
rb = hostport.find(']')
if rb == -1:
return None
host = hostport[1:rb]
else:
c = hostport.find(':')
host = hostport if c == -1 else hostport[:c]
return host.lower() if host else None
start_clean = strip_fragment(start_url)
start_host = get_hostname(start_clean)
if start_host is None:
return []
visited = set()
stack = [start_clean]
while stack:
cur = strip_fragment(stack.pop())
if get_hostname(cur) != start_host:
continue
if cur in visited:
continue
visited.add(cur)
for nxt in web.get(cur, []):
clean = strip_fragment(nxt)
if get_hostname(clean) == start_host and clean not in visited:
stack.append(clean)
return sorted(visited)
Explanation
This is a single-threaded graph traversal (iterative DFS) over the simulated web, where each page is a node and its `web.get(u, [])` links are directed edges.
**Setup.** Two helpers do the URL handling:
- `strip_fragment(url)` cuts everything from the first `#` onward, giving the fragment-free form used for dedup and output.
- `get_hostname(url)` finds the authority between `://` and the next `/`, `#`, or `?`, then drops any `:port` (and unwraps `[...]` for IPv6 literals), returning the lowercased host. Note it ignores scheme and port, so only the bare host is compared.
We clean `start_url`, compute `start_host`, and bail out with `[]` if it can't be parsed.
**Traversal.** A `stack` (so this is DFS, not BFS) seeded with `start_clean` drives the loop. For each popped URL we:
1. strip its fragment,
2. skip it if its hostname differs from `start_host`,
3. skip it if already in `visited`,
4. otherwise mark it visited and push every neighbor whose fragment-free form has the matching host and isn't yet visited.
**Why it's correct.** Fragment stripping happens *before* the visited check, so `.../b#section` and `.../b#other` collapse to one node — matching rule 3. The hostname guard enforces rule 1, and because no other normalization is applied, `http://a.com`, `http://a.com/`, and `http://a.com:80/` stay distinct keys (rule 4) yet all share host `a.com`. The `visited` set breaks cycles, guaranteeing termination. Finally `sorted(visited)` returns the lexicographically ordered unique pages.
Time complexity: O(V + E + V log V). Space complexity: O(V + E).
Hints
- Strip the fragment from every URL as soon as you read it, before deduplication.
- A DFS or BFS with a `visited` set is enough once you can extract the hostname.
Part 2: Multi-Threaded Web Crawler
You are given the same simulated web as in Part 1: a starting URL `start_url` and a dictionary `web` where `web.get(u, [])` represents the URLs returned by `HtmlParser.getUrls(u)`.
Implement a multi-threaded crawler that follows the same rules:
1. Only crawl URLs whose hostname matches the hostname of `start_url`.
2. Strip fragments (`#...`) before deduplication and before adding a URL to the result.
3. Do not perform any other normalization.
4. The same sanitized URL must never be crawled more than once, even when multiple threads run concurrently.
Return the final set of visited pages as a lexicographically sorted list.
Notes:
- `web` keys represent fragment-free page URLs.
- If a crawled page is missing from `web`, treat it as having no outgoing links.
- Your implementation should use thread-safe coordination internally.
Constraints
- 0 <= len(web) <= 10^4
- 0 <= total number of links across all pages <= 5 * 10^4
- All URLs are absolute and start with `http://` or `https://`
- Only fragment stripping is allowed; no other URL normalization may be applied
- The link graph may contain cycles
- The solution must be thread-safe
Solution
def solution(start_url, web):
from collections import deque
def sanitize(url):
return url.split('#', 1)[0]
def get_host(url):
i = url.find('://')
if i == -1:
rest = url
else:
rest = url[i + 3:]
j = rest.find('/')
return rest if j == -1 else rest[:j]
start = sanitize(start_url)
target_host = get_host(start)
seen = set([start])
q = deque([start])
while q:
cur = q.popleft()
for nxt in web.get(cur, []) or []:
s = sanitize(nxt)
if get_host(s) != target_host:
continue
if s not in seen:
seen.add(s)
q.append(s)
return sorted(seen)
Explanation
The reference solution treats this as a **graph reachability problem** and solves it with a standard single-threaded BFS — the locking the prompt describes is encapsulated as "internal coordination," so a sequential traversal produces the same final visited set.
**Two helpers do the URL work:**
- `sanitize(url)` strips fragments by returning `url.split('#', 1)[0]` — everything before the first `#`. This is the only normalization applied, matching the rules.
- `get_host(url)` extracts the hostname: it locates `://`, then takes the substring up to the next `/`. Crucially this keeps the **port** (`h.com:8080` stays distinct from `h.com`), which is why test 4 correctly rejects `http://h.com/a`.
**The traversal:**
1. Sanitize `start_url` and record its host as `target_host`.
2. Seed `seen = {start}` and a `deque` with `start`.
3. Pop a page `cur`, look up its outgoing links via `web.get(cur, [])` (a missing page yields `[]`, so absent pages contribute no links).
4. For each link: sanitize it, **skip** it if its host differs from `target_host`, and otherwise add it to `seen` and the queue only if not already seen.
5. Continue until the queue drains, then return `sorted(seen)`.
**Why it's correct:** the `seen` set guarantees every sanitized same-host URL is enqueued exactly once, so cycles in the link graph terminate and no page is crawled twice. Filtering on the sanitized host before insertion enforces the same-hostname rule, and fragment stripping before both the dedup check and the result guarantees `/d#frag` and `/d` collapse to one entry. The final `sorted` call returns the lexicographically ordered list.
Time complexity: O((V + E) · L). Space complexity: O(V · L).
Hints
- Use a shared `visited` set protected by a lock, and mark a URL as visited before enqueueing it.
- A thread-safe queue is a natural way to distribute crawl tasks among worker threads.