Same-Domain Web Crawl (BFS)
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
# Same-Domain Web Crawl (BFS)
You are given a snapshot of a small portion of the web as a directed graph: a map from each page URL to the list of URLs it links to. Starting from a seed URL, return **all pages reachable from the seed that share the seed's root domain**, discovered via breadth-first search, with duplicates removed.
A real crawler fetches each page over the network and would parallelize this with `asyncio`/`aiohttp`; here the link graph is provided directly so the focus is the traversal, the de-duplication, and the same-domain filter.
## Definitions
- The **host** of a URL is the substring between `://` and the next `/` (or the end of the string). Example: in `https://blog.example.com/posts/1`, the host is `blog.example.com`.
- The **root domain** of a host is its last two dot-separated labels. Example: `blog.example.com` → `example.com`; `example.com` → `example.com`. (Assume simple two-label roots; you do not need to handle multi-part public suffixes like `co.uk`.)
- Two URLs are **same-domain** if their hosts have the same root domain.
## Input
- `seed`: a URL string.
- `graph`: `Dict[str, List[str]]` mapping a URL to the URLs found on that page. A URL that appears as a link but not as a key has no outgoing links. Links may point to other domains (which must be filtered out and not traversed), and may contain duplicates.
## Output
- `List[str]`: every distinct same-domain URL reachable from `seed` (including `seed` itself), in **BFS visitation order** starting from `seed`. When a page lists several new links, enqueue them in the order they appear.
## Example
Input:
```
seed = "https://example.com/"
graph = {
"https://example.com/": ["https://example.com/a", "https://blog.example.com/x", "https://other.com/"],
"https://example.com/a": ["https://example.com/", "https://example.com/b"],
"https://blog.example.com/x": ["https://other.com/", "https://example.com/b"],
"https://example.com/b": []
}
```
Output:
```
["https://example.com/", "https://example.com/a", "https://blog.example.com/x", "https://example.com/b"]
```
Explanation: `https://other.com/` is dropped (different root domain). `blog.example.com` shares the root `example.com`, so it is kept. Already-seen URLs (`https://example.com/`, the second sighting of `/b`) are not revisited.
Quick Answer: This question evaluates a candidate's ability to implement graph traversal using breadth-first search while applying a domain-matching filter and deduplication logic. It tests core coding and algorithms skills, particularly correct handling of visited sets, queue ordering, and string parsing, commonly assessed in practical software engineering interviews as a proxy for clean, methodical implementation under constraints.
You are given a snapshot of a small portion of the web as a directed graph: a map from each page URL to the list of URLs it links to. Starting from a seed URL, return all pages reachable from the seed that share the seed's root domain, discovered via breadth-first search, with duplicates removed.
A real crawler fetches each page over the network and would parallelize this with asyncio/aiohttp; here the link graph is provided directly so the focus is the traversal, the de-duplication, and the same-domain filter.
Definitions:
- The host of a URL is the substring between '://' and the next '/' (or the end of the string). Example: in 'https://blog.example.com/posts/1', the host is 'blog.example.com'.
- The root domain of a host is its last two dot-separated labels. Example: 'blog.example.com' -> 'example.com'; 'example.com' -> 'example.com'. (Assume simple two-label roots; you do not need to handle multi-part public suffixes like 'co.uk'.)
- Two URLs are same-domain if their hosts have the same root domain.
Input:
- seed: a URL string.
- graph: Dict[str, List[str]] mapping a URL to the URLs found on that page. A URL that appears as a link but not as a key has no outgoing links. Links may point to other domains (which must be filtered out and not traversed), and may contain duplicates.
Output:
- List[str]: every distinct same-domain URL reachable from seed (including seed itself), in BFS visitation order starting from seed. When a page lists several new links, enqueue them in the order they appear.
Example:
seed = "https://example.com/"
graph = {
"https://example.com/": ["https://example.com/a", "https://blog.example.com/x", "https://other.com/"],
"https://example.com/a": ["https://example.com/", "https://example.com/b"],
"https://blog.example.com/x": ["https://other.com/", "https://example.com/b"],
"https://example.com/b": []
}
Output: ["https://example.com/", "https://example.com/a", "https://blog.example.com/x", "https://example.com/b"]
Explanation: 'https://other.com/' is dropped (different root domain). 'blog.example.com' shares the root 'example.com', so it is kept. Already-seen URLs are not revisited.
Constraints
- Every URL uses a 'scheme://host/path' shape; the host is the text between '://' and the next '/'.
- Root domain = the last two dot-separated labels of the host (no multi-part public suffixes like 'co.uk').
- graph values may contain duplicate links and links to other domains; both must be handled.
- A URL that appears only as a link (never as a key) is a page with no outgoing links.
- The seed is always same-domain with itself and is the first element of the output.
Examples
Input: ("https://example.com/", {"https://example.com/": ["https://example.com/a", "https://blog.example.com/x", "https://other.com/"], "https://example.com/a": ["https://example.com/", "https://example.com/b"], "https://blog.example.com/x": ["https://other.com/", "https://example.com/b"], "https://example.com/b": []})
Expected Output: ["https://example.com/", "https://example.com/a", "https://blog.example.com/x", "https://example.com/b"]
Explanation: The given example. 'other.com' is dropped (different root); 'blog.example.com' shares root 'example.com' so it is kept; the second sighting of '/b' and the back-link to seed are not revisited.
Input: ("https://foo.com/", {})
Expected Output: ["https://foo.com/"]
Explanation: Edge case: the seed is not a key in the graph, so it has no outgoing links. Only the seed itself is reachable.
Input: ("https://a.com/1", {"https://a.com/1": ["https://a.com/2", "https://a.com/2", "https://x.com/1"], "https://a.com/2": ["https://a.com/1"]})
Expected Output: ["https://a.com/1", "https://a.com/2"]
Explanation: The duplicate link to '/2' is enqueued only once, the cross-domain 'x.com' link is filtered out, and the back-link to the already-visited seed is skipped.
Input: ("https://www.site.org/", {"https://www.site.org/": ["https://api.site.org/v1", "https://cdn.site.org/img"], "https://api.site.org/v1": ["https://www.site.org/"]})
Expected Output: ["https://www.site.org/", "https://api.site.org/v1", "https://cdn.site.org/img"]
Explanation: Different subdomains (www, api, cdn) all share the root domain 'site.org', so all are crawled; 'cdn.site.org/img' has no key and thus no outgoing links.
Input: ("https://n.io/a", {"https://n.io/a": ["https://n.io/b", "https://n.io/c"], "https://n.io/b": ["https://n.io/c", "https://n.io/d"], "https://n.io/c": ["https://n.io/a"], "https://n.io/d": ["https://n.io/b"]})
Expected Output: ["https://n.io/a", "https://n.io/b", "https://n.io/c", "https://n.io/d"]
Explanation: A cyclic graph. BFS visits a, then its links b and c, then b's new link d; every cycle back-edge lands on an already-visited node, so nothing is revisited.
Hints
- Use a queue for BFS and a set for visited URLs so each page is emitted exactly once, in the order it is first dequeued's parent enqueues it.
- Write a small helper to extract the host: split on '://' then take everything before the first '/'.
- Compute the seed's root domain once; only enqueue a link if its root domain equals the seed's and it has not been visited.
- Enqueue links in the order they appear on the page so the visitation order is a true BFS.