How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Same-Domain Web Crawl (BFS) | Anthropic Coding Question

Quick Overview

This question evaluates a candidate's ability to implement graph traversal using breadth-first search while applying a domain-matching filter and deduplication logic. It tests core coding and algorithms skills, particularly correct handling of visited sets, queue ordering, and string parsing, commonly assessed in practical software engineering interviews as a proxy for clean, methodical implementation under constraints.

Same-Domain Web Crawl (BFS)

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

# Same-Domain Web Crawl (BFS) You are given a snapshot of a small portion of the web as a directed graph: a map from each page URL to the list of URLs it links to. Starting from a seed URL, return **all pages reachable from the seed that share the seed's root domain**, discovered via breadth-first search, with duplicates removed. A real crawler fetches each page over the network and would parallelize this with `asyncio`/`aiohttp`; here the link graph is provided directly so the focus is the traversal, the de-duplication, and the same-domain filter. ## Definitions - The **host** of a URL is the substring between `://` and the next `/` (or the end of the string). Example: in `https://blog.example.com/posts/1`, the host is `blog.example.com`. - The **root domain** of a host is its last two dot-separated labels. Example: `blog.example.com` → `example.com`; `example.com` → `example.com`. (Assume simple two-label roots; you do not need to handle multi-part public suffixes like `co.uk`.) - Two URLs are **same-domain** if their hosts have the same root domain. ## Input - `seed`: a URL string. - `graph`: `Dict[str, List[str]]` mapping a URL to the URLs found on that page. A URL that appears as a link but not as a key has no outgoing links. Links may point to other domains (which must be filtered out and not traversed), and may contain duplicates. ## Output - `List[str]`: every distinct same-domain URL reachable from `seed` (including `seed` itself), in **BFS visitation order** starting from `seed`. When a page lists several new links, enqueue them in the order they appear. ## Example Input: ``` seed = "https://example.com/" graph = { "https://example.com/": ["https://example.com/a", "https://blog.example.com/x", "https://other.com/"], "https://example.com/a": ["https://example.com/", "https://example.com/b"], "https://blog.example.com/x": ["https://other.com/", "https://example.com/b"], "https://example.com/b": [] } ``` Output: ``` ["https://example.com/", "https://example.com/a", "https://blog.example.com/x", "https://example.com/b"] ``` Explanation: `https://other.com/` is dropped (different root domain). `blog.example.com` shares the root `example.com`, so it is kept. Already-seen URLs (`https://example.com/`, the second sighting of `/b`) are not revisited.

Quick Answer: This question evaluates a candidate's ability to implement graph traversal using breadth-first search while applying a domain-matching filter and deduplication logic. It tests core coding and algorithms skills, particularly correct handling of visited sets, queue ordering, and string parsing, commonly assessed in practical software engineering interviews as a proxy for clean, methodical implementation under constraints.

You are given a snapshot of a small portion of the web as a directed graph: a map from each page URL to the list of URLs it links to. Starting from a seed URL, return all pages reachable from the seed that share the seed's root domain, discovered via breadth-first search, with duplicates removed. A real crawler fetches each page over the network and would parallelize this with asyncio/aiohttp; here the link graph is provided directly so the focus is the traversal, the de-duplication, and the same-domain filter. Definitions: - The host of a URL is the substring between '://' and the next '/' (or the end of the string). Example: in 'https://blog.example.com/posts/1', the host is 'blog.example.com'. - The root domain of a host is its last two dot-separated labels. Example: 'blog.example.com' -> 'example.com'; 'example.com' -> 'example.com'. (Assume simple two-label roots; you do not need to handle multi-part public suffixes like 'co.uk'.) - Two URLs are same-domain if their hosts have the same root domain. Input: - seed: a URL string. - graph: Dict[str, List[str]] mapping a URL to the URLs found on that page. A URL that appears as a link but not as a key has no outgoing links. Links may point to other domains (which must be filtered out and not traversed), and may contain duplicates. Output: - List[str]: every distinct same-domain URL reachable from seed (including seed itself), in BFS visitation order starting from seed. When a page lists several new links, enqueue them in the order they appear. Example: seed = "https://example.com/" graph = { "https://example.com/": ["https://example.com/a", "https://blog.example.com/x", "https://other.com/"], "https://example.com/a": ["https://example.com/", "https://example.com/b"], "https://blog.example.com/x": ["https://other.com/", "https://example.com/b"], "https://example.com/b": [] } Output: ["https://example.com/", "https://example.com/a", "https://blog.example.com/x", "https://example.com/b"] Explanation: 'https://other.com/' is dropped (different root domain). 'blog.example.com' shares the root 'example.com', so it is kept. Already-seen URLs are not revisited.

Constraints

Every URL uses a 'scheme://host/path' shape; the host is the text between '://' and the next '/'.
Root domain = the last two dot-separated labels of the host (no multi-part public suffixes like 'co.uk').
graph values may contain duplicate links and links to other domains; both must be handled.
A URL that appears only as a link (never as a key) is a page with no outgoing links.
The seed is always same-domain with itself and is the first element of the output.

Examples

Input: ("https://example.com/", {"https://example.com/": ["https://example.com/a", "https://blog.example.com/x", "https://other.com/"], "https://example.com/a": ["https://example.com/", "https://example.com/b"], "https://blog.example.com/x": ["https://other.com/", "https://example.com/b"], "https://example.com/b": []})

Expected Output: ["https://example.com/", "https://example.com/a", "https://blog.example.com/x", "https://example.com/b"]

Explanation: The given example. 'other.com' is dropped (different root); 'blog.example.com' shares root 'example.com' so it is kept; the second sighting of '/b' and the back-link to seed are not revisited.

Input: ("https://foo.com/", {})

Expected Output: ["https://foo.com/"]

Explanation: Edge case: the seed is not a key in the graph, so it has no outgoing links. Only the seed itself is reachable.

Input: ("https://a.com/1", {"https://a.com/1": ["https://a.com/2", "https://a.com/2", "https://x.com/1"], "https://a.com/2": ["https://a.com/1"]})

Expected Output: ["https://a.com/1", "https://a.com/2"]

Explanation: The duplicate link to '/2' is enqueued only once, the cross-domain 'x.com' link is filtered out, and the back-link to the already-visited seed is skipped.

Input: ("https://www.site.org/", {"https://www.site.org/": ["https://api.site.org/v1", "https://cdn.site.org/img"], "https://api.site.org/v1": ["https://www.site.org/"]})

Expected Output: ["https://www.site.org/", "https://api.site.org/v1", "https://cdn.site.org/img"]

Explanation: Different subdomains (www, api, cdn) all share the root domain 'site.org', so all are crawled; 'cdn.site.org/img' has no key and thus no outgoing links.

Input: ("https://n.io/a", {"https://n.io/a": ["https://n.io/b", "https://n.io/c"], "https://n.io/b": ["https://n.io/c", "https://n.io/d"], "https://n.io/c": ["https://n.io/a"], "https://n.io/d": ["https://n.io/b"]})

Expected Output: ["https://n.io/a", "https://n.io/b", "https://n.io/c", "https://n.io/d"]

Explanation: A cyclic graph. BFS visits a, then its links b and c, then b's new link d; every cycle back-edge lands on an already-visited node, so nothing is revisited.

Hints

Use a queue for BFS and a set for visited URLs so each page is emitted exactly once, in the order it is first dequeued's parent enqueues it.
Write a small helper to extract the host: split on '://' then take everything before the first '/'.
Compute the seed's root domain once; only enqueue a link if its root domain equals the seed's and it has not been visited.
Enqueue links in the order they appear on the page so the visitation order is a true BFS.

Last updated: Jul 1, 2026

Loading coding console...