How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a hard difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Crawl Same-Domain Links | Anthropic Coding Question

Crawl Same-Domain Links

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

Implement a Python function that crawls a website starting from a seed URL and returns all unique pages reachable within the same domain. Assume a helper function `get_links(url) -> list[str]` is available and returns all hyperlinks found on that page. Requirements: - Start from a single input URL. - Only follow links whose hostname matches the hostname of the seed URL. - Do not visit the same URL more than once. - Handle cycles safely. - Continue until there are no more unseen same-domain links. You may choose either BFS or DFS traversal. Be prepared to discuss how you would normalize URLs to reduce duplicate visits.

Quick Answer: This question evaluates understanding of web crawling and link traversal algorithms, including URL normalization, same-domain filtering, deduplication, and cycle-safe graph traversal.

In a real crawler, you might be given a helper function `get_links(url)` that returns all hyperlinks found on a page. For this coding problem, that helper is simulated by a dictionary `web`, where each key is a URL and each value is the list of hyperlinks on that page. Implement `solution(start_url, web)` to crawl the site starting from `start_url` and return every unique page reachable within the same domain. Rules: - Start from exactly one seed URL. - Only follow links whose hostname exactly matches the hostname of the seed URL. - Do not visit the same page more than once. - Handle cycles safely. - Continue until there are no more unseen same-domain links. - If a reachable URL is not present as a key in `web`, treat it as a page with no outgoing links. To make the result deterministic and to reduce duplicate visits, normalize every URL before comparing or storing it: - lowercase the scheme and hostname - remove the fragment part after `#` - if the path is empty, treat it as `/` Return the visited normalized URLs in lexicographic order. Note: subdomains do not count as the same domain. For example, `example.com` and `blog.example.com` are different hostnames.

Constraints

0 <= len(web) <= 10^4
The total number of hyperlinks across all pages is at most 2 * 10^5
All URLs are absolute HTTP/HTTPS-style URLs
Same-domain means exact hostname match after normalization

Examples

Input: ('https://example.com', {'https://example.com': ['https://example.com/about', 'https://other.com/', 'https://example.com/blog'], 'https://example.com/about': ['https://example.com', 'https://example.com/contact'], 'https://example.com/blog': ['https://example.com/about', 'https://blog.example.com/post'], 'https://example.com/contact': []})

Expected Output: ['https://example.com/', 'https://example.com/about', 'https://example.com/blog', 'https://example.com/contact']

Explanation: The crawler stays on hostname `example.com`, ignores `other.com` and `blog.example.com`, and safely handles the cycle back to the home page.

Input: ('HTTPS://Example.com#top', {'https://example.com/': ['https://example.com/docs#section1', 'https://example.com/docs', 'https://example.com/faq'], 'https://example.com/docs': ['https://example.com/#footer'], 'https://example.com/faq': ['https://example.com/docs#section2']})

Expected Output: ['https://example.com/', 'https://example.com/docs', 'https://example.com/faq']

Explanation: The start URL is normalized to `https://example.com/`. Fragments like `#top` and `#section1` are removed, so duplicate visits are avoided.

Input: ('https://site.com/home', {'https://site.com/home': ['https://site.com/a', 'https://news.site.com/a', 'https://site.com/b'], 'https://site.com/a': ['https://site.com/missing', 'https://site.com/home'], 'https://site.com/b': []})

Expected Output: ['https://site.com/a', 'https://site.com/b', 'https://site.com/home', 'https://site.com/missing']

Explanation: The crawler ignores the subdomain `news.site.com`. It still includes `https://site.com/missing` because it is reachable, even though it has no entry in `web`.

Input: ('https://alone.com', {})

Expected Output: ['https://alone.com/']

Explanation: Even when the seed page is not present in `web`, it is still a reachable page and should be returned after normalization.

Input: ('https://dup.com', {'https://dup.com': ['https://dup.com/page#one'], 'https://dup.com#frag': ['https://dup.com/page'], 'https://dup.com/page': ['https://dup.com/#top', 'https://other.com/page']})

Expected Output: ['https://dup.com/', 'https://dup.com/page']

Explanation: Both `https://dup.com` and `https://dup.com#frag` normalize to the same page, and fragment-only differences do not cause duplicate visits.

Hints

Model the website as a graph: pages are nodes and hyperlinks are directed edges. A stack or queue plus a visited set is enough.
Use URL parsing to extract the hostname and to normalize each URL before storing it, so fragments and case differences do not create duplicate visits.

Quick Overview

This question evaluates understanding of web crawling and link traversal algorithms, including URL normalization, same-domain filtering, deduplication, and cycle-safe graph traversal.