PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of web crawling and link traversal algorithms, including URL normalization, same-domain filtering, deduplication, and cycle-safe graph traversal.

  • hard
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Crawl Same-Domain Links

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

Implement a Python function that crawls a website starting from a seed URL and returns all unique pages reachable within the same domain. Assume a helper function `get_links(url) -> list[str]` is available and returns all hyperlinks found on that page. Requirements: - Start from a single input URL. - Only follow links whose hostname matches the hostname of the seed URL. - Do not visit the same URL more than once. - Handle cycles safely. - Continue until there are no more unseen same-domain links. You may choose either BFS or DFS traversal. Be prepared to discuss how you would normalize URLs to reduce duplicate visits.

Quick Answer: This question evaluates understanding of web crawling and link traversal algorithms, including URL normalization, same-domain filtering, deduplication, and cycle-safe graph traversal.

In a real crawler, you might be given a helper function `get_links(url)` that returns all hyperlinks found on a page. For this coding problem, that helper is simulated by a dictionary `web`, where each key is a URL and each value is the list of hyperlinks on that page. Implement `solution(start_url, web)` to crawl the site starting from `start_url` and return every unique page reachable within the same domain. Rules: - Start from exactly one seed URL. - Only follow links whose hostname exactly matches the hostname of the seed URL. - Do not visit the same page more than once. - Handle cycles safely. - Continue until there are no more unseen same-domain links. - If a reachable URL is not present as a key in `web`, treat it as a page with no outgoing links. To make the result deterministic and to reduce duplicate visits, normalize every URL before comparing or storing it: - lowercase the scheme and hostname - remove the fragment part after `#` - if the path is empty, treat it as `/` Return the visited normalized URLs in lexicographic order. Note: subdomains do not count as the same domain. For example, `example.com` and `blog.example.com` are different hostnames.

Constraints

  • 0 <= len(web) <= 10^4
  • The total number of hyperlinks across all pages is at most 2 * 10^5
  • All URLs are absolute HTTP/HTTPS-style URLs
  • Same-domain means exact hostname match after normalization

Examples

Input: ('https://example.com', {'https://example.com': ['https://example.com/about', 'https://other.com/', 'https://example.com/blog'], 'https://example.com/about': ['https://example.com', 'https://example.com/contact'], 'https://example.com/blog': ['https://example.com/about', 'https://blog.example.com/post'], 'https://example.com/contact': []})

Expected Output: ['https://example.com/', 'https://example.com/about', 'https://example.com/blog', 'https://example.com/contact']

Explanation: The crawler stays on hostname `example.com`, ignores `other.com` and `blog.example.com`, and safely handles the cycle back to the home page.

Input: ('HTTPS://Example.com#top', {'https://example.com/': ['https://example.com/docs#section1', 'https://example.com/docs', 'https://example.com/faq'], 'https://example.com/docs': ['https://example.com/#footer'], 'https://example.com/faq': ['https://example.com/docs#section2']})

Expected Output: ['https://example.com/', 'https://example.com/docs', 'https://example.com/faq']

Explanation: The start URL is normalized to `https://example.com/`. Fragments like `#top` and `#section1` are removed, so duplicate visits are avoided.

Input: ('https://site.com/home', {'https://site.com/home': ['https://site.com/a', 'https://news.site.com/a', 'https://site.com/b'], 'https://site.com/a': ['https://site.com/missing', 'https://site.com/home'], 'https://site.com/b': []})

Expected Output: ['https://site.com/a', 'https://site.com/b', 'https://site.com/home', 'https://site.com/missing']

Explanation: The crawler ignores the subdomain `news.site.com`. It still includes `https://site.com/missing` because it is reachable, even though it has no entry in `web`.

Input: ('https://alone.com', {})

Expected Output: ['https://alone.com/']

Explanation: Even when the seed page is not present in `web`, it is still a reachable page and should be returned after normalization.

Input: ('https://dup.com', {'https://dup.com': ['https://dup.com/page#one'], 'https://dup.com#frag': ['https://dup.com/page'], 'https://dup.com/page': ['https://dup.com/#top', 'https://other.com/page']})

Expected Output: ['https://dup.com/', 'https://dup.com/page']

Explanation: Both `https://dup.com` and `https://dup.com#frag` normalize to the same page, and fragment-only differences do not cause duplicate visits.

Hints

  1. Model the website as a graph: pages are nodes and hyperlinks are directed edges. A stack or queue plus a visited set is enough.
  2. Use URL parsing to extract the hostname and to normalize each URL before storing it, so fragments and case differences do not create duplicate visits.
Last updated: Apr 22, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)