PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates skills in graph traversal and cycle detection, hostname-based filtering, URL parsing, and basic network error handling. Commonly asked in Coding & Algorithms interviews, it assesses algorithmic reasoning about traversal strategy and complexity (time and space) analysis while balancing conceptual understanding with practical implementation details.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Implement hostname-restricted web crawler

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Implement a single-threaded web crawler that, given a starting URL startUrl and an interface getUrls(url) that returns all hyperlinks on the page at url, returns the set of all unique pages reachable whose hostname exactly matches the hostname of startUrl. Requirements: do not fetch pages outside the starting hostname; avoid duplicate fetches and infinite loops (handle cycles); choose BFS or DFS and justify your choice; state time and space complexity; explain how you parse the hostname from a URL; describe basic error handling for failed requests.

Quick Answer: This question evaluates skills in graph traversal and cycle detection, hostname-based filtering, URL parsing, and basic network error handling. Commonly asked in Coding & Algorithms interviews, it assesses algorithmic reasoning about traversal strategy and complexity (time and space) analysis while balancing conceptual understanding with practical implementation details.

You are given a starting URL startUrl and a dictionary web that simulates the interface getUrls(url). If a URL exists as a key in web, fetching that page succeeds and returns the list of hyperlinks on that page. If a URL is missing from web, the request is considered failed. Implement a single-threaded web crawler using BFS that returns all unique reachable pages whose hostname exactly matches the hostname of startUrl. Never fetch pages outside the starting hostname, avoid duplicate fetches, and handle cycles safely. For deterministic grading, return the final set as a lexicographically sorted list. Parse the hostname from a URL using the hostname component only (for example, with urlparse(url).hostname), so scheme, port, path, query, and fragment do not affect the host comparison. For basic error handling, if fetching a discovered same-host URL fails, keep that URL in the result but do not expand its outgoing links.

Constraints

  • 0 <= len(web) <= 10^4
  • The total number of hyperlinks across all lists in web is at most 2 * 10^5
  • All URLs in test cases are absolute URL strings, but some discovered URLs may be missing from web to simulate request failures
  • The crawler must be single-threaded and must not fetch any URL whose hostname differs from the hostname of startUrl

Examples

Input: ('http://a.com', {'http://a.com': ['http://a.com/about', 'http://b.com/out'], 'http://a.com/about': [], 'http://b.com/out': ['http://a.com/hidden'], 'http://a.com/hidden': []})

Expected Output: ['http://a.com', 'http://a.com/about']

Explanation: The crawler visits the start page and http://a.com/about. It must not fetch http://b.com/out because the hostname differs, so http://a.com/hidden is never discovered.

Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://news.com/b', 'http://news.com/a'], 'http://news.com/a': ['http://news.com', 'http://news.com/b'], 'http://news.com/b': ['http://news.com/c'], 'http://news.com/c': ['http://news.com/a']})

Expected Output: ['http://news.com', 'http://news.com/a', 'http://news.com/b', 'http://news.com/c']

Explanation: There are duplicate links and a cycle, but each same-host URL is fetched at most once because of the visited set.

Input: ('http://site.com', {'http://site.com': ['http://site.com/about', 'http://site.com/missing', 'http://other.com'], 'http://site.com/about': ['http://site.com/contact'], 'http://site.com/contact': []})

Expected Output: ['http://site.com', 'http://site.com/about', 'http://site.com/contact', 'http://site.com/missing']

Explanation: http://site.com/missing is discovered on the same hostname, so it belongs in the result. Its fetch fails because it is not in web, so it is not expanded. The off-host URL is ignored completely.

Input: ('http://broken.com/home', {})

Expected Output: ['http://broken.com/home']

Explanation: This is an edge case where the first fetch fails immediately. The start URL is still included because it is the initial reachable page, but no further URLs can be discovered.

Input: ('https://shop.com:8080', {'https://shop.com:8080': ['http://shop.com/a', 'https://shop.com:9090/b', 'https://shop.com.au/x'], 'http://shop.com/a': [], 'https://shop.com:9090/b': []})

Expected Output: ['http://shop.com/a', 'https://shop.com:8080', 'https://shop.com:9090/b']

Explanation: The hostname is shop.com for the start page and for both same-host links, even though their schemes and ports differ. shop.com.au is a different hostname and must be ignored.

Solution

def solution(startUrl, web):
    from collections import deque
    from urllib.parse import urlparse

    # BFS is used here because an iterative queue keeps the crawler single-threaded
    # and avoids recursion-depth issues that a recursive DFS could hit on deep link chains.
    def get_hostname(url):
        try:
            return urlparse(url).hostname
        except Exception:
            return None

    start_host = get_hostname(startUrl)
    if not start_host:
        return []

    visited = {startUrl}
    queue = deque([startUrl])

    while queue:
        current = queue.popleft()

        # Basic error handling for failed or malformed fetch results:
        # if the page cannot be fetched (missing key) or does not return a list,
        # keep the URL in the result but do not expand it.
        try:
            links = web[current]
        except Exception:
            continue

        if not isinstance(links, list):
            continue

        for next_url in links:
            if get_hostname(next_url) != start_host:
                continue
            if next_url in visited:
                continue
            visited.add(next_url)
            queue.append(next_url)

    return sorted(visited)

Time complexity: O(V + E), where V is the number of unique same-host URLs discovered and E is the total number of links examined from successfully fetched same-host pages. Space complexity: O(V).

Hints

  1. Use a queue and a visited set. The queue gives you an iterative BFS, and the visited set prevents duplicate fetches and infinite loops.
  2. Compare only the parsed hostname component, not the whole URL string. Two URLs with different schemes or ports can still belong to the same hostname.
Last updated: Jun 1, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement Task Management and Duplicate Detection - Anthropic (medium)