PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates a candidate's competency in graph traversal and state management, URL parsing and validation, duplicate detection and cycle handling, and concurrent coordination with per-host rate limiting.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Implement a same-host web crawler

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Implement a web crawler. Given a start URL and a function get_links(url) -> list of URLs, return all URLs that share the same hostname as the start. Avoid revisiting pages, handle cycles and duplicate links, and be robust to malformed URLs. Describe time/space complexity and provide code. As a follow-up, explain how you would extend it to run concurrently with multiple workers while preserving correctness (no duplicate fetches) and respecting a per-host rate limit.

Quick Answer: This question evaluates a candidate's competency in graph traversal and state management, URL parsing and validation, duplicate detection and cycle handling, and concurrent coordination with per-host rate limiting.

Part 1: Crawl All Reachable URLs on the Same Hostname

You are given a start URL and a link map that simulates get_links(url). Starting from start_url, crawl reachable pages and return every valid URL whose hostname exactly matches the hostname of start_url. Do not revisit pages. Ignore malformed URLs, duplicate links, and links to other hostnames. Cycles may exist. For deterministic output, return the final list sorted lexicographically. A URL is considered valid if it has both a scheme and a hostname. Treat two URLs as different pages if their full strings are different; no extra normalization is required.

Constraints

  • 0 <= len(link_map) <= 10^4
  • Each adjacency list may contain duplicate URLs, cycles, malformed URLs, and cross-host links.
  • A valid URL must contain both a scheme and a hostname.
  • Only exact hostname matches count; compare by parsed hostname, not by string prefix.
  • You do not need to normalize paths, fragments, or query parameters.

Examples

Input: ('http://news.com', {'http://news.com': ['http://news.com/about', 'http://news.com/about', 'http://other.com/x'], 'http://news.com/about': ['http://news.com', 'http://news.com/contact'], 'http://news.com/contact': []})

Expected Output: ['http://news.com', 'http://news.com/about', 'http://news.com/contact']

Explanation: The crawler stays on news.com, ignores the duplicate about link, ignores other.com, and handles the cycle back to the start page.

Input: ('https://site.com', {'https://site.com': ['not a url', '/relative', 'https://site.com/page'], 'https://site.com/page': ['https://site.com', 'mailto:test@example.com', 'https://site.com/help'], 'https://site.com/help': []})

Expected Output: ['https://site.com', 'https://site.com/help', 'https://site.com/page']

Explanation: Malformed and relative links are ignored. Only valid same-host URLs are returned.

Input: ('not a url', {'not a url': ['http://a.com']})

Expected Output: []

Explanation: Edge case: the start URL itself is malformed, so nothing can be crawled.

Input: ('http://solo.com', {})

Expected Output: ['http://solo.com']

Explanation: Edge case: the start URL is valid but has no outgoing links and is missing from the map.

Solution

def solution(start_url, link_map):
    from urllib.parse import urlparse

    def host_of(url):
        try:
            parsed = urlparse(url)
        except Exception:
            return None
        if not parsed.scheme or not parsed.hostname:
            return None
        return parsed.hostname

    start_host = host_of(start_url)
    if start_host is None:
        return []

    seen = {start_url}
    stack = [start_url]

    while stack:
        url = stack.pop()
        for nxt in link_map.get(url, []):
            if nxt in seen:
                continue
            if host_of(nxt) != start_host:
                continue
            seen.add(nxt)
            stack.append(nxt)

    return sorted(seen)

Time complexity: O(V + E + V log V), where V is the number of reachable same-host valid URLs and E is the number of outgoing links examined.. Space complexity: O(V).

Hints

  1. Use a visited set so cycles and duplicate links do not cause repeated work.
  2. Parse the hostname from each URL; checking whether a string starts with the same prefix is not reliable.

Part 2: Simulate a Concurrent Web Crawler with Per-Host Rate Limits

This is the concurrency follow-up to the crawler. You are given a list of seed URLs, a link map, a number of workers, and a per-host rate limit. Simulate the crawler in discrete time. Rules: 1. Seed URLs are discovered at time 0. 2. Fetching any URL takes exactly 1 time unit. 3. The outgoing links of a page become discovered only when that page finishes fetching. 4. A URL must never be fetched more than once, even if discovered many times. 5. Ignore malformed URLs. 6. For each hostname, if a fetch on that host starts at time t, the next fetch on the same host may start no earlier than t + rate_limit. 7. Whenever workers are free, repeatedly assign the lexicographically smallest currently eligible URL to the smallest available worker id. Unlike Part 1, this scheduler may crawl URLs on multiple hosts so that per-host throttling is meaningful. Return both the set of fetched URLs and the exact start schedule.

Constraints

  • 1 <= workers <= 50
  • 0 <= rate_limit <= 10^4
  • 0 <= total number of URLs discovered/fetched <= 10^4
  • The sum of all link-list lengths can be up to 2 * 10^5.
  • A valid URL must contain both a scheme and a hostname.
  • Treat two URLs as the same page only if their full strings are exactly equal.

Examples

Input: (['http://a.com', 'http://b.com'], {'http://a.com': ['http://a.com/1', 'http://b.com/1'], 'http://b.com': ['http://b.com/2', 'http://a.com/1'], 'http://a.com/1': [], 'http://b.com/1': [], 'http://b.com/2': []}, 2, 2)

Expected Output: (['http://a.com', 'http://a.com/1', 'http://b.com', 'http://b.com/1', 'http://b.com/2'], [(0, 0, 'http://a.com'), (0, 1, 'http://b.com'), (2, 0, 'http://a.com/1'), (2, 1, 'http://b.com/1'), (4, 0, 'http://b.com/2')])

Explanation: The two seed URLs start in parallel at time 0. Each host must wait 2 time units before another fetch on the same host can start.

Input: (['http://x.com'], {'http://x.com': ['http://x.com/a', 'http://x.com/b', 'http://x.com/a'], 'http://x.com/a': [], 'http://x.com/b': []}, 2, 0)

Expected Output: (['http://x.com', 'http://x.com/a', 'http://x.com/b'], [(0, 0, 'http://x.com'), (1, 0, 'http://x.com/a'), (1, 1, 'http://x.com/b')])

Explanation: With rate_limit = 0, the same host can start multiple fetches at the same time once the links are discovered. Duplicate discovery of /a is ignored.

Input: (['not a url', 'https://good.com'], {'https://good.com': ['/relative', 'https://good.com/p'], 'https://good.com/p': ['bad://']}, 1, 1)

Expected Output: (['https://good.com', 'https://good.com/p'], [(0, 0, 'https://good.com'), (1, 0, 'https://good.com/p')])

Explanation: Malformed seeds and malformed discovered links are ignored. The valid page on good.com is fetched, then its child is fetched one time unit later.

Input: ([], {}, 3, 5)

Expected Output: ([], [])

Explanation: Edge case: no seed URLs means there is nothing to crawl.

Solution

def solution(start_urls, link_map, workers, rate_limit):
    from collections import defaultdict
    from urllib.parse import urlparse
    import heapq

    def host_of(url):
        try:
            parsed = urlparse(url)
        except Exception:
            return None
        if not parsed.scheme or not parsed.hostname:
            return None
        return parsed.hostname

    if workers <= 0 or rate_limit < 0:
        return ([], [])

    waiting = defaultdict(list)          # host -> min-heap of discovered, not-yet-started URLs
    seen = set()                         # URLs already discovered
    next_allowed = defaultdict(int)      # host -> earliest next start time
    advert_version = defaultdict(int)    # invalidates stale ready-heap entries

    ready = []                           # (url, host, version)
    cooldown = []                        # (time, host)
    running = []                         # (finish_time, worker_id, url)
    free_workers = list(range(workers))
    heapq.heapify(free_workers)

    fetched = []
    schedule = []
    current_time = 0

    def discover(url):
        nonlocal current_time
        host = host_of(url)
        if host is None or url in seen:
            return

        seen.add(url)
        before_top = waiting[host][0] if waiting[host] else None
        heapq.heappush(waiting[host], url)
        after_top = waiting[host][0]

        if next_allowed[host] <= current_time:
            if before_top is None or after_top != before_top:
                advert_version[host] += 1
                heapq.heappush(ready, (after_top, host, advert_version[host]))
        else:
            heapq.heappush(cooldown, (next_allowed[host], host))

    for url in start_urls:
        discover(url)

    while True:
        while running and running[0][0] <= current_time:
            _, worker_id, url = heapq.heappop(running)
            heapq.heappush(free_workers, worker_id)
            for nxt in link_map.get(url, []):
                discover(nxt)

        while cooldown and cooldown[0][0] <= current_time:
            _, host = heapq.heappop(cooldown)
            if waiting[host] and next_allowed[host] <= current_time:
                advert_version[host] += 1
                heapq.heappush(ready, (waiting[host][0], host, advert_version[host]))

        while free_workers:
            candidate = None

            while ready:
                url, host, version = heapq.heappop(ready)
                if version != advert_version[host]:
                    continue
                if not waiting[host] or waiting[host][0] != url:
                    continue
                if next_allowed[host] > current_time:
                    heapq.heappush(cooldown, (next_allowed[host], host))
                    continue
                candidate = (url, host)
                break

            if candidate is None:
                break

            url, host = candidate
            worker_id = heapq.heappop(free_workers)
            heapq.heappop(waiting[host])

            schedule.append((current_time, worker_id, url))
            fetched.append(url)
            heapq.heappush(running, (current_time + 1, worker_id, url))

            next_allowed[host] = current_time + rate_limit
            advert_version[host] += 1

            if waiting[host]:
                if next_allowed[host] <= current_time:
                    advert_version[host] += 1
                    heapq.heappush(ready, (waiting[host][0], host, advert_version[host]))
                else:
                    heapq.heappush(cooldown, (next_allowed[host], host))

        if not running and not cooldown:
            break

        next_time = float('inf')
        if running:
            next_time = min(next_time, running[0][0])
        if cooldown:
            next_time = min(next_time, cooldown[0][0])

        if next_time == float('inf'):
            break

        current_time = next_time

    return (sorted(fetched), schedule)

Time complexity: O((V + E) log V) in the usual heap-based model, where V is the number of valid unique URLs discovered and E is the total number of links processed.. Space complexity: O(V).

Hints

  1. Think in terms of events: fetch completions and host cooldown expirations both change what can run next.
  2. Use heaps to manage the next free worker, the next completion time, and the lexicographically smallest eligible URL.
Last updated: Apr 20, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement a Least-Recently-Used Cache - Anthropic (medium)
  • Convert Samples into Event Intervals - Anthropic (medium)
  • Convert State Stream to Events - Anthropic (medium)