PracHub
QuestionsPremiumLearningGuidesCheatsheetNEW

Quick Overview

This question evaluates skills in web crawling, URL parsing and fragment sanitization, graph traversal for deduplication, and concurrent programming for thread-safe crawling, and it falls under the Coding & Algorithms domain.

  • medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Design a single- and multi-threaded web crawler

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

## Web Crawler (single-threaded, then multi-threaded) You are given: - A starting URL `startUrl` (e.g., `"http://news.example.com/a/index.html"`). - An interface `HtmlParser` with a method `getUrls(url)` that returns **all URLs** found on the page at `url`. Your task is to crawl web pages starting from `startUrl` and return all **unique** pages that are reachable by following links, subject to the rules below. ### Rules 1. **Same-hostname only:** Only crawl pages whose hostname is the same as the hostname of `startUrl`. - The hostname is the part between the scheme (`http://` or `https://`) and the next `/`. 2. **Fragment handling:** URLs may contain a fragment part starting with `#` (e.g., `http://a.com/x#section2`). - You must **strip the fragment** (remove `#` and everything after it) before: - deciding whether you have already visited the URL (deduplication), and - including the URL in the returned result. - Example: `http://a.com/x#p1` and `http://a.com/x#p2` should be treated as the **same page**: `http://a.com/x`. 3. **No other normalization:** Do **not** perform other URL normalization (e.g., do not canonicalize trailing slashes, default ports, query parameter ordering, etc.). Only strip the fragment. ### Part A — Single-threaded Implement a single-threaded crawler that returns the set/list of visited URLs (after stripping fragments), restricted to the same hostname. ### Part B — Follow-up: Multi-threaded Now implement a multi-threaded crawler to speed up crawling. - Multiple worker threads may call `HtmlParser.getUrls` concurrently. - Your solution must be thread-safe and must not crawl the same sanitized URL more than once. ### Output Return all unique sanitized URLs that are reachable from `startUrl` and satisfy the hostname constraint. ### Assumptions / Notes - `HtmlParser.getUrls(url)` is a black box and may be slow (network/IO-like). - The link graph may contain cycles. - URLs returned by `getUrls` are absolute URLs.

Quick Answer: This question evaluates skills in web crawling, URL parsing and fragment sanitization, graph traversal for deduplication, and concurrent programming for thread-safe crawling, and it falls under the Coding & Algorithms domain.

Part 1: Single-Threaded Web Crawler

You are given a starting URL `start_url` and a dictionary `web` that simulates `HtmlParser.getUrls(url)`. For any crawled page `u`, its outgoing links are `web.get(u, [])`. Crawl all pages reachable from `start_url` using a single thread, with these rules: 1. Only crawl URLs whose hostname matches the hostname of `start_url`. 2. Before checking whether a URL was already visited, strip its fragment part (`#...`). 3. Return each page only once, using its fragment-free form. 4. Do not perform any other normalization. For example, `http://a.com`, `http://a.com/`, and `http://a.com:80` are all different URLs. To make the output deterministic, return the visited URLs as a lexicographically sorted list. Notes: - `web` keys represent fragment-free page URLs. - If a crawled page is missing from `web`, treat it as a page with no outgoing links.

Constraints

  • 0 <= len(web) <= 10^4
  • 0 <= total number of links across all pages <= 5 * 10^4
  • All URLs are absolute and start with `http://` or `https://`
  • Only fragment stripping is allowed; no other URL normalization may be applied
  • The link graph may contain cycles

Examples

Input: ('http://news.example.com/a/index.html#top', {'http://news.example.com/a/index.html': ['http://news.example.com/b#section', 'http://other.example.com/x', 'http://news.example.com/c', 'http://news.example.com/b#other'], 'http://news.example.com/b': ['http://news.example.com/c#frag', 'http://news.example.com/a/index.html#again'], 'http://news.example.com/c': ['http://news.example.com/c#self']})

Expected Output: ['http://news.example.com/a/index.html', 'http://news.example.com/b', 'http://news.example.com/c']

Explanation: The crawler starts at the fragment-free URL `http://news.example.com/a/index.html`, ignores the external hostname, removes fragments before deduplication, and reaches pages a, b, and c.

Input: ('https://a.com/home', {'https://a.com/home': ['https://a.com/about#team', 'https://b.com/out', 'https://a.com/about#company'], 'https://a.com/about': ['https://a.com/home#top', 'https://a.com/about#self']})

Expected Output: ['https://a.com/about', 'https://a.com/home']

Explanation: Only URLs on hostname `a.com` are crawled. The two fragment variants of the about page collapse into one visited URL.

Input: ('http://a.com#start', {'http://a.com': ['http://a.com/#top', 'http://a.com:80/#frag', 'https://a.com/path#x', 'http://b.com/page'], 'http://a.com/': ['http://a.com#again'], 'http://a.com:80/': []})

Expected Output: ['http://a.com', 'http://a.com/', 'http://a.com:80/', 'https://a.com/path']

Explanation: All returned URLs share hostname `a.com`, but they remain distinct because only fragments are removed. The missing page `https://a.com/path` is still visited and treated as having no outgoing links.

Input: ('https://solo.example.org/page#frag', {})

Expected Output: ['https://solo.example.org/page']

Explanation: Even if the start page is missing from `web`, it is still reachable from itself after fragment removal.

Solution

def solution(start_url, web):
    from urllib.parse import urlsplit

    def strip_fragment(url):
        return url.split('#', 1)[0]

    def hostname(url):
        try:
            return urlsplit(url).hostname
        except Exception:
            return None

    start_clean = strip_fragment(start_url)
    start_host = hostname(start_clean)
    if start_host is None:
        return []

    visited = set()
    stack = [start_clean]

    while stack:
        url = strip_fragment(stack.pop())
        if url in visited:
            continue
        if hostname(url) != start_host:
            continue

        visited.add(url)

        for next_url in web.get(url, []):
            clean = strip_fragment(next_url)
            if hostname(clean) == start_host and clean not in visited:
                stack.append(clean)

    return sorted(visited)

Time complexity: O(V + E + V log V). Space complexity: O(V).

Hints

  1. Strip the fragment from every URL as soon as you read it, before deduplication.
  2. A DFS or BFS with a `visited` set is enough once you can extract the hostname.

Part 2: Multi-Threaded Web Crawler

You are given the same simulated web as in Part 1: a starting URL `start_url` and a dictionary `web` where `web.get(u, [])` represents the URLs returned by `HtmlParser.getUrls(u)`. Implement a multi-threaded crawler that follows the same rules: 1. Only crawl URLs whose hostname matches the hostname of `start_url`. 2. Strip fragments (`#...`) before deduplication and before adding a URL to the result. 3. Do not perform any other normalization. 4. The same sanitized URL must never be crawled more than once, even when multiple threads run concurrently. Return the final set of visited pages as a lexicographically sorted list. Notes: - `web` keys represent fragment-free page URLs. - If a crawled page is missing from `web`, treat it as having no outgoing links. - Your implementation should use thread-safe coordination internally.

Constraints

  • 0 <= len(web) <= 10^4
  • 0 <= total number of links across all pages <= 5 * 10^4
  • All URLs are absolute and start with `http://` or `https://`
  • Only fragment stripping is allowed; no other URL normalization may be applied
  • The link graph may contain cycles
  • The solution must be thread-safe

Solution

def solution(start_url, web):
    from queue import Queue
    from threading import Lock, Thread
    from urllib.parse import urlsplit

    def sanitize(url):
        return url.split('#', 1)[0]

    def get_host(url):
        return urlsplit(url).netloc

    start = sanitize(start_url)
    target_host = get_host(start)

    seen = {start}
    seen_lock = Lock()
    tasks = Queue()
    tasks.put(start)

    def worker():
        while True:
            current = tasks.get()
            if current is None:
                tasks.task_done()
                return

            for nxt in web.get(current, []):
                nxt = sanitize(nxt)
                if get_host(nxt) != target_host:
                    continue

                should_enqueue = False
                with seen_lock:
                    if nxt not in seen:
                        seen.add(nxt)
                        should_enqueue = True

                if should_enqueue:
                    tasks.put(nxt)

            tasks.task_done()

    worker_count = 4
    threads = [Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()

    tasks.join()

    for _ in threads:
        tasks.put(None)
    tasks.join()

    for t in threads:
        t.join()

    return sorted(seen)

Time complexity: O(V + E) total work, where V is the number of reachable same-host pages after fragment stripping and E is the number of examined links. Wall-clock time can improve with multiple threads when link fetching is slow.. Space complexity: O(V + W), where W is the number of worker threads; dominated by the visited set and task queue..

Hints

  1. Use a shared `visited` set protected by a lock, and mark a URL as visited before enqueueing it.
  2. A thread-safe queue is a natural way to distribute crawl tasks among worker threads.
Last updated: Apr 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Convert Samples into Event Intervals - Anthropic (medium)
  • Convert State Stream to Events - Anthropic (medium)
  • Build a concurrent web crawler - Anthropic (medium)
  • Implement a Parallel Image Processor - Anthropic (medium)
  • Implement a Batch Image Processor - Anthropic (medium)