PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of concurrent programming, synchronization, thread safety, and graph traversal as applied to a multithreaded web crawler.

  • medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Build a concurrent web crawler

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a web crawler that starts from a given URL and visits every reachable page on the same host. You are given: - `start_url`: a string representing the first page to crawl. - `fetch_links(url) -> list[str]`: a blocking function that returns all URLs found on the page. Requirements: 1. Crawl pages concurrently using multiple threads. 2. Only visit URLs whose host is exactly the same as the host of `start_url`. 3. Visit each eligible URL at most once, even if multiple pages link to it. 4. Return all visited URLs in any order. 5. Your implementation must be thread-safe. You may assume: - `fetch_links` is thread-safe. - URL strings are valid. - The crawl graph is finite. Discuss the main synchronization challenges, such as protecting the shared visited set and coordinating worker threads efficiently.

Quick Answer: This question evaluates understanding of concurrent programming, synchronization, thread safety, and graph traversal as applied to a multithreaded web crawler.

You are given a starting URL and a directed graph of web pages. The graph is represented by a dictionary where each key is a URL and its value is the list of URLs found on that page. Implement the core logic of a concurrent web crawler: starting from `start_url`, crawl every reachable page that belongs to the same hostname as `start_url`. Ignore links to other hostnames. Pages may contain duplicate links, cycles, and some discovered URLs may not appear as keys in the dictionary (treat those pages as having no outgoing links). Because a real concurrent crawler can visit pages in any order, your function must return the final set of crawled URLs sorted in lexicographical order.

Constraints

  • `start_url` is a non-empty string and represents a valid URL beginning with `http://`.
  • The number of pages in `links` can be up to 10^4.
  • The total number of links across all pages can be up to 10^5.
  • There may be cycles, duplicate links, and URLs in link lists that are not keys in `links`.

Examples

Input: ("http://news.com", {"http://news.com": ["http://news.com/world", "http://sports.com", "http://news.com/about"], "http://news.com/world": ["http://news.com", "http://news.com/world/europe"], "http://news.com/about": [], "http://news.com/world/europe": []})

Expected Output: ["http://news.com", "http://news.com/about", "http://news.com/world", "http://news.com/world/europe"]

Explanation: Starting from news.com, only URLs on the hostname news.com are crawled. The sports.com link is ignored.

Input: ("http://a.com/x", {"http://a.com/x": ["http://a.com/y", "http://a.com/y", "http://b.com/z"], "http://a.com/y": ["http://a.com/x", "http://a.com/z"], "http://a.com/z": ["http://a.com/z"]})

Expected Output: ["http://a.com/x", "http://a.com/y", "http://a.com/z"]

Explanation: Duplicate links and cycles should not cause repeated crawling. The b.com URL is a different hostname and is skipped.

Input: ("http://solo.com/page", {})

Expected Output: ["http://solo.com/page"]

Explanation: Even if the page is not present in the mapping, the crawler still visits the start page itself.

Input: ("http://x.com", {"http://x.com": ["http://x.com/a", "http://x.com/b"], "http://x.com/b": ["http://x.com/c"]})

Expected Output: ["http://x.com", "http://x.com/a", "http://x.com/b", "http://x.com/c"]

Explanation: Discovered same-host pages are included even if some of them are not keys in the dictionary; they are treated as pages with no outgoing links.

Hints

  1. Model the website as a graph and use DFS or BFS starting from `start_url`.
  2. Write a helper that extracts the hostname: the part after `http://` and before the next `/` (or the end of the string).
Last updated: Apr 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)