Build a concurrent web crawler
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Implement a web crawler that starts from a given URL and visits every reachable page on the same host.
You are given:
- `start_url`: a string representing the first page to crawl.
- `fetch_links(url) -> list[str]`: a blocking function that returns all URLs found on the page.
Requirements:
1. Crawl pages concurrently using multiple threads.
2. Only visit URLs whose host is exactly the same as the host of `start_url`.
3. Visit each eligible URL at most once, even if multiple pages link to it.
4. Return all visited URLs in any order.
5. Your implementation must be thread-safe.
You may assume:
- `fetch_links` is thread-safe.
- URL strings are valid.
- The crawl graph is finite.
Discuss the main synchronization challenges, such as protecting the shared visited set and coordinating worker threads efficiently.
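One workable approach is a fixed pool of worker threads sharing a `Queue` of URLs to fetch and a lock-protected `visited` set. The sketch below is one possible shape, not the only correct answer; it assumes Python, the `fetch_links` signature given above, and an illustrative `num_workers` parameter. `Queue.join()` plus `task_done()` detects quiescence, and `None` sentinels shut the workers down.

```python
import threading
from queue import Queue
from urllib.parse import urlsplit

def crawl(start_url, fetch_links, num_workers=8):
    host = urlsplit(start_url).netloc
    visited = {start_url}          # shared state, guarded by `lock`
    lock = threading.Lock()
    work = Queue()
    work.put(start_url)

    def worker():
        while True:
            url = work.get()
            if url is None:        # sentinel: time to shut down
                work.task_done()
                return
            for link in fetch_links(url):
                if urlsplit(link).netloc != host:
                    continue       # requirement 2: same host only
                with lock:         # check-then-add must be atomic
                    if link in visited:
                        continue
                    visited.add(link)
                work.put(link)     # only the thread that claimed it enqueues it
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                    # every enqueued URL has been processed
    for _ in threads:
        work.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return list(visited)
```

The key synchronization point is the check-then-add on `visited`: done under one lock acquisition, it guarantees each URL is claimed by exactly one thread, so no page is fetched twice (requirement 3) even when two workers discover the same link simultaneously.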
Quick Answer: This question tests concurrent-programming fundamentals — thread safety, synchronization of shared state, and worker coordination — applied to a multithreaded graph traversal in the form of a web crawler.
You are given a starting URL and a directed graph of web pages. The graph is represented by a dictionary where each key is a URL and its value is the list of URLs found on that page.
Implement the core logic of a concurrent web crawler: starting from `start_url`, crawl every reachable page that belongs to the same hostname as `start_url`. Ignore links to other hostnames. Pages may contain duplicate links, the graph may contain cycles, and some discovered URLs may not appear as keys in the dictionary (treat those pages as having no outgoing links).
Because a real concurrent crawler can visit pages in any order, your function must return the final set of crawled URLs sorted in lexicographical order.
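The hostname check can be done with the standard library rather than string slicing; `urllib.parse.urlsplit` exposes the host as `netloc`. A small helper (the name `same_host` is illustrative, not part of the problem statement):

```python
from urllib.parse import urlsplit

def same_host(url: str, start_url: str) -> bool:
    """True when `url` lives on exactly the same host as `start_url`."""
    return urlsplit(url).netloc == urlsplit(start_url).netloc
```

Note that `same_host("http://news.com/world", "http://news.com")` is true while `same_host("http://sports.com", "http://news.com")` is false, which is exactly the filter the example below relies on.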
Constraints
- `start_url` is a non-empty string and represents a valid URL beginning with `http://`.
- The number of pages in `links` can be up to 10^4.
- The total number of links across all pages can be up to 10^5.
- There may be cycles, duplicate links, and URLs in link lists that are not keys in `links`.
Examples
Input:
```python
start_url = "http://news.com"
links = {
    "http://news.com": ["http://news.com/world", "http://sports.com", "http://news.com/about"],
    "http://news.com/world": ["http://news.com", "http://news.com/world/europe"],
    "http://news.com/about": [],
    "http://news.com/world/europe": [],
}
```
Expected Output: ["http://news.com", "http://news.com/about", "http://news.com/world", "http://news.com/world/europe"]
Explanation: Starting from news.com, only URLs on the hostname news.com are crawled. The sports.com link is ignored.
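One way to solve this deterministic variant while keeping real concurrency is breadth-first "waves": fetch the whole frontier in parallel with a thread pool, but merge results on the main thread, so the `visited` set is only ever mutated from one thread and needs no lock. This is a sketch under the assumptions above (Python, `links` as the dictionary, an illustrative `max_workers`):

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

def crawl(start_url, links, max_workers=8):
    host = urlsplit(start_url).netloc
    visited = {start_url}

    def fetch(url):
        # URLs missing from `links` are pages with no outgoing links
        return links.get(url, [])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # Fetch the entire frontier concurrently (one BFS wave).
            wave_results = pool.map(fetch, frontier)
            next_frontier = []
            # Merging happens on the main thread only, so the
            # check-then-add on `visited` is race-free without a lock.
            for found in wave_results:
                for link in found:
                    if urlsplit(link).netloc != host:
                        continue           # skip other hostnames
                    if link in visited:
                        continue           # duplicates and cycles
                    visited.add(link)
                    next_frontier.append(link)
            frontier = next_frontier
    return sorted(visited)                 # deterministic output order
```

On the example above, `crawl("http://news.com", links)` returns the four `news.com` URLs in lexicographical order and never enqueues `http://sports.com`. A worthwhile discussion point: this wave design trades a little parallelism (workers idle between waves) for much simpler reasoning about thread safety, whereas the shared-queue design keeps workers busy but requires a lock around the visited set.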