How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Anthropic during technical interviews.

Implement a web crawler using a provided API

Company: Anthropic

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

## Web Crawler (BFS) You are implementing a simple web crawler. ### Given - A starting URL `startUrl`. - A provided API: - `List<String> getUrls(String url)` which returns all URLs (as strings) found on the page at `url`. ### Task Return **all unique URLs** that are **reachable** from `startUrl` by repeatedly calling `getUrls`, subject to these constraints: 1. **Only crawl URLs with the same hostname as `startUrl`.** 2. **Do not visit the same URL more than once.** ### Requirements - Use a graph traversal approach (e.g., BFS with a queue). - The output can be in any order. ### Follow-up How would you modify your crawler to use **multiple threads** to improve throughput while still ensuring: - Each URL is fetched at most once. - Only same-host URLs are crawled. - The program terminates correctly when the crawl is complete.

Quick Answer: This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.

Part 1: Single-Threaded Same-Host Web Crawler

You are given a starting URL `start_url` and a preloaded web graph `web`, where `web[url]` is the list of URLs found on that page. Starting from `start_url`, crawl pages using breadth-first search (BFS). Rules: 1. Only follow URLs whose hostname is the same as the hostname of `start_url`. 2. Never visit the same URL more than once. 3. If a URL does not appear as a key in `web`, treat it as a page with no outgoing links. 4. Return the URLs in the order they are first visited by BFS. The hostname is the part after `http://` or `https://` and before the next `/` (or the end of the string). The scheme itself does not matter for hostname matching.

Constraints

1 <= len(start_url) <= 200
0 <= sum(len(links) for links in web.values()) <= 200000
Each URL is a valid string starting with `http://` or `https://`.
If a reachable URL is missing from `web`, it has no outgoing links.
Preserve the given order of links when exploring neighbors during BFS.

Examples

Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://other.com/x', 'http://news.com/b'], 'http://news.com/a': ['http://news.com/b', 'http://news.com/c'], 'http://news.com/b': ['http://news.com'], 'http://news.com/c': [], 'http://other.com/x': ['http://news.com/c']})

Expected Output: ['http://news.com', 'http://news.com/a', 'http://news.com/b', 'http://news.com/c']

Explanation: The crawler ignores `other.com`, visits same-host pages once, and follows BFS order.

Input: ('https://site.org/home', {'https://site.org/home': ['http://site.org/a', 'https://site.org/b', 'https://othersite.org/z'], 'http://site.org/a': ['https://site.org/home', 'https://site.org/c'], 'https://site.org/b': [], 'https://site.org/c': []})

Expected Output: ['https://site.org/home', 'http://site.org/a', 'https://site.org/b', 'https://site.org/c']

Explanation: Both `http://site.org/...` and `https://site.org/...` share the same hostname, so both are crawlable.

Input: ('http://solo.com', {})

Expected Output: ['http://solo.com']

Explanation: Edge case: the start page is not in the graph, so it is the only visited URL.

Input: ('http://a.com/x', {'http://a.com/x': ['http://a.com/x', 'http://a.com/y', 'http://a.com/y'], 'http://a.com/y': ['http://b.com/q']})

Expected Output: ['http://a.com/x', 'http://a.com/y']

Explanation: Self-loops and duplicate links should not cause repeated visits, and `b.com` is filtered out.

Hints

Use a queue for BFS and a set to remember which URLs have already been seen.
Mark a URL as visited when you enqueue it, not when you dequeue it, so duplicates are avoided early.

Part 2: Simulate a Multi-Worker Web Crawler

You are given the same web graph as in Part 1, but now the crawler has `num_workers` workers. Simulate the crawler in synchronous rounds: 1. At the start of each round, up to `num_workers` URLs are removed from the front of the queue and fetched simultaneously. 2. URLs discovered during that round are added to the back of the queue for future rounds only. 3. Only URLs with the same hostname as `start_url` may be queued. 4. Each URL may be fetched at most once in the entire crawl. 5. If multiple fetched pages in the same round discover the same URL, only the first discovery should enqueue it. 6. If a URL does not appear in `web`, treat it as a page with no outgoing links. To keep the simulation deterministic, when processing a round, examine the fetched pages in the order they were removed from the queue, and examine each page's outgoing links in the given order. Return the total number of rounds needed until the crawler finishes.

Constraints

1 <= num_workers <= 100000
1 <= len(start_url) <= 200
0 <= sum(len(links) for links in web.values()) <= 200000
Each URL is a valid string starting with `http://` or `https://`.
If a reachable URL is missing from `web`, it has no outgoing links.

Examples

Expected Output: 3

Explanation: Round 1 fetches the start page, round 2 fetches `a` and `b`, and round 3 fetches `c`.

Input: ('http://a.com', {'http://a.com': ['http://a.com/p1', 'http://a.com/p2'], 'http://a.com/p1': ['http://a.com/c', 'http://a.com/d'], 'http://a.com/p2': ['http://a.com/c', 'http://a.com/e', 'http://b.com/x'], 'http://a.com/c': [], 'http://a.com/d': [], 'http://a.com/e': []}, 2)

Expected Output: 4

Explanation: Both workers in round 2 discover `http://a.com/c`, but it must be enqueued only once.

Input: ('https://solo.org', {}, 5)

Expected Output: 1

Explanation: Edge case: only the start page exists, so the crawl finishes in one round.

Input: ('https://site.org/home', {'https://site.org/home': ['https://site.org/a', 'https://site.org/b', 'https://other.org/x'], 'https://site.org/a': ['https://site.org/c'], 'https://site.org/b': ['https://site.org/d'], 'https://site.org/c': [], 'https://site.org/d': []}, 1)

Expected Output: 5

Explanation: With one worker, the process becomes a serial BFS-style crawl: home, a, b, c, d.

Hints

Use one queue for pending URLs and process it in batches of size at most `num_workers`.
To guarantee at-most-once fetch across workers, add a URL to `visited` as soon as it is enqueued, not later.

Quick Overview

This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.