PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.

  • medium
  • Anthropic
  • Coding & Algorithms
  • Machine Learning Engineer

Implement a web crawler using a provided API

Company: Anthropic

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

## Web Crawler (BFS) You are implementing a simple web crawler. ### Given - A starting URL `startUrl`. - A provided API: - `List<String> getUrls(String url)` which returns all URLs (as strings) found on the page at `url`. ### Task Return **all unique URLs** that are **reachable** from `startUrl` by repeatedly calling `getUrls`, subject to these constraints: 1. **Only crawl URLs with the same hostname as `startUrl`.** 2. **Do not visit the same URL more than once.** ### Requirements - Use a graph traversal approach (e.g., BFS with a queue). - The output can be in any order. ### Follow-up How would you modify your crawler to use **multiple threads** to improve throughput while still ensuring: - Each URL is fetched at most once. - Only same-host URLs are crawled. - The program terminates correctly when the crawl is complete.

Quick Answer: This question evaluates understanding of web crawling mechanics, URL/hostname filtering, graph traversal concepts, and concurrent fetching, assessing skills in reachability determination, duplicate detection, and thread-safety.

Part 1: Single-Threaded Same-Host Web Crawler

You are given a starting URL `start_url` and a preloaded web graph `web`, where `web[url]` is the list of URLs found on that page. Starting from `start_url`, crawl pages using breadth-first search (BFS). Rules: 1. Only follow URLs whose hostname is the same as the hostname of `start_url`. 2. Never visit the same URL more than once. 3. If a URL does not appear as a key in `web`, treat it as a page with no outgoing links. 4. Return the URLs in the order they are first visited by BFS. The hostname is the part after `http://` or `https://` and before the next `/` (or the end of the string). The scheme itself does not matter for hostname matching.

Constraints

  • 1 <= len(start_url) <= 200
  • 0 <= sum(len(links) for links in web.values()) <= 200000
  • Each URL is a valid string starting with `http://` or `https://`.
  • If a reachable URL is missing from `web`, it has no outgoing links.
  • Preserve the given order of links when exploring neighbors during BFS.

Examples

Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://other.com/x', 'http://news.com/b'], 'http://news.com/a': ['http://news.com/b', 'http://news.com/c'], 'http://news.com/b': ['http://news.com'], 'http://news.com/c': [], 'http://other.com/x': ['http://news.com/c']})

Expected Output: ['http://news.com', 'http://news.com/a', 'http://news.com/b', 'http://news.com/c']

Explanation: The crawler ignores `other.com`, visits same-host pages once, and follows BFS order.

Input: ('https://site.org/home', {'https://site.org/home': ['http://site.org/a', 'https://site.org/b', 'https://othersite.org/z'], 'http://site.org/a': ['https://site.org/home', 'https://site.org/c'], 'https://site.org/b': [], 'https://site.org/c': []})

Expected Output: ['https://site.org/home', 'http://site.org/a', 'https://site.org/b', 'https://site.org/c']

Explanation: Both `http://site.org/...` and `https://site.org/...` share the same hostname, so both are crawlable.

Input: ('http://solo.com', {})

Expected Output: ['http://solo.com']

Explanation: Edge case: the start page is not in the graph, so it is the only visited URL.

Input: ('http://a.com/x', {'http://a.com/x': ['http://a.com/x', 'http://a.com/y', 'http://a.com/y'], 'http://a.com/y': ['http://b.com/q']})

Expected Output: ['http://a.com/x', 'http://a.com/y']

Explanation: Self-loops and duplicate links should not cause repeated visits, and `b.com` is filtered out.

Hints

  1. Use a queue for BFS and a set to remember which URLs have already been seen.
  2. Mark a URL as visited when you enqueue it, not when you dequeue it, so duplicates are avoided early.

Part 2: Simulate a Multi-Worker Web Crawler

You are given the same web graph as in Part 1, but now the crawler has `num_workers` workers. Simulate the crawler in synchronous rounds: 1. At the start of each round, up to `num_workers` URLs are removed from the front of the queue and fetched simultaneously. 2. URLs discovered during that round are added to the back of the queue for future rounds only. 3. Only URLs with the same hostname as `start_url` may be queued. 4. Each URL may be fetched at most once in the entire crawl. 5. If multiple fetched pages in the same round discover the same URL, only the first discovery should enqueue it. 6. If a URL does not appear in `web`, treat it as a page with no outgoing links. To keep the simulation deterministic, when processing a round, examine the fetched pages in the order they were removed from the queue, and examine each page's outgoing links in the given order. Return the total number of rounds needed until the crawler finishes.

Constraints

  • 1 <= num_workers <= 100000
  • 1 <= len(start_url) <= 200
  • 0 <= sum(len(links) for links in web.values()) <= 200000
  • Each URL is a valid string starting with `http://` or `https://`.
  • If a reachable URL is missing from `web`, it has no outgoing links.

Examples

Input: ('http://news.com', {'http://news.com': ['http://news.com/a', 'http://other.com/x', 'http://news.com/b'], 'http://news.com/a': ['http://news.com/b', 'http://news.com/c'], 'http://news.com/b': ['http://news.com'], 'http://news.com/c': [], 'http://other.com/x': ['http://news.com/c']}, 2)

Expected Output: 3

Explanation: Round 1 fetches the start page, round 2 fetches `a` and `b`, and round 3 fetches `c`.

Input: ('http://a.com', {'http://a.com': ['http://a.com/p1', 'http://a.com/p2'], 'http://a.com/p1': ['http://a.com/c', 'http://a.com/d'], 'http://a.com/p2': ['http://a.com/c', 'http://a.com/e', 'http://b.com/x'], 'http://a.com/c': [], 'http://a.com/d': [], 'http://a.com/e': []}, 2)

Expected Output: 4

Explanation: Both workers in round 2 discover `http://a.com/c`, but it must be enqueued only once.

Input: ('https://solo.org', {}, 5)

Expected Output: 1

Explanation: Edge case: only the start page exists, so the crawl finishes in one round.

Input: ('https://site.org/home', {'https://site.org/home': ['https://site.org/a', 'https://site.org/b', 'https://other.org/x'], 'https://site.org/a': ['https://site.org/c'], 'https://site.org/b': ['https://site.org/d'], 'https://site.org/c': [], 'https://site.org/d': []}, 1)

Expected Output: 5

Explanation: With one worker, the process becomes a serial BFS-style crawl: home, a, b, c, d.

Hints

  1. Use one queue for pending URLs and process it in batches of size at most `num_workers`.
  2. To guarantee at-most-once fetch across workers, add a URL to `visited` as soon as it is enqueued, not later.
Last updated: May 20, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)