PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates competencies in concurrent programming, web crawling and graph traversal, thread safety, rate limiting, I/O-efficient file deduplication using hashing, and scalable system design within the Coding & Algorithms domain.

  • medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Implement crawler and file deduplication

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

The interview included two coding exercises: 1. Build a web crawler starting from a seed URL within a single domain. First implement a single-threaded version that visits each reachable page at most once and returns the discovered URLs. Then extend it to a multithreaded version. Discuss duplicate suppression, thread safety, termination, rate limiting, and how to handle slow or failing pages. 2. Build a file deduplication tool for a directory tree. Detect duplicates efficiently by first grouping files by size and then confirming duplicates with a content hash. Discuss tradeoffs between I/O-bound and CPU-bound work, how to process very large files, how to scale to huge numbers of files, and how to support near-real-time detection when files are added or modified.

Quick Answer: This question evaluates competencies in concurrent programming, web crawling and graph traversal, thread safety, rate limiting, I/O-efficient file deduplication using hashing, and scalable system design within the Coding & Algorithms domain.

Part 1: Single-Threaded Same-Domain Web Crawler

You are given a seed URL and an in-memory representation of the web where pages[url] is the list of URLs linked from that page. Write a single-threaded crawler that starts from the seed URL, only follows links whose host matches the seed URL's host, visits each URL at most once, and returns the visited URLs in BFS visitation order. The seed URL must always be included, even if it does not appear in pages. A missing page behaves like a page with no outgoing links. Same-domain means exact netloc match (host[:port]); subdomains are different domains. Treat two pages as the same only if their full URL strings are equal.

Constraints

  • 0 <= len(pages) <= 10^4
  • The total number of links across all page lists is at most 10^5
  • All URLs are absolute URLs
  • Same-domain comparison uses the URL netloc only; scheme may differ, but subdomains count as different domains

Examples

Input: ("https://example.com", {"https://example.com": ["https://example.com/about", "https://example.com/contact", "https://other.com/skip"], "https://example.com/about": ["https://example.com/team", "https://example.com"], "https://example.com/contact": ["https://example.com/team", "https://other.com/x"], "https://example.com/team": []})

Expected Output: ["https://example.com", "https://example.com/about", "https://example.com/contact", "https://example.com/team"]

Explanation: Standard crawl with a cycle and external links. Only same-domain URLs are followed, and each page is visited once.

Input: ("https://example.com/start", {})

Expected Output: ["https://example.com/start"]

Explanation: Edge case: the seed URL is not present in pages, so it is still returned by itself.

Input: ("https://example.com/home", {"https://example.com/home": ["https://example.com/home", "https://example.com/a", "https://example.com/a", "https://blog.example.com/post"], "https://example.com/a": ["https://example.com/home"]})

Expected Output: ["https://example.com/home", "https://example.com/a"]

Explanation: Self-links and duplicate links should not create duplicate visits. A subdomain is considered outside the domain.

Input: ("https://example.com", {"https://example.com": ["http://example.com/help", "https://blog.example.com/post"], "http://example.com/help": []})

Expected Output: ["https://example.com", "http://example.com/help"]

Explanation: Scheme differences are allowed because the host matches, but a different subdomain is ignored.

Hints

  1. Use a queue for BFS and a set to ensure each URL is visited at most once.
  2. Parse the host/netloc from the seed URL once, then compare every discovered link against it.

Part 2: Multithreaded Same-Domain Web Crawler

You are given a seed URL and an in-memory representation of the web where pages[url] is either a list of linked URLs or None if fetching that page fails. Build a multithreaded crawler that starts from the seed URL, uses num_workers worker threads, only follows links whose host matches the seed URL's host, and ensures that each URL is discovered at most once even when multiple threads see it at the same time. Missing pages behave like successful pages with no outgoing links. Failed pages (value None) are still considered discovered, but they produce no outgoing links. Because thread scheduling is nondeterministic, return the discovered URLs sorted lexicographically.

Constraints

  • 0 <= num_workers <= 32
  • 0 <= len(pages) <= 10^4
  • The total number of links across all successful page lists is at most 10^5
  • All URLs are absolute URLs
  • Same-domain comparison uses exact netloc equality

Examples

Input: ("https://example.com", {"https://example.com": ["https://example.com/a", "https://example.com/b", "https://example.com/a", "https://other.com/out"], "https://example.com/a": ["https://example.com/c"], "https://example.com/b": ["https://example.com/c"], "https://example.com/c": []}, 3)

Expected Output: ["https://example.com", "https://example.com/a", "https://example.com/b", "https://example.com/c"]

Explanation: Two workers may reach c from different parents, but it must only be discovered once.

Input: ("https://example.com", {"https://example.com": ["https://example.com/a", "https://example.com/b"], "https://example.com/a": None, "https://example.com/b": ["https://example.com/c"]}, 2)

Expected Output: ["https://example.com", "https://example.com/a", "https://example.com/b", "https://example.com/c"]

Explanation: Page a fails to fetch, but it is still in the discovered set because it was linked from the seed page.

Input: ("https://example.com/start", {}, 4)

Expected Output: ["https://example.com/start"]

Explanation: Edge case: the seed page has no entry, so it behaves like an empty page.

Input: ("https://example.com", {"https://example.com": ["https://example.com", "https://example.com/x"], "https://example.com/x": ["https://example.com"]}, 0)

Expected Output: ["https://example.com", "https://example.com/x"]

Explanation: Edge case: num_workers = 0 should be treated as 1 worker, and the cycle must not loop forever.

Hints

  1. Use a shared queue of URLs to process and a shared seen set to suppress duplicates.
  2. The seen check and insertion must be protected so two threads do not enqueue the same URL simultaneously.

Part 3: Directory Tree File Deduplication

You are given a nested dictionary representing a directory tree. Each key is a file or directory name. If the value is a string, it is that file's content. If the value is another dictionary, it is a subdirectory. Two files are duplicates if their contents are exactly equal. Build a deduplication report efficiently by first grouping files by size (number of characters in the content), and then hashing only files within the same size group to confirm duplicates. Return every duplicate group as a sorted list of relative file paths using '/' between directories. The outer list must be sorted by the first path in each group.

Constraints

  • The root is a dictionary
  • The total number of files is at most 10^5
  • The total number of characters across all file contents is at most 10^6
  • File size means the length of the content string

Examples

Input: ({"a.txt": "hello", "b.txt": "world", "sub": {"c.txt": "hello", "d.txt": "!"}, "sub2": {"e.txt": "world"}},)

Expected Output: [["a.txt", "sub/c.txt"], ["b.txt", "sub2/e.txt"]]

Explanation: There are two duplicate groups: the files containing hello and the files containing world.

Input: ({},)

Expected Output: []

Explanation: Edge case: an empty directory tree contains no duplicates.

Input: ({"x.txt": "", "dir": {"y.txt": "", "z.txt": "a"}},)

Expected Output: [["dir/y.txt", "x.txt"]]

Explanation: Empty files are valid and should be detected as duplicates.

Input: ({"a.txt": "ab", "b.txt": "cd", "c.txt": "ab"},)

Expected Output: [["a.txt", "c.txt"]]

Explanation: Files a.txt and b.txt have the same size but different content, so only a.txt and c.txt are duplicates.

Hints

  1. First traverse the tree to collect each file's full relative path and size.
  2. Only compute hashes for size buckets containing at least two files.
Last updated: Apr 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)