Implement crawler and file deduplication
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates competencies in concurrent programming, web crawling and graph traversal, thread safety, rate limiting, I/O-efficient file deduplication using hashing, and scalable system design within the Coding & Algorithms domain.
Part 1: Single-Threaded Same-Domain Web Crawler
Constraints
- 0 <= len(pages) <= 10^4
- The total number of links across all page lists is at most 10^5
- All URLs are absolute URLs
- Same-domain comparison uses the URL netloc only; scheme may differ, but subdomains count as different domains
Examples
Input: ("https://example.com", {"https://example.com": ["https://example.com/about", "https://example.com/contact", "https://other.com/skip"], "https://example.com/about": ["https://example.com/team", "https://example.com"], "https://example.com/contact": ["https://example.com/team", "https://other.com/x"], "https://example.com/team": []})
Expected Output: ["https://example.com", "https://example.com/about", "https://example.com/contact", "https://example.com/team"]
Explanation: Standard crawl with a cycle and external links. Only same-domain URLs are followed, and each page is visited once.
Input: ("https://example.com/start", {})
Expected Output: ["https://example.com/start"]
Explanation: Edge case: the seed URL is not present in pages, so it is still returned by itself.
Input: ("https://example.com/home", {"https://example.com/home": ["https://example.com/home", "https://example.com/a", "https://example.com/a", "https://blog.example.com/post"], "https://example.com/a": ["https://example.com/home"]})
Expected Output: ["https://example.com/home", "https://example.com/a"]
Explanation: Self-links and duplicate links should not create duplicate visits. A subdomain is considered outside the domain.
Input: ("https://example.com", {"https://example.com": ["http://example.com/help", "https://blog.example.com/post"], "http://example.com/help": []})
Expected Output: ["https://example.com", "http://example.com/help"]
Explanation: Scheme differences are allowed because the host matches, but a different subdomain is ignored.
Hints
- Use a queue for BFS and a set to ensure each URL is visited at most once.
- Parse the host/netloc from the seed URL once, then compare every discovered link against it.
Part 2: Multithreaded Same-Domain Web Crawler
Constraints
- 0 <= num_workers <= 32
- 0 <= len(pages) <= 10^4
- The total number of links across all successful page lists is at most 10^5
- All URLs are absolute URLs
- Same-domain comparison uses exact netloc equality
Examples
Input: ("https://example.com", {"https://example.com": ["https://example.com/a", "https://example.com/b", "https://example.com/a", "https://other.com/out"], "https://example.com/a": ["https://example.com/c"], "https://example.com/b": ["https://example.com/c"], "https://example.com/c": []}, 3)
Expected Output: ["https://example.com", "https://example.com/a", "https://example.com/b", "https://example.com/c"]
Explanation: Two workers may reach c from different parents, but it must only be discovered once.
Input: ("https://example.com", {"https://example.com": ["https://example.com/a", "https://example.com/b"], "https://example.com/a": None, "https://example.com/b": ["https://example.com/c"]}, 2)
Expected Output: ["https://example.com", "https://example.com/a", "https://example.com/b", "https://example.com/c"]
Explanation: Page a fails to fetch, but it is still in the discovered set because it was linked from the seed page.
Input: ("https://example.com/start", {}, 4)
Expected Output: ["https://example.com/start"]
Explanation: Edge case: the seed page has no entry, so it behaves like an empty page.
Input: ("https://example.com", {"https://example.com": ["https://example.com", "https://example.com/x"], "https://example.com/x": ["https://example.com"]}, 0)
Expected Output: ["https://example.com", "https://example.com/x"]
Explanation: Edge case: num_workers = 0 should be treated as 1 worker, and the cycle must not loop forever.
Hints
- Use a shared queue of URLs to process and a shared seen set to suppress duplicates.
- The seen check and insertion must be protected so two threads do not enqueue the same URL simultaneously.
Part 3: Directory Tree File Deduplication
Constraints
- The root is a dictionary
- The total number of files is at most 10^5
- The total number of characters across all file contents is at most 10^6
- File size means the length of the content string
Examples
Input: ({"a.txt": "hello", "b.txt": "world", "sub": {"c.txt": "hello", "d.txt": "!"}, "sub2": {"e.txt": "world"}},)
Expected Output: [["a.txt", "sub/c.txt"], ["b.txt", "sub2/e.txt"]]
Explanation: There are two duplicate groups: the files containing hello and the files containing world.
Input: ({},)
Expected Output: []
Explanation: Edge case: an empty directory tree contains no duplicates.
Input: ({"x.txt": "", "dir": {"y.txt": "", "z.txt": "a"}},)
Expected Output: [["dir/y.txt", "x.txt"]]
Explanation: Empty files are valid and should be detected as duplicates.
Input: ({"a.txt": "ab", "b.txt": "cd", "c.txt": "ab"},)
Expected Output: [["a.txt", "c.txt"]]
Explanation: Files a.txt and b.txt have the same size but different content, so only a.txt and c.txt are duplicates.
Hints
- First traverse the tree to collect each file's full relative path and size.
- Only compute hashes for size buckets containing at least two files.