Implement crawler and file deduplication
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
The interview included two coding exercises:
1. Build a web crawler starting from a seed URL within a single domain. First implement a single-threaded version that visits each reachable page at most once and returns the discovered URLs. Then extend it to a multithreaded version. Discuss duplicate suppression, thread safety, termination, rate limiting, and how to handle slow or failing pages.
2. Build a file deduplication tool for a directory tree. Detect duplicates efficiently by first grouping files by size and then confirming duplicates with a content hash. Discuss tradeoffs between I/O-bound and CPU-bound work, how to process very large files, how to scale to huge numbers of files, and how to support near-real-time detection when files are added or modified.
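A minimal sketch of exercise 1, in Python. The page fetcher is injected as a function (`fetch_links`, a name chosen here for illustration) so the crawler can be exercised without a network; a real version would fetch HTML and extract `<a href>` values. The single-threaded crawler is a plain BFS with a visited set; the threaded variant processes the frontier level by level with a thread pool, guarding the shared visited set with a lock.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

def crawl(seed_url, fetch_links):
    """Single-threaded BFS crawl confined to the seed's domain.

    fetch_links(url) -> iterable of href strings found on that page
    (injected so the crawler is testable without a network).
    """
    domain = urlparse(seed_url).netloc
    seen = {seed_url}            # duplicate suppression: mark before enqueueing
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        try:
            hrefs = fetch_links(url)
        except Exception:
            continue             # simple policy: skip slow or failing pages
        for href in hrefs:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

def crawl_threaded(seed_url, fetch_links, workers=8):
    """Level-synchronous parallel BFS: each frontier is fetched in parallel,
    and the crawl terminates when a level discovers no new URLs."""
    domain = urlparse(seed_url).netloc
    seen = {seed_url}
    lock = threading.Lock()      # protects the shared visited set

    def visit(url):
        try:
            hrefs = fetch_links(url)
        except Exception:
            return []
        new = []
        with lock:               # check-and-insert must be atomic
            for href in hrefs:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    new.append(absolute)
        return new

    with ThreadPoolExecutor(max_workers=workers) as pool:
        frontier = [seed_url]
        while frontier:
            results = pool.map(visit, frontier)
            frontier = [u for new in results for u in new]
    return seen
```

Rate limiting and per-host politeness are deliberately omitted here; a fuller answer would add a delay per request (or a token bucket per host) and a timeout on each fetch.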
Quick Answer: This question evaluates concurrent programming, graph traversal for web crawling, thread safety, rate limiting, I/O-efficient file deduplication via size-then-hash filtering, and scalable system design.
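Exercise 2 can be sketched along the same lines: group by size first (a cheap `stat` per file), then confirm candidates with a streaming hash so very large files are never loaded into memory at once. The function and parameter names below are illustrative, not prescribed by the question.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root, chunk_size=1 << 20):
    """Return groups of paths under root with identical content.

    Phase 1 groups files by size; only sizes with two or more files
    can contain duplicates. Phase 2 confirms with a chunked SHA-256,
    reading chunk_size bytes at a time to bound memory use.
    """
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue         # skip files that vanish or are unreadable

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue             # unique size: cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

The size pass is almost pure metadata I/O, while hashing is a mix of disk reads and CPU; a common refinement is an intermediate pass that hashes only the first few kilobytes to prune candidates before full hashing. For near-real-time detection one would watch for file events (e.g. inotify on Linux) and re-hash only changed files, keeping the size and hash indexes incremental.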