Implement crawler and file deduplication
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question tests concurrent programming, graph traversal for web crawling, thread safety, rate limiting, I/O-efficient file deduplication via hashing, and scalable system design, all within the Coding & Algorithms domain.
Part 1: Single-Threaded Same-Domain Web Crawler
Constraints
- 0 <= len(pages) <= 10^4
- The total number of links across all page lists is at most 10^5
- All URLs are absolute URLs
- Same-domain comparison uses the URL netloc only; scheme may differ, but subdomains count as different domains
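The netloc rule above can be illustrated with `urllib.parse` (the URLs here are illustrative):

```python
from urllib.parse import urlparse

# Same domain: the scheme differs, but the netloc matches.
same = urlparse("https://example.com/a").netloc == urlparse("http://example.com/b").netloc
print(same)  # True

# Different domain: a subdomain produces a different netloc.
sub = urlparse("https://example.com").netloc == urlparse("https://www.example.com").netloc
print(sub)  # False
```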
Examples
Input: ("https://example.com", {"https://example.com": ["https://example.com/about", "https://example.com/contact", "https://other.com/skip"], "https://example.com/about": ["https://example.com/team", "https://example.com"], "https://example.com/contact": ["https://example.com/team", "https://other.com/x"], "https://example.com/team": []})
Expected Output: ["https://example.com", "https://example.com/about", "https://example.com/contact", "https://example.com/team"]
Explanation: Standard crawl with a cycle and external links. Only same-domain URLs are followed, and each page is visited once.
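One way to satisfy Part 1 is a BFS over the adjacency map, filtering links by netloc so external sites and subdomains are skipped and each page is enqueued at most once. This is a sketch, not the reference solution; the function name `crawl` and the return of URLs in visit order are assumptions consistent with the example above.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, pages):
    """Return every reachable same-domain URL, each visited exactly once.

    `pages` maps each URL to the list of links found on that page.
    """
    target = urlparse(start_url).netloc
    visited = {start_url}          # mark before enqueueing to break cycles
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in pages.get(url, []):
            # Follow only links whose netloc matches the start URL's;
            # subdomains have a different netloc and are skipped.
            if link not in visited and urlparse(link).netloc == target:
                visited.add(link)
                queue.append(link)
    return order
```

On the example input, this BFS yields ["https://example.com", "https://example.com/about", "https://example.com/contact", "https://example.com/team"], matching the expected output; a DFS would also satisfy the "visit once" requirement if ordering is not constrained.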