Design a concurrent web crawler
Company: Snowflake
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
Design and implement a web crawler that, given a starting URL and an interface for fetching a page's outgoing links, returns all pages under the same hostname. Avoid revisiting URLs, handle cycles, and respect a configurable concurrency limit. Explain how you ensure thread-safe deduplication, URL normalization, politeness (rate limiting and robots.txt rules), and error handling, and how you would test correctness and performance.
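One possible sketch of the core loop in Go, assuming the problem's link-fetching interface looks like the `FetchFunc` type below (the names `FetchFunc`, `Crawl`, and `maxWorkers` are illustrative, not part of the prompt). It shows thread-safe deduplication via a mutex-guarded set (which also breaks cycles) and a bounded-semaphore concurrency limit; normalization here is deliberately crude (fragment stripping only).

```go
package crawler

import (
	"net/url"
	"sync"
)

// FetchFunc stands in for the interface the problem provides: fetch a
// page and return its outgoing links. Assumed to return absolute URLs;
// relative links would need ResolveReference against the page URL.
type FetchFunc func(pageURL string) ([]string, error)

// Crawl returns every URL reachable from startURL on the same hostname,
// running at most maxWorkers fetches concurrently.
func Crawl(startURL string, maxWorkers int, fetch FetchFunc) []string {
	start, err := url.Parse(startURL)
	if err != nil {
		return nil
	}
	host := start.Hostname()

	var (
		mu   sync.Mutex                        // guards seen
		seen = map[string]bool{startURL: true} // dedup set; also breaks cycles
		wg   sync.WaitGroup
		sem  = make(chan struct{}, maxWorkers) // bounded semaphore = concurrency limit
	)

	var visit func(pageURL string)
	visit = func(pageURL string) {
		defer wg.Done()
		sem <- struct{}{} // acquire a worker slot
		links, err := fetch(pageURL)
		<-sem // release the slot
		if err != nil {
			return // a production crawler would log and retry with backoff
		}
		for _, link := range links {
			u, err := url.Parse(link)
			if err != nil || u.Hostname() != host {
				continue // drop malformed links and other hosts
			}
			u.Fragment = "" // crude normalization: strip #fragment
			norm := u.String()
			mu.Lock()
			dup := seen[norm]
			seen[norm] = true // check-and-set under one lock: each URL spawns once
			mu.Unlock()
			if !dup {
				wg.Add(1)
				go visit(norm)
			}
		}
	}

	wg.Add(1)
	go visit(startURL)
	wg.Wait() // every child is wg.Add-ed before its parent's Done, so this is race-free

	urls := make([]string, 0, len(seen))
	for u := range seen {
		urls = append(urls, u)
	}
	return urls
}
```

The buffered channel is the idiomatic Go semaphore: at most maxWorkers goroutines hold a slot during a fetch, while the check-and-set on the seen map guarantees each normalized URL is enqueued exactly once even under contention.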
Quick Answer: This question evaluates concurrent-systems design and orchestration: thread-safe deduplication, URL normalization, global and per-host rate limiting, robots.txt compliance, error handling, and scalability trade-offs.
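For the politeness part, one simple per-host rate-limiting sketch (again in Go, and again an assumption about the design rather than a prescribed answer) is to reserve timestamped slots per host under a lock, so concurrent workers space their requests to the same host by a fixed delay:

```go
package crawler

import (
	"sync"
	"time"
)

// hostLimiter enforces a minimum delay between requests to the same
// host. A fixed delay is one simple politeness policy; a token-bucket
// limiter per host would be a natural refinement.
type hostLimiter struct {
	mu    sync.Mutex
	delay time.Duration
	next  map[string]time.Time // earliest permitted request time per host
}

func newHostLimiter(delay time.Duration) *hostLimiter {
	return &hostLimiter{delay: delay, next: make(map[string]time.Time)}
}

// Wait blocks until a request to host is permitted. Each caller
// reserves its own slot under the lock, then sleeps outside it.
func (l *hostLimiter) Wait(host string) {
	l.mu.Lock()
	at := l.next[host]
	if now := time.Now(); at.Before(now) {
		at = now
	}
	l.next[host] = at.Add(l.delay)
	l.mu.Unlock()
	time.Sleep(time.Until(at))
}
```

Because the slot reservation happens inside the critical section but the sleep happens outside it, workers never serialize on the lock while waiting, and N concurrent callers to the same host are spaced exactly delay apart.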