Web Crawler System Design (Onsite)
Problem
Design and implement a concurrent web crawler that:
- Starts from a given URL.
- Uses a provided interface to fetch outgoing links from a page.
- Returns all pages under the same hostname as the starting URL (a minimal single-threaded sketch follows this list).
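For concreteness, here is a minimal single-threaded sketch of the core contract (same-host filter plus deduplication), written in Python as an assumed language; `fetch_outgoing_links` stands in for the provided interface, and concurrency and politeness are layered on afterwards.

```python
from collections import deque
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse


def crawl_same_host(start_url: str,
                    fetch_outgoing_links: Callable[[str], List[str]]) -> Set[str]:
    """Breadth-first crawl that only follows links on the start URL's hostname."""
    start_host = urlparse(start_url).hostname
    visited: Set[str] = {start_url}
    queue = deque([start_url])

    while queue:
        page = queue.popleft()
        for link in fetch_outgoing_links(page):
            absolute = urljoin(page, link)          # resolve relative links against the page
            if urlparse(absolute).hostname != start_host:
                continue                            # stay on the same host
            if absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)

    return visited
```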
Requirements
- Do not revisit URLs; handle cycles safely.
- Enforce a configurable global concurrency limit.
- Ensure thread-safe deduplication.
- Normalize URLs consistently before comparison/deduplication (see the normalization sketch after this list).
- Be polite:
  - Respect robots.txt rules for a given user-agent.
  - Enforce per-host rate limiting; honor Crawl-delay if present (a per-host politeness sketch follows this list).
- Provide robust error handling and a retry policy.
- Explain how you would test correctness and performance.
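One way to meet the normalization requirement, sketched with Python's urllib.parse; the exact canonicalization rules (default ports, empty paths, fragments) are assumptions to confirm with the interviewer.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings dedupe to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Drop default ports so http://host:80/ and http://host/ compare equal.
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"

    path = parts.path or "/"
    # Fragments never reach the server, so they are irrelevant for crawling.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

For example, normalize_url("HTTP://Example.com:80/a#frag") and normalize_url("http://example.com/a") produce the same key.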
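For the politeness requirements, here is a sketch of a per-host gate built on the standard library's urllib.robotparser, which exposes both can_fetch and crawl_delay. The user agent, default delay, and class name are illustrative assumptions; a production version would also fetch robots.txt outside the lock and cache fetch failures.

```python
import threading
import time
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


class HostPoliteness:
    """Per-host robots.txt check plus a minimum delay between requests."""

    def __init__(self, user_agent: str = "example-crawler", default_delay: float = 1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self._lock = threading.Lock()
        self._robots: dict[str, RobotFileParser] = {}
        self._next_allowed: dict[str, float] = {}

    def _robots_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        host = parts.netloc
        if host not in self._robots:
            parser = RobotFileParser(urlunsplit((parts.scheme, host, "/robots.txt", "", "")))
            parser.read()        # network call; a real crawler handles errors and caches failures
            self._robots[host] = parser
        return self._robots[host]

    def acquire(self, url: str) -> bool:
        """Block until this host may be fetched; return False if robots.txt forbids it."""
        host = urlsplit(url).netloc
        with self._lock:
            robots = self._robots_for(url)
            if not robots.can_fetch(self.user_agent, url):
                return False
            delay = robots.crawl_delay(self.user_agent) or self.default_delay
            wait = max(0.0, self._next_allowed.get(host, 0.0) - time.monotonic())
            self._next_allowed[host] = time.monotonic() + wait + delay
        if wait:
            time.sleep(wait)     # sleep outside the lock so other hosts are not blocked
        return True
```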
Given Interface (Assumed)
- fetchOutgoingLinks(url: string) -> List[string]
  - Returns absolute or relative URLs found on the page at url.
  - May throw transient or permanent errors (a retry-wrapper sketch follows this list).
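Since the interface may raise either kind of error, here is a sketch of a retry wrapper with jittered exponential backoff; the exception class names, attempt count, and backoff constants are assumptions, not part of the given interface.

```python
import random
import time
from typing import Callable, List


class TransientFetchError(Exception):
    """Hypothetical: timeouts, 5xx responses, connection resets; worth retrying."""


class PermanentFetchError(Exception):
    """Hypothetical: 404s, malformed URLs, blocked pages; do not retry."""


FetchLinks = Callable[[str], List[str]]


def fetch_with_retry(fetch: FetchLinks, url: str,
                     max_attempts: int = 3, base_delay: float = 0.5) -> List[str]:
    """Retry transient failures with jittered exponential backoff; give up on permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except PermanentFetchError:
            return []                        # record the failure; nothing more to do
        except TransientFetchError:
            if attempt == max_attempts:
                return []
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
    return []
```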
Assumptions
- "Same hostname" means an exact match of the host portion (no subdomains).
- Both http and https may exist; treat them as distinct URLs, but only crawl those whose hostname matches the start URL's hostname.
- Content parsing is handled by fetchOutgoingLinks; your crawler focuses on orchestration, deduplication, and policy.
- A simplified, single-process design is sufficient (discuss how to extend/distribute if time permits); a single-process orchestration sketch follows this list.
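Putting the assumptions together, here is a single-process orchestration sketch in Python (an assumed language): a bounded worker pool doubles as the global concurrency limit, a lock guards the visited set, and a frontier queue with Queue.join gives a clean termination condition. fetch_outgoing_links again stands in for the provided interface, and the normalization and politeness sketches above would slot in just before the dedup check.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse

_STOP = object()  # sentinel used to shut down worker threads


def crawl(start_url: str,
          fetch_outgoing_links: Callable[[str], List[str]],
          max_workers: int = 8) -> Set[str]:
    """Concurrent same-host crawl: bounded worker pool, thread-safe dedup, frontier queue."""
    start_host = urlparse(start_url).hostname
    visited: Set[str] = {start_url}
    visited_lock = threading.Lock()
    frontier: Queue = Queue()
    frontier.put(start_url)

    def worker() -> None:
        while True:
            item = frontier.get()
            if item is _STOP:
                frontier.task_done()
                return
            try:
                for link in fetch_outgoing_links(item):
                    absolute = urljoin(item, link)
                    if urlparse(absolute).hostname != start_host:
                        continue                 # only crawl the start URL's host
                    with visited_lock:           # thread-safe deduplication
                        if absolute in visited:
                            continue
                        visited.add(absolute)
                    frontier.put(absolute)
            except Exception:
                pass                             # a real crawler would classify, log, and retry
            finally:
                frontier.task_done()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_workers):
            pool.submit(worker)
        frontier.join()                          # every discovered URL has been processed
        for _ in range(max_workers):
            frontier.put(_STOP)                  # stop the workers so the pool can shut down

    return visited
```

The worker count acts as the global concurrency limit because each worker performs at most one fetch at a time; to distribute, the visited set and frontier would move into shared storage (for example, a key-value store and a message queue) and each process would run the same worker loop.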