
Design a concurrent web crawler

Last updated: Mar 29, 2026

Quick Overview

This question evaluates skills in concurrent systems and orchestration, including thread-safe deduplication, URL normalization, global and per-host rate limiting, robots.txt compliance, error handling, and scalability trade-offs.

  • hard
  • Snowflake
  • System Design
  • Software Engineer

Design a concurrent web crawler

Company: Snowflake

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design and implement a web crawler that, given a starting URL and an interface to fetch outgoing links, returns all pages under the same hostname. Avoid revisiting URLs, handle cycles, and respect a configurable concurrency limit. Explain how you ensure thread-safe deduplication, URL normalization, politeness (rate limiting and robots rules), and robust error handling, and how you would test correctness and performance.


Related Interview Questions

  • Design a Cron Job Scheduler - Snowflake (medium)
  • Design a disk-backed KV store under contention - Snowflake (easy)
  • Design an ACL authorization checking service - Snowflake (hard)
  • Design an object store with deduplication - Snowflake (medium)
  • Design a distributed system end-to-end - Snowflake (hard)
Snowflake • Software Engineer • Onsite • System Design • Sep 6, 2025

Web Crawler System Design (Onsite)

Problem

Design and implement a concurrent web crawler that:

  • Starts from a given URL.
  • Uses a provided interface to fetch outgoing links from a page.
  • Returns all pages under the same hostname as the starting URL.

Requirements

  1. Do not revisit URLs; handle cycles safely.
  2. Enforce a configurable global concurrency limit.
  3. Ensure thread-safe deduplication (see the crawler sketch after this list).
  4. Normalize URLs consistently before comparison/deduplication.
  5. Be polite:
    • Respect robots.txt rules for a given user-agent.
    • Enforce rate limiting per host; consider Crawl-delay if present.
  6. Robust error handling and retry policy.
  7. Explain how you would test correctness and performance.
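A minimal sketch of the orchestration core, assuming Python; `crawl`, `fetch_outgoing_links`, and `max_workers` are illustrative names standing in for the provided interface, not part of the prompt. It covers requirements 1–4: a bounded thread pool as the global concurrency limit, a lock-guarded visited set for thread-safe deduplication (which also breaks cycles), and a same-hostname filter. Politeness and normalization are sketched after the Assumptions section.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse

def crawl(start_url: str,
          fetch_outgoing_links: Callable[[str], List[str]],
          max_workers: int = 8) -> Set[str]:
    """Return every discovered URL sharing the start URL's hostname."""
    start_host = urlparse(start_url).hostname
    visited: Set[str] = {start_url}      # doubles as the result set
    visited_lock = threading.Lock()      # guards `visited` across worker threads

    def worker(url: str) -> List[str]:
        try:
            links = fetch_outgoing_links(url)
        except Exception:
            return []                    # real code would classify and retry transient errors
        frontier = []
        for link in links:
            absolute = urljoin(url, link)              # resolve relative links
            if urlparse(absolute).hostname != start_host:
                continue                               # stay on the starting hostname
            with visited_lock:
                if absolute not in visited:            # dedup also breaks cycles
                    visited.add(absolute)
                    frontier.append(absolute)
        return frontier

    # ThreadPoolExecutor(max_workers) is the configurable global concurrency limit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(worker, start_url)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for next_url in future.result():
                    pending.add(pool.submit(worker, next_url))
    return visited
```

Holding the lock only around the membership check and insert keeps contention low; each URL enters `visited` exactly once, so it is fetched at most once even when two workers discover it simultaneously.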

Given Interface (Assumed)

  • fetchOutgoingLinks(url: string) -> List[string]
    • Returns absolute or relative URLs found on the page at url.
    • May throw transient or permanent errors (a typed sketch with an in-memory fake for testing follows below).
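Since only the interface's shape is given, a typed stand-in plus a deterministic in-memory fake makes correctness tests straightforward; `LinkFetcher`, `FakeFetcher`, and the example site graph are assumptions for illustration, not part of the prompt.

```python
from typing import Dict, List, Protocol

class LinkFetcher(Protocol):
    """Typed stand-in for the provided fetchOutgoingLinks interface."""
    def __call__(self, url: str) -> List[str]: ...

class FakeFetcher:
    """Deterministic in-memory site graph: no network, ideal for correctness tests."""
    def __init__(self, site: Dict[str, List[str]]):
        self.site = site

    def __call__(self, url: str) -> List[str]:
        if url not in self.site:
            raise ValueError(f"404: {url}")    # simulate a permanent error
        return self.site[url]

# Example: a tiny site with a cycle and one off-host link.
fetcher = FakeFetcher({
    "https://example.com/": ["/about", "https://other.com/x"],
    "https://example.com/about": ["/"],        # cycle back to the start page
})
```

With the crawl sketch above, crawl("https://example.com/", fetcher) should return exactly the two example.com pages, exercising cycle handling and same-host filtering without any network access.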

Assumptions

  • "Same hostname" means exact match of the host portion (no subdomains).
  • Both http and https may exist; treat them as distinct URLs, but only crawl those whose hostname matches the start URL's hostname.
  • Content parsing is handled by fetchOutgoingLinks; your crawler focuses on orchestration, deduplication, and policy (normalization and politeness sketches follow this list).
  • A simplified, single-process design is sufficient (discuss how to extend/distribute if time permits).

