
Design a concurrent web crawler

Last updated: Mar 29, 2026

Quick Overview

This question evaluates skills in concurrent systems and orchestration, including thread-safe deduplication, URL normalization, global and per-host rate limiting, robots.txt compliance, error handling, and scalability trade-offs.

  • hard
  • Snowflake
  • System Design
  • Software Engineer

Design a concurrent web crawler

Company: Snowflake

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design and implement a web crawler that, given a starting URL and an interface to fetch outgoing links, returns all pages under the same hostname. Avoid revisiting URLs, handle cycles, and respect a configurable concurrency limit. Explain how you ensure thread-safe deduplication, URL normalization, politeness (rate limiting and robots rules), and robust error handling, and how you would test correctness and performance.


Related Interview Questions

  • Design a Cron Job Scheduler - Snowflake (medium)
  • Design a disk-backed KV store under contention - Snowflake (easy)
  • Design an ACL authorization checking service - Snowflake (hard)
  • Design an object store with deduplication - Snowflake (medium)
  • Design a distributed system end-to-end - Snowflake (hard)
Snowflake • Software Engineer • Onsite • System Design • Sep 6, 2025

Web Crawler System Design (Onsite)

Problem

Design and implement a concurrent web crawler that:

  • Starts from a given URL.
  • Uses a provided interface to fetch outgoing links from a page.
  • Returns all pages under the same hostname as the starting URL.

Requirements

  1. Do not revisit URLs; handle cycles safely.
  2. Enforce a configurable global concurrency limit.
  3. Ensure thread-safe deduplication (see the crawler sketch after this list).
  4. Normalize URLs consistently before comparison/deduplication.
  5. Be polite:
    • Respect robots.txt rules for a given user-agent.
    • Enforce rate limiting per host; consider Crawl-delay if present.
  6. Robust error handling and retry policy.
  7. Explain how you would test correctness and performance.
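A minimal sketch of the orchestration core, assuming Python; `crawl`, `fetch_outgoing_links`, and `max_workers` are illustrative names standing in for the provided interface, not part of the prompt. It covers requirements 1–4: a bounded thread pool as the global concurrency limit, a lock-guarded visited set for thread-safe deduplication (which also breaks cycles), and a same-hostname filter. Politeness and normalization are sketched after the Assumptions section.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse

def crawl(start_url: str,
          fetch_outgoing_links: Callable[[str], List[str]],
          max_workers: int = 8) -> Set[str]:
    """Return every discovered URL sharing the start URL's hostname."""
    start_host = urlparse(start_url).hostname
    visited: Set[str] = {start_url}      # doubles as the result set
    visited_lock = threading.Lock()      # guards `visited` across worker threads

    def worker(url: str) -> List[str]:
        try:
            links = fetch_outgoing_links(url)
        except Exception:
            return []                    # real code would classify and retry transient errors
        frontier = []
        for link in links:
            absolute = urljoin(url, link)              # resolve relative links
            if urlparse(absolute).hostname != start_host:
                continue                               # stay on the starting hostname
            with visited_lock:
                if absolute not in visited:            # dedup also breaks cycles
                    visited.add(absolute)
                    frontier.append(absolute)
        return frontier

    # ThreadPoolExecutor(max_workers) is the configurable global concurrency limit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(worker, start_url)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for next_url in future.result():
                    pending.add(pool.submit(worker, next_url))
    return visited
```

Holding the lock only around the membership check and insert keeps contention low; each URL enters `visited` exactly once, so it is fetched at most once even when two workers discover it simultaneously.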

Given Interface (Assumed)

  • fetchOutgoingLinks(url: string) -> List[string]
    • Returns absolute or relative URLs found on the page at url.
    • May throw transient or permanent errors (a typed sketch with an in-memory fake for testing follows below).
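Since only the interface's shape is given, a typed stand-in plus a deterministic in-memory fake makes correctness tests straightforward; `LinkFetcher`, `FakeFetcher`, and the example site graph are assumptions for illustration, not part of the prompt.

```python
from typing import Dict, List, Protocol

class LinkFetcher(Protocol):
    """Typed stand-in for the provided fetchOutgoingLinks interface."""
    def __call__(self, url: str) -> List[str]: ...

class FakeFetcher:
    """Deterministic in-memory site graph: no network, ideal for correctness tests."""
    def __init__(self, site: Dict[str, List[str]]):
        self.site = site

    def __call__(self, url: str) -> List[str]:
        if url not in self.site:
            raise ValueError(f"404: {url}")    # simulate a permanent error
        return self.site[url]

# Example: a tiny site with a cycle and one off-host link.
fetcher = FakeFetcher({
    "https://example.com/": ["/about", "https://other.com/x"],
    "https://example.com/about": ["/"],        # cycle back to the start page
})
```

With the crawl sketch above, crawl("https://example.com/", fetcher) should return exactly the two example.com pages, exercising cycle handling and same-host filtering without any network access.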

Assumptions

  • "Same hostname" means exact match of the host portion (no subdomains).
  • Both http and https may exist; treat them as distinct URLs, but only crawl those whose hostname matches the start URL's hostname.
  • Content parsing is handled by fetchOutgoingLinks; your crawler focuses on orchestration, deduplication, and policy (normalization and politeness sketches follow this list).
  • A simplified, single-process design is sufficient (discuss how to extend/distribute if time permits).

