Web Crawler System Design (Onsite)
Problem
Design and implement a concurrent web crawler that:
- Starts from a given URL.
- Uses a provided interface to fetch outgoing links from a page.
- Returns all pages under the same hostname as the starting URL (a minimal single-threaded sketch follows this list).
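For concreteness, here is a minimal single-threaded sketch of the core contract (same-host filter plus deduplication), written in Python as an assumed language; `fetch_outgoing_links` stands in for the provided interface, and concurrency and politeness are layered on afterwards.

```python
from collections import deque
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse


def crawl_same_host(start_url: str,
                    fetch_outgoing_links: Callable[[str], List[str]]) -> Set[str]:
    """Breadth-first crawl that only follows links on the start URL's hostname."""
    start_host = urlparse(start_url).hostname
    visited: Set[str] = {start_url}
    queue = deque([start_url])

    while queue:
        page = queue.popleft()
        for link in fetch_outgoing_links(page):
            absolute = urljoin(page, link)          # resolve relative links against the page
            if urlparse(absolute).hostname != start_host:
                continue                            # stay on the same host
            if absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)

    return visited
```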
Requirements
- Do not revisit URLs; handle cycles safely.
- Enforce a configurable global concurrency limit.
- Ensure thread-safe deduplication.
- Normalize URLs consistently before comparison/deduplication (see the normalization sketch after this list).
- Be polite:
  - Respect robots.txt rules for a given user-agent.
  - Enforce per-host rate limiting; honor Crawl-delay if present (a per-host politeness sketch follows this list).
- Provide robust error handling and a retry policy.
- Explain how you would test correctness and performance.
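One way to meet the normalization requirement, sketched with Python's urllib.parse; the exact canonicalization rules (default ports, empty paths, fragments) are assumptions to confirm with the interviewer.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings dedupe to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Drop default ports so http://host:80/ and http://host/ compare equal.
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"

    path = parts.path or "/"
    # Fragments never reach the server, so they are irrelevant for crawling.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

For example, normalize_url("HTTP://Example.com:80/a#frag") and normalize_url("http://example.com/a") produce the same key.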
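For the politeness requirements, here is a sketch of a per-host gate built on the standard library's urllib.robotparser, which exposes both can_fetch and crawl_delay. The user agent, default delay, and class name are illustrative assumptions; a production version would also fetch robots.txt outside the lock and cache fetch failures.

```python
import threading
import time
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


class HostPoliteness:
    """Per-host robots.txt check plus a minimum delay between requests."""

    def __init__(self, user_agent: str = "example-crawler", default_delay: float = 1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self._lock = threading.Lock()
        self._robots: dict[str, RobotFileParser] = {}
        self._next_allowed: dict[str, float] = {}

    def _robots_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        host = parts.netloc
        if host not in self._robots:
            parser = RobotFileParser(urlunsplit((parts.scheme, host, "/robots.txt", "", "")))
            parser.read()        # network call; a real crawler handles errors and caches failures
            self._robots[host] = parser
        return self._robots[host]

    def acquire(self, url: str) -> bool:
        """Block until this host may be fetched; return False if robots.txt forbids it."""
        host = urlsplit(url).netloc
        with self._lock:
            robots = self._robots_for(url)
            if not robots.can_fetch(self.user_agent, url):
                return False
            delay = robots.crawl_delay(self.user_agent) or self.default_delay
            wait = max(0.0, self._next_allowed.get(host, 0.0) - time.monotonic())
            self._next_allowed[host] = time.monotonic() + wait + delay
        if wait:
            time.sleep(wait)     # sleep outside the lock so other hosts are not blocked
        return True
```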
Given Interface (Assumed)
- fetchOutgoingLinks(url: string) -> List[string]
  - Returns absolute or relative URLs found on the page at url.
  - May throw transient or permanent errors (a retry-wrapper sketch follows this list).
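Since the interface may raise either kind of error, here is a sketch of a retry wrapper with jittered exponential backoff; the exception class names, attempt count, and backoff constants are assumptions, not part of the given interface.

```python
import random
import time
from typing import Callable, List


class TransientFetchError(Exception):
    """Hypothetical: timeouts, 5xx responses, connection resets; worth retrying."""


class PermanentFetchError(Exception):
    """Hypothetical: 404s, malformed URLs, blocked pages; do not retry."""


FetchLinks = Callable[[str], List[str]]


def fetch_with_retry(fetch: FetchLinks, url: str,
                     max_attempts: int = 3, base_delay: float = 0.5) -> List[str]:
    """Retry transient failures with jittered exponential backoff; give up on permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except PermanentFetchError:
            return []                        # record the failure; nothing more to do
        except TransientFetchError:
            if attempt == max_attempts:
                return []
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
    return []
```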
Assumptions
- "Same hostname" means an exact match of the host portion (no subdomains).
- Both http and https may exist; treat them as distinct URLs, but only crawl those whose hostname matches the start URL's hostname.
- Content parsing is handled by fetchOutgoingLinks; your crawler focuses on orchestration, deduplication, and policy.
- A simplified, single-process design is sufficient (discuss how to extend/distribute if time permits); a single-process orchestration sketch follows this list.
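Putting the assumptions together, here is a single-process orchestration sketch in Python (an assumed language): a bounded worker pool doubles as the global concurrency limit, a lock guards the visited set, and a frontier queue with Queue.join gives a clean termination condition. fetch_outgoing_links again stands in for the provided interface, and the normalization and politeness sketches above would slot in just before the dedup check.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse

_STOP = object()  # sentinel used to shut down worker threads


def crawl(start_url: str,
          fetch_outgoing_links: Callable[[str], List[str]],
          max_workers: int = 8) -> Set[str]:
    """Concurrent same-host crawl: bounded worker pool, thread-safe dedup, frontier queue."""
    start_host = urlparse(start_url).hostname
    visited: Set[str] = {start_url}
    visited_lock = threading.Lock()
    frontier: Queue = Queue()
    frontier.put(start_url)

    def worker() -> None:
        while True:
            item = frontier.get()
            if item is _STOP:
                frontier.task_done()
                return
            try:
                for link in fetch_outgoing_links(item):
                    absolute = urljoin(item, link)
                    if urlparse(absolute).hostname != start_host:
                        continue                 # only crawl the start URL's host
                    with visited_lock:           # thread-safe deduplication
                        if absolute in visited:
                            continue
                        visited.add(absolute)
                    frontier.put(absolute)
            except Exception:
                pass                             # a real crawler would classify, log, and retry
            finally:
                frontier.task_done()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_workers):
            pool.submit(worker)
        frontier.join()                          # every discovered URL has been processed
        for _ in range(max_workers):
            frontier.put(_STOP)                  # stop the workers so the pool can shut down

    return visited
```

The worker count acts as the global concurrency limit because each worker performs at most one fetch at a time; to distribute, the visited set and frontier would move into shared storage (for example, a key-value store and a message queue) and each process would run the same worker loop.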