System Design: Concurrent Web Crawler (Threads)
You are asked to design and implement a basic web crawler that fetches pages concurrently using a thread executor. The crawler should be production-conscious (correctness, robustness, and observability) while remaining reasonably simple.

Requirements

Input
- Accept one or more seed URLs.
- Optional flag to restrict crawling to the same origin as the seeds (scheme, host, port).
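
For illustration, the inputs and tunables could be collected in a single configuration object. This is a minimal sketch; every name below (CrawlerConfig, same_origin_only, and so on) is an assumption for illustration, not something the spec prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    seeds: tuple[str, ...]                  # one or more seed URLs
    same_origin_only: bool = False          # restrict crawling to the seeds' scheme/host/port
    max_workers: int = 8                    # thread executor size
    max_depth: int = 3                      # crawl depth cap from each seed
    per_host_delay: float = 1.0             # minimum seconds between requests to one host
    request_timeout: float = 10.0           # per-request timeout in seconds
    user_agent: str = "example-crawler/0.1" # illustrative user-agent string
    allowed_content_types: tuple[str, ...] = ("text/html",)


# Example usage:
config = CrawlerConfig(seeds=("https://example.com/",), same_origin_only=True)
```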

Crawling behavior
- Fetch pages concurrently using a thread executor with a configurable max worker count.
- Cap crawl depth from each seed.
- Extract links from HTML pages and enqueue newly discovered URLs.
- Normalize and resolve links robustly (relative links, fragments, default ports, casing, etc.).
- Avoid revisiting the same normalized URL (dedup across in-queue and visited).
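
A minimal sketch of how these items could fit together, assuming helper callables fetch_page(url) -> (final_url, html_or_None) and extract_links(html, base_url) that the caller supplies; all names here are illustrative, not a prescribed design. Only the workers run concurrently; the loop itself mutates the frontier and visited set from a single thread, so no extra locking is needed in this shape.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(base: str, link: str) -> str:
    """Resolve a possibly relative link against its base, then normalize it."""
    parts = urlsplit(urljoin(base, link))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the port when it is the scheme's default (http:80, https:443).
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Keep the query string but strip the fragment.
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def crawl(seeds, fetch_page, extract_links, max_depth=3, max_workers=8):
    """Breadth-first crawl in depth "waves"; fetching and parsing are injected."""
    visited = {normalize_url(seed, "") for seed in seeds}   # enqueued-or-fetched URLs
    frontier = [(url, 0) for url in visited]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            batch, frontier = frontier, []                  # process one depth wave at a time
            futures = {pool.submit(fetch_page, url): depth for url, depth in batch}
            for future in as_completed(futures):
                depth = futures[future]
                try:
                    final_url, html = future.result()
                except Exception:
                    continue                                # fetch errors are logged/handled elsewhere
                if html is None or depth >= max_depth:
                    continue
                for href in extract_links(html, final_url):
                    candidate = normalize_url(final_url, href)
                    if candidate not in visited:
                        visited.add(candidate)              # dedup at enqueue time
                        frontier.append((candidate, depth + 1))
    return visited
```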

Compliance and politeness
- Respect robots.txt (allow/disallow rules per user-agent; cache per host; honor crawl-delay if present).
- Per-host politeness/rate limiting (e.g., at most 1 request per host per X seconds, configurable; honor Retry-After on 429/503).
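
One possible shape for robots.txt caching plus per-host spacing, using the standard library's urllib.robotparser; the class name HostPoliteness and its methods are illustrative assumptions, and the Retry-After handling is left to the fetch path.

```python
import threading
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class HostPoliteness:
    """Caches robots.txt per host and enforces a per-host minimum request spacing."""

    def __init__(self, user_agent: str, default_delay: float = 1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self._robots: dict[str, RobotFileParser] = {}   # robots.txt cache, keyed by host
        self._next_slot: dict[str, float] = {}          # earliest allowed request time per host
        self._lock = threading.Lock()

    def _robots_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        host = parts.netloc
        with self._lock:
            parser = self._robots.get(host)
        if parser is None:
            parser = RobotFileParser(f"{parts.scheme}://{host}/robots.txt")
            try:
                parser.read()                           # fetch and parse robots.txt once per host
            except OSError:
                parser.parse([])                        # unreachable robots.txt: treat as allow-all
            with self._lock:
                self._robots[host] = parser
        return parser

    def allowed(self, url: str) -> bool:
        return self._robots_for(url).can_fetch(self.user_agent, url)

    def wait_turn(self, url: str) -> None:
        """Block until this URL's host may be contacted again, honoring crawl-delay."""
        parser = self._robots_for(url)
        delay = parser.crawl_delay(self.user_agent)
        if delay is None:
            delay = self.default_delay
        host = urlsplit(url).netloc
        with self._lock:
            now = time.monotonic()
            ready_at = max(self._next_slot.get(host, now), now)
            self._next_slot[host] = ready_at + delay    # reserve the next slot for this host
        time.sleep(max(0.0, ready_at - now))
```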

Networking
- Handle redirects (update to the final URL; dedup on the normalized final URL).
- Handle HTTP errors and timeouts gracefully (do not crash; back off when appropriate).
- Filter by content type (e.g., only text/html by default).
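
A sketch of a single fetch using only the standard library, matching the fetch_page(url) -> (final_url, html_or_None) shape assumed in the crawl-loop sketch above; urlopen follows redirects and reports the final URL via geturl(). Backoff and retry scheduling are left to the caller.

```python
import urllib.error
import urllib.request


def fetch_page(url: str, user_agent: str = "example-crawler/0.1",
               timeout: float = 10.0) -> tuple[str, str | None]:
    """Return (final_url, html); html is None for non-HTML responses or failed fetches."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            final_url = resp.geturl()                         # URL after any redirects
            if resp.headers.get_content_type() != "text/html":
                return final_url, None                        # content-type filter
            charset = resp.headers.get_content_charset() or "utf-8"
            return final_url, resp.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # On 429/503, err.headers.get("Retry-After") says when to re-enqueue;
        # this sketch simply skips the page instead of scheduling a retry.
        return url, None
    except OSError:
        return url, None                                      # timeouts and connection errors: skip
```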

Data structures and strategy
- Describe the frontier and visited set data structures.
- Describe the duplicate detection strategy (including enqueued vs. fetched URLs and redirects).
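
One way to realize the frontier and visited set as a single thread-safe structure: dedup happens when a URL is enqueued, and again when a redirect reveals a different final URL. The Frontier class below is an illustrative sketch, not a required design.

```python
import collections
import threading


class Frontier:
    def __init__(self):
        self._queue = collections.deque()   # FIFO of (normalized_url, depth)
        self._seen = set()                  # every normalized URL ever enqueued or fetched
        self._lock = threading.Lock()

    def add(self, url: str, depth: int) -> bool:
        """Enqueue url unless it was already enqueued or fetched; return True if added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._queue.append((url, depth))
            return True

    def pop(self):
        """Return the next (url, depth) pair, or None when the queue is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None

    def mark_final(self, final_url: str) -> bool:
        """Record a post-redirect URL; return False if it was already seen (skip processing)."""
        with self._lock:
            if final_url in self._seen:
                return False
            self._seen.add(final_url)
            return True
```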

Testing and monitoring
- Explain how you would test the crawler (unit, integration, concurrency, and fault-injection tests).
- Describe what you would monitor/measure in a real run (metrics, logs, alerts).
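
Example unit and concurrency tests, assuming pytest as the runner and the normalize_url and Frontier sketches above (the crawler module name is hypothetical). No network is involved; fault-injection tests would follow the same pattern with a fetch_page stub that raises or returns error statuses.

```python
import threading

from crawler import Frontier, normalize_url   # hypothetical module layout


def test_normalize_resolves_links_and_strips_fragment_and_default_port():
    assert normalize_url("http://Example.COM:80/a", "b#frag") == "http://example.com/b"
    assert normalize_url("https://example.com/dir/page.html", "../other") == "https://example.com/other"


def test_frontier_dedups_under_concurrency():
    frontier = Frontier()
    results, lock = [], threading.Lock()

    def worker():
        added = frontier.add("https://example.com/", 0)
        with lock:
            results.append(added)

    threads = [threading.Thread(target=worker) for _ in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert results.count(True) == 1       # exactly one thread wins the enqueue
```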

Deliverables
- A brief architecture description and rationale.
- Core algorithm and key components (pseudo-code or code sketch is fine).
- Clear description of data structures and dedup logic.
- Testing strategy and monitoring plan.