System Design: Image-URL Crawler (URLs only, no HTML storage)
Context
Design a production web crawler that fetches HTML pages and extracts only image URLs. Do not store full HTML bodies. Sources of image URLs include:
- `<img src="...">`
- `<source srcset="...">` within `<picture>`
- Inline CSS styles (e.g., `style="background-image: url('...')"`)
Assume this crawler will run continuously at scale and must support query APIs.
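
A minimal extraction sketch for these three sources, assuming Python's standard-library `HTMLParser` and a regular expression for inline `style` attributes; the class and helper names are illustrative:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(...) inside an inline style attribute value.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

class ImageURLExtractor(HTMLParser):
    """Collects image URLs from <img src>, <img/source srcset>, and inline CSS url(...)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.found.add(urljoin(self.base_url, attrs["src"]))
        if tag in ("img", "source") and attrs.get("srcset"):
            # Each srcset candidate is "URL [descriptor]", comma-separated.
            for candidate in attrs["srcset"].split(","):
                parts = candidate.strip().split()
                if parts:
                    self.found.add(urljoin(self.base_url, parts[0]))
        if attrs.get("style"):
            for url in CSS_URL_RE.findall(attrs["style"]):
                self.found.add(urljoin(self.base_url, url))

# Usage:
#   extractor = ImageURLExtractor("https://example.com/page")
#   extractor.feed(html_text)   # may be called repeatedly with chunks
#   extractor.found             # absolute image URLs
```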
Requirements
- High-level architecture
  - URL frontier/scheduler
  - Fetchers
  - Parsers
  - Deduplication
  - Storage and indexing
  - Control plane
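
  One way to make these component boundaries concrete is to sketch the messages that flow between them; every field name below is an assumption, not a fixed contract:

  ```python
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class CrawlTask:
      """Frontier -> fetcher: one page to fetch."""
      url: str
      host: str                 # key for per-host queueing and politeness
      depth: int = 0
      not_before: float = 0.0   # earliest allowed fetch time (politeness delay)
      retries: int = 0

  @dataclass
  class FetchResult:
      """Fetcher -> parser: response metadata plus the body, which is never persisted."""
      url: str
      final_url: str            # after redirects
      status: int
      content_type: Optional[str]
      body: bytes               # discarded once image URLs are extracted

  @dataclass
  class ImageObservation:
      """Parser -> storage: one discovered image URL and where it was seen."""
      image_url: str
      page_url: str
      source: str               # "img", "srcset", or "css"
      crawl_time: float = 0.0
  ```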
- Crawl politeness and compliance
  - robots.txt handling
  - Per-host rate limiting
  - Retries/backoff
  - User-agent identification
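
  A rough sketch of the politeness layer, assuming the stdlib `urllib.robotparser`, a fixed per-host minimum delay, and exponential backoff with jitter; the user-agent string and the constants are placeholders:

  ```python
  import random
  import time
  from urllib.parse import urlparse
  from urllib.robotparser import RobotFileParser

  USER_AGENT = "ImageURLCrawler/1.0 (+https://example.com/bot-info)"  # placeholder identity
  MIN_DELAY_PER_HOST = 1.0  # seconds between requests to one host; illustrative default

  _robots_cache: dict = {}  # host -> RobotFileParser or None
  _last_fetch: dict = {}    # host -> timestamp of the previous request

  def allowed_by_robots(url: str) -> bool:
      host = urlparse(url).netloc
      if host not in _robots_cache:
          rp = RobotFileParser()
          rp.set_url(f"https://{host}/robots.txt")
          try:
              rp.read()
          except OSError:
              rp = None  # robots.txt unreachable; policy choice here: treat as allowed
          _robots_cache[host] = rp
      rp = _robots_cache[host]
      return True if rp is None else rp.can_fetch(USER_AGENT, url)

  def wait_for_slot(host: str) -> None:
      """Enforce a minimum delay between successive requests to the same host."""
      elapsed = time.time() - _last_fetch.get(host, 0.0)
      if elapsed < MIN_DELAY_PER_HOST:
          time.sleep(MIN_DELAY_PER_HOST - elapsed)
      _last_fetch[host] = time.time()

  def backoff_delay(attempt: int) -> float:
      """Exponential backoff with jitter for retryable failures (timeouts, 5xx)."""
      return min(60.0, 2.0 ** attempt) * (0.5 + random.random() / 2)
  ```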
- Canonicalization and URL normalization
  - Avoiding crawler traps
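
  One possible canonicalizer, built on `urllib.parse`; which query parameters to strip (e.g., tracking parameters) is a policy decision left out of this sketch:

  ```python
  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  def canonicalize(url: str) -> str:
      """Normalize a URL so trivially different spellings map to one frontier/dedupe key."""
      parts = urlsplit(url.strip())
      scheme = parts.scheme.lower()
      host = (parts.hostname or "").lower()
      default_port = {"http": 80, "https": 443}.get(scheme)
      netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
      path = parts.path or "/"
      # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collapse to the same key.
      query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
      # Fragments never reach the server, so they are dropped.
      return urlunsplit((scheme, netloc, path, query, ""))

  # canonicalize("HTTP://Example.com:80/a?b=2&a=1#top") -> "http://example.com/a?a=1&b=2"
  ```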
- Parsing at scale
  - Streaming parsers
  - Charset handling
  - Content-type verification
  - Managing redirects
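
  A streaming fetch-and-parse sketch, assuming the `requests` library and reusing the `ImageURLExtractor` sketched in the Context section; it verifies the Content-Type before parsing, takes the charset from the response headers with a UTF-8 fallback, and never buffers the full body:

  ```python
  import requests  # assumed HTTP client; any streaming-capable client works

  def extract_images_from(url: str) -> set:
      """Fetch a page, verify it is HTML, and stream it through the extractor."""
      headers = {"User-Agent": "ImageURLCrawler/1.0"}
      with requests.get(url, stream=True, timeout=30, headers=headers) as resp:
          resp.raise_for_status()
          ctype = resp.headers.get("Content-Type", "")
          if "text/html" not in ctype and "application/xhtml" not in ctype:
              return set()                          # skip images, PDFs, JSON, ...
          resp.encoding = resp.encoding or "utf-8"  # header charset, else a safe fallback
          extractor = ImageURLExtractor(resp.url)   # resp.url is the post-redirect URL
          for chunk in resp.iter_content(chunk_size=64 * 1024, decode_unicode=True):
              extractor.feed(chunk)                 # incremental parse; memory stays bounded
          return extractor.found                    # the HTML body itself is never stored
  ```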
- Deduplication strategies
  - Normalized URL keys
  - Hash-based dedupe of image content or headers
  - Handling srcset and relative URLs
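
  A dedupe-key sketch under the assumption that the canonical image URL is the identity; `canonicalize` refers to the normalizer sketched above, and hashing content or headers can later refine the key:

  ```python
  import hashlib
  from urllib.parse import urljoin

  def image_key(page_url: str, raw_src: str) -> tuple:
      """Resolve a possibly relative src against its page and derive a stable dedupe key."""
      absolute = urljoin(page_url, raw_src.strip())
      canonical = canonicalize(absolute)  # the normalizer sketched above
      return canonical, hashlib.sha256(canonical.encode()).hexdigest()

  def parse_srcset(page_url: str, srcset: str) -> list:
      """Split a srcset value into absolute candidate URLs, dropping width/density descriptors."""
      urls = []
      for candidate in srcset.split(","):
          candidate = candidate.strip()
          if candidate:
              urls.append(urljoin(page_url, candidate.split()[0]))
      return urls

  # A second dedupe layer can hash response headers (ETag, Content-Length) or, if binaries
  # are ever fetched for validation, the image bytes themselves.
  ```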
- Storage design and schema
  - For images and page–image relationships
  - Propose DB choices: key-value for the frontier, document/column store for metadata, object store if you later fetch binaries for validation
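
  Illustrative record shapes for a document/column store, with the frontier kept in a key-value store; every field name and value below is a placeholder:

  ```python
  # Hypothetical record shapes; all field names and values are placeholders.

  image_doc = {                                # document/column store, keyed by URL hash
      "_id": "sha256-of-canonical-image-url",
      "image_url": "https://cdn.example.com/a.jpg",
      "mime_type": "image/jpeg",               # known only if a HEAD/validation fetch is done
      "first_seen": "2024-06-01T08:30:00Z",
      "last_seen": "2024-06-07T11:02:00Z",
      "domains": ["example.com"],
  }

  page_image_edge = {                          # page–image relationship, append-mostly
      "_id": "sha256-of-page-url-plus-image-url",
      "page_url": "https://example.com/article",
      "image_url": "https://cdn.example.com/a.jpg",
      "source": "img",                         # img | srcset | css
      "crawl_time": "2024-06-07T11:02:00Z",
  }

  frontier_entry = {                           # key-value store, key = canonical page URL
      "state": "queued",                       # queued | in_flight | done | failed
      "not_before": 1717755600.0,
      "retries": 0,
  }
  ```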
- Query and API design
  - Endpoints to list images by domain, by crawl time, by MIME type
  - Pagination and filters
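
  A possible read-API surface with cursor-based pagination; the paths, parameter names, and cursor encoding are assumptions:

  ```python
  # Hypothetical endpoint surface (paths and parameter names are assumptions):
  #
  #   GET /v1/images?domain=example.com&mime_type=image/webp&limit=100&cursor=...
  #   GET /v1/images?crawled_after=2024-06-01T00:00:00Z&crawled_before=2024-06-02T00:00:00Z
  #   GET /v1/pages/{page_url_hash}/images
  #
  # Cursor-based pagination keeps deep listings cheap: the cursor encodes the last row's
  # sort key instead of an offset.

  import base64
  import json

  def encode_cursor(last_sort_key: dict) -> str:
      """Opaque continuation token handed back to the client."""
      return base64.urlsafe_b64encode(json.dumps(last_sort_key).encode()).decode()

  def decode_cursor(cursor: str) -> dict:
      return json.loads(base64.urlsafe_b64decode(cursor.encode()))
  ```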
- Sharding and scaling
  - Per-host queues
  - Consistent hashing
  - Horizontal scaling of fetchers/parsers
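
  A consistent-hashing sketch that pins each host to one fetcher shard, so per-host politeness state never has to be shared across workers; the shard names and virtual-node count are illustrative:

  ```python
  import bisect
  import hashlib

  class ConsistentHashRing:
      """Pins every host to one fetcher shard, keeping per-host politeness state local."""

      def __init__(self, shards, vnodes: int = 100):
          self._ring = sorted(
              (self._hash(f"{shard}#{i}"), shard)
              for shard in shards
              for i in range(vnodes)
          )
          self._keys = [h for h, _ in self._ring]

      @staticmethod
      def _hash(value: str) -> int:
          return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

      def shard_for(self, host: str) -> str:
          idx = bisect.bisect(self._keys, self._hash(host)) % len(self._ring)
          return self._ring[idx][1]

  # ring = ConsistentHashRing(["fetcher-0", "fetcher-1", "fetcher-2"])
  # ring.shard_for("example.com")  # stable even as shards are added or removed
  ```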
- Fault tolerance and idempotency
  - At-least-once fetching
  - De-dup on write
  - Replay safety
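
  An idempotent-write sketch: with at-least-once fetching, replaying a page must not duplicate rows, so the write key is derived deterministically from the (page URL, image URL) pair; the `store.upsert` client is hypothetical:

  ```python
  import hashlib

  def record_image(store, page_url: str, image_url: str, crawl_time: float) -> None:
      """Idempotent write: the key is derived from the data, so replaying the same fetch
      under at-least-once delivery cannot create duplicate rows."""
      key = hashlib.sha256(
          f"{canonicalize(page_url)}|{canonicalize(image_url)}".encode()
      ).hexdigest()
      # 'store' is a hypothetical client exposing an upsert (put-if-newer) operation.
      store.upsert(key=key, value={
          "page_url": page_url,
          "image_url": image_url,
          "last_seen": crawl_time,
      })
  ```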
- Monitoring, metrics, and alerts
  - Crawl rate, error codes, robots.txt denials, queue depth, unique image URL rate
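
  One way to expose these signals, assuming a Prometheus-style metrics library; the metric names and labels are placeholders:

  ```python
  from prometheus_client import Counter, Gauge, start_http_server  # assumed metrics library

  PAGES_FETCHED = Counter("crawler_pages_fetched_total", "Pages fetched", ["status_class"])
  ROBOTS_DENIED = Counter("crawler_robots_denied_total", "URLs skipped due to robots.txt")
  IMAGE_URLS_NEW = Counter("crawler_image_urls_new_total", "Unique image URLs discovered")
  FRONTIER_DEPTH = Gauge("crawler_frontier_depth", "URLs waiting in the frontier")

  # start_http_server(9100)                        # expose /metrics for scraping
  # PAGES_FETCHED.labels(status_class="2xx").inc()
  # FRONTIER_DEPTH.set(1_234_567)
  ```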
- Capacity planning
  - State assumptions and rough sizing
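
  A back-of-envelope sizing sketch; every input number is an explicitly stated assumption to be replaced with real targets:

  ```python
  # All inputs below are assumptions, not measurements.
  pages_per_day = 50_000_000        # fetch volume
  images_per_page = 20              # average image URLs discovered per page
  new_fraction = 0.4                # share of discovered URLs not already stored
  record_bytes = 300                # URL + metadata per stored page-image edge

  fetch_rate = pages_per_day / 86_400                                  # ≈ 579 pages/s sustained
  new_edges_per_day = pages_per_day * images_per_page * new_fraction   # 400,000,000
  metadata_gb_per_day = new_edges_per_day * record_bytes / 1e9         # ≈ 120 GB/day

  print(f"{fetch_rate:,.0f} pages/s, {new_edges_per_day:,.0f} new edges/day, "
        f"{metadata_gb_per_day:,.0f} GB/day of metadata")
  ```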
- Data retention and privacy considerations