System Design: Image-URL Crawler (URLs only, no HTML storage)
Context
Design a production web crawler that fetches HTML pages and extracts only image URLs. Do not store full HTML bodies. Sources of image URLs include:
- `<img src="...">`
- `<source srcset="...">` within `<picture>`
- Inline CSS styles (e.g., `style="background-image: url('...')"`)
Assume this crawler will run continuously at scale and must support query APIs.
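
A minimal extraction sketch for these three sources, assuming Python's standard-library `HTMLParser` and a regular expression for inline `style` attributes; the class and helper names are illustrative:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(...) inside an inline style attribute value.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

class ImageURLExtractor(HTMLParser):
    """Collects image URLs from <img src>, <img/source srcset>, and inline CSS url(...)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.found.add(urljoin(self.base_url, attrs["src"]))
        if tag in ("img", "source") and attrs.get("srcset"):
            # Each srcset candidate is "URL [descriptor]", comma-separated.
            for candidate in attrs["srcset"].split(","):
                parts = candidate.strip().split()
                if parts:
                    self.found.add(urljoin(self.base_url, parts[0]))
        if attrs.get("style"):
            for url in CSS_URL_RE.findall(attrs["style"]):
                self.found.add(urljoin(self.base_url, url))

# Usage:
#   extractor = ImageURLExtractor("https://example.com/page")
#   extractor.feed(html_text)   # may be called repeatedly with chunks
#   extractor.found             # absolute image URLs
```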
Requirements
- High-level architecture
  - URL frontier/scheduler
  - Fetchers
  - Parsers
  - Deduplication
  - Storage and indexing
  - Control plane
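
  One way to make these component boundaries concrete is to sketch the messages that flow between them; every field name below is an assumption, not a fixed contract:

  ```python
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class CrawlTask:
      """Frontier -> fetcher: one page to fetch."""
      url: str
      host: str                 # key for per-host queueing and politeness
      depth: int = 0
      not_before: float = 0.0   # earliest allowed fetch time (politeness delay)
      retries: int = 0

  @dataclass
  class FetchResult:
      """Fetcher -> parser: response metadata plus the body, which is never persisted."""
      url: str
      final_url: str            # after redirects
      status: int
      content_type: Optional[str]
      body: bytes               # discarded once image URLs are extracted

  @dataclass
  class ImageObservation:
      """Parser -> storage: one discovered image URL and where it was seen."""
      image_url: str
      page_url: str
      source: str               # "img", "srcset", or "css"
      crawl_time: float = 0.0
  ```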
- Crawl politeness and compliance
  - robots.txt handling
  - Per-host rate limiting
  - Retries/backoff
  - User-agent identification
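
  A rough sketch of the politeness layer, assuming the stdlib `urllib.robotparser`, a fixed per-host minimum delay, and exponential backoff with jitter; the user-agent string and the constants are placeholders:

  ```python
  import random
  import time
  from urllib.parse import urlparse
  from urllib.robotparser import RobotFileParser

  USER_AGENT = "ImageURLCrawler/1.0 (+https://example.com/bot-info)"  # placeholder identity
  MIN_DELAY_PER_HOST = 1.0  # seconds between requests to one host; illustrative default

  _robots_cache: dict = {}  # host -> RobotFileParser or None
  _last_fetch: dict = {}    # host -> timestamp of the previous request

  def allowed_by_robots(url: str) -> bool:
      host = urlparse(url).netloc
      if host not in _robots_cache:
          rp = RobotFileParser()
          rp.set_url(f"https://{host}/robots.txt")
          try:
              rp.read()
          except OSError:
              rp = None  # robots.txt unreachable; policy choice here: treat as allowed
          _robots_cache[host] = rp
      rp = _robots_cache[host]
      return True if rp is None else rp.can_fetch(USER_AGENT, url)

  def wait_for_slot(host: str) -> None:
      """Enforce a minimum delay between successive requests to the same host."""
      elapsed = time.time() - _last_fetch.get(host, 0.0)
      if elapsed < MIN_DELAY_PER_HOST:
          time.sleep(MIN_DELAY_PER_HOST - elapsed)
      _last_fetch[host] = time.time()

  def backoff_delay(attempt: int) -> float:
      """Exponential backoff with jitter for retryable failures (timeouts, 5xx)."""
      return min(60.0, 2.0 ** attempt) * (0.5 + random.random() / 2)
  ```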
- Canonicalization and URL normalization
  - Avoiding crawler traps
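
  One possible canonicalizer, built on `urllib.parse`; which query parameters to strip (e.g., tracking parameters) is a policy decision left out of this sketch:

  ```python
  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  def canonicalize(url: str) -> str:
      """Normalize a URL so trivially different spellings map to one frontier/dedupe key."""
      parts = urlsplit(url.strip())
      scheme = parts.scheme.lower()
      host = (parts.hostname or "").lower()
      default_port = {"http": 80, "https": 443}.get(scheme)
      netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
      path = parts.path or "/"
      # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collapse to the same key.
      query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
      # Fragments never reach the server, so they are dropped.
      return urlunsplit((scheme, netloc, path, query, ""))

  # canonicalize("HTTP://Example.com:80/a?b=2&a=1#top") -> "http://example.com/a?a=1&b=2"
  ```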
- Parsing at scale
  - Streaming parsers
  - Charset handling
  - Content-type verification
  - Managing redirects
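
  A streaming fetch-and-parse sketch, assuming the `requests` library and reusing the `ImageURLExtractor` sketched in the Context section; it verifies the Content-Type before parsing, takes the charset from the response headers with a UTF-8 fallback, and never buffers the full body:

  ```python
  import requests  # assumed HTTP client; any streaming-capable client works

  def extract_images_from(url: str) -> set:
      """Fetch a page, verify it is HTML, and stream it through the extractor."""
      headers = {"User-Agent": "ImageURLCrawler/1.0"}
      with requests.get(url, stream=True, timeout=30, headers=headers) as resp:
          resp.raise_for_status()
          ctype = resp.headers.get("Content-Type", "")
          if "text/html" not in ctype and "application/xhtml" not in ctype:
              return set()                          # skip images, PDFs, JSON, ...
          resp.encoding = resp.encoding or "utf-8"  # header charset, else a safe fallback
          extractor = ImageURLExtractor(resp.url)   # resp.url is the post-redirect URL
          for chunk in resp.iter_content(chunk_size=64 * 1024, decode_unicode=True):
              extractor.feed(chunk)                 # incremental parse; memory stays bounded
          return extractor.found                    # the HTML body itself is never stored
  ```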
- Deduplication strategies
  - Normalized URL keys
  - Hash-based dedupe of image content or headers
  - Handling srcset and relative URLs
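
  A dedupe-key sketch under the assumption that the canonical image URL is the identity; `canonicalize` refers to the normalizer sketched above, and hashing content or headers can later refine the key:

  ```python
  import hashlib
  from urllib.parse import urljoin

  def image_key(page_url: str, raw_src: str) -> tuple:
      """Resolve a possibly relative src against its page and derive a stable dedupe key."""
      absolute = urljoin(page_url, raw_src.strip())
      canonical = canonicalize(absolute)  # the normalizer sketched above
      return canonical, hashlib.sha256(canonical.encode()).hexdigest()

  def parse_srcset(page_url: str, srcset: str) -> list:
      """Split a srcset value into absolute candidate URLs, dropping width/density descriptors."""
      urls = []
      for candidate in srcset.split(","):
          candidate = candidate.strip()
          if candidate:
              urls.append(urljoin(page_url, candidate.split()[0]))
      return urls

  # A second dedupe layer can hash response headers (ETag, Content-Length) or, if binaries
  # are ever fetched for validation, the image bytes themselves.
  ```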
- Storage design and schema
  - For images and page–image relationships
  - Propose DB choices: key-value for the frontier, document/column store for metadata, object store if you later fetch binaries for validation
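
  Illustrative record shapes for a document/column store, with the frontier kept in a key-value store; every field name and value below is a placeholder:

  ```python
  # Hypothetical record shapes; all field names and values are placeholders.

  image_doc = {                                # document/column store, keyed by URL hash
      "_id": "sha256-of-canonical-image-url",
      "image_url": "https://cdn.example.com/a.jpg",
      "mime_type": "image/jpeg",               # known only if a HEAD/validation fetch is done
      "first_seen": "2024-06-01T08:30:00Z",
      "last_seen": "2024-06-07T11:02:00Z",
      "domains": ["example.com"],
  }

  page_image_edge = {                          # page–image relationship, append-mostly
      "_id": "sha256-of-page-url-plus-image-url",
      "page_url": "https://example.com/article",
      "image_url": "https://cdn.example.com/a.jpg",
      "source": "img",                         # img | srcset | css
      "crawl_time": "2024-06-07T11:02:00Z",
  }

  frontier_entry = {                           # key-value store, key = canonical page URL
      "state": "queued",                       # queued | in_flight | done | failed
      "not_before": 1717755600.0,
      "retries": 0,
  }
  ```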
- Query and API design
  - Endpoints to list images by domain, by crawl time, by MIME type
  - Pagination and filters
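
  A possible read-API surface with cursor-based pagination; the paths, parameter names, and cursor encoding are assumptions:

  ```python
  # Hypothetical endpoint surface (paths and parameter names are assumptions):
  #
  #   GET /v1/images?domain=example.com&mime_type=image/webp&limit=100&cursor=...
  #   GET /v1/images?crawled_after=2024-06-01T00:00:00Z&crawled_before=2024-06-02T00:00:00Z
  #   GET /v1/pages/{page_url_hash}/images
  #
  # Cursor-based pagination keeps deep listings cheap: the cursor encodes the last row's
  # sort key instead of an offset.

  import base64
  import json

  def encode_cursor(last_sort_key: dict) -> str:
      """Opaque continuation token handed back to the client."""
      return base64.urlsafe_b64encode(json.dumps(last_sort_key).encode()).decode()

  def decode_cursor(cursor: str) -> dict:
      return json.loads(base64.urlsafe_b64decode(cursor.encode()))
  ```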
- Sharding and scaling
  - Per-host queues
  - Consistent hashing
  - Horizontal scaling of fetchers/parsers
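
  A consistent-hashing sketch that pins each host to one fetcher shard, so per-host politeness state never has to be shared across workers; the shard names and virtual-node count are illustrative:

  ```python
  import bisect
  import hashlib

  class ConsistentHashRing:
      """Pins every host to one fetcher shard, keeping per-host politeness state local."""

      def __init__(self, shards, vnodes: int = 100):
          self._ring = sorted(
              (self._hash(f"{shard}#{i}"), shard)
              for shard in shards
              for i in range(vnodes)
          )
          self._keys = [h for h, _ in self._ring]

      @staticmethod
      def _hash(value: str) -> int:
          return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

      def shard_for(self, host: str) -> str:
          idx = bisect.bisect(self._keys, self._hash(host)) % len(self._ring)
          return self._ring[idx][1]

  # ring = ConsistentHashRing(["fetcher-0", "fetcher-1", "fetcher-2"])
  # ring.shard_for("example.com")  # stable even as shards are added or removed
  ```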
- Fault tolerance and idempotency
  - At-least-once fetching
  - De-dup on write
  - Replay safety
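
  An idempotent-write sketch: with at-least-once fetching, replaying a page must not duplicate rows, so the write key is derived deterministically from the (page URL, image URL) pair; the `store.upsert` client is hypothetical:

  ```python
  import hashlib

  def record_image(store, page_url: str, image_url: str, crawl_time: float) -> None:
      """Idempotent write: the key is derived from the data, so replaying the same fetch
      under at-least-once delivery cannot create duplicate rows."""
      key = hashlib.sha256(
          f"{canonicalize(page_url)}|{canonicalize(image_url)}".encode()
      ).hexdigest()
      # 'store' is a hypothetical client exposing an upsert (put-if-newer) operation.
      store.upsert(key=key, value={
          "page_url": page_url,
          "image_url": image_url,
          "last_seen": crawl_time,
      })
  ```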
- Monitoring, metrics, and alerts
  - Crawl rate, error codes, robots.txt denials, queue depth, unique image URL rate
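
  One way to expose these signals, assuming a Prometheus-style metrics library; the metric names and labels are placeholders:

  ```python
  from prometheus_client import Counter, Gauge, start_http_server  # assumed metrics library

  PAGES_FETCHED = Counter("crawler_pages_fetched_total", "Pages fetched", ["status_class"])
  ROBOTS_DENIED = Counter("crawler_robots_denied_total", "URLs skipped due to robots.txt")
  IMAGE_URLS_NEW = Counter("crawler_image_urls_new_total", "Unique image URLs discovered")
  FRONTIER_DEPTH = Gauge("crawler_frontier_depth", "URLs waiting in the frontier")

  # start_http_server(9100)                        # expose /metrics for scraping
  # PAGES_FETCHED.labels(status_class="2xx").inc()
  # FRONTIER_DEPTH.set(1_234_567)
  ```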
- Capacity planning
  - State assumptions and rough sizing
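
  A back-of-envelope sizing sketch; every input number is an explicitly stated assumption to be replaced with real targets:

  ```python
  # All inputs below are assumptions, not measurements.
  pages_per_day = 50_000_000        # fetch volume
  images_per_page = 20              # average image URLs discovered per page
  new_fraction = 0.4                # share of discovered URLs not already stored
  record_bytes = 300                # URL + metadata per stored page-image edge

  fetch_rate = pages_per_day / 86_400                                  # ≈ 579 pages/s sustained
  new_edges_per_day = pages_per_day * images_per_page * new_fraction   # 400,000,000
  metadata_gb_per_day = new_edges_per_day * record_bytes / 1e9         # ≈ 120 GB/day

  print(f"{fetch_rate:,.0f} pages/s, {new_edges_per_day:,.0f} new edges/day, "
        f"{metadata_gb_per_day:,.0f} GB/day of metadata")
  ```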
- Data retention and privacy considerations