Design a crawler that stores only image URLs
Company: Atlassian
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Onsite
Design a web crawler that extracts and stores only image URLs from HTML pages (e.g., <img src>, <source srcset>, CSS background-image within inline styles) but does not store full HTML bodies. Cover:
1) High-level architecture (URL frontier, fetchers, parsers, deduplication, storage, indexing, and a control plane).
2) Crawl politeness and compliance (robots.txt, per-host rate limiting, retries/backoff, user-agent, canonicalization, URL normalization, avoiding traps).
3) Parsing at scale (streaming parsers, charset handling, content-type verification, managing redirects).
4) Deduplication strategies (normalized URL keys, hash-based dedupe of image content or headers, handling srcset and relative URLs).
5) Storage design and schema for images and page-image relationships; propose DB choices (e.g., key-value store for the frontier, document/column store for metadata, object store for image bytes if you later choose to fetch binaries for validation).
6) Query and API design: endpoints to list images by domain, by crawl time, by MIME type; pagination and filters.
7) Sharding and scaling (per-host queues, consistent hashing, horizontal scaling of fetchers/parsers).
8) Fault tolerance and idempotency (at-least-once fetching, dedupe on write, replay safety).
9) Monitoring, metrics, and alerts (crawl rate, error codes, robots denials, queue depth, unique image URL rate).
10) Capacity planning assumptions and rough sizing; discuss data retention and privacy considerations.
Quick Answer: This question evaluates the ability to design a scalable, fault-tolerant web crawler, together with related competencies in HTML parsing, URL normalization, deduplication, storage and indexing, and query/API design; it belongs to the System Design domain.
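The sketches below illustrate several of the points above in Python; all identifiers, endpoint shapes, and numbers are assumptions for illustration, not part of the question. For point 2, a minimal URL-normalization sketch: lowercase the scheme and host, strip default ports and fragments, resolve relative references and dot-segments, sort query parameters, and drop common tracking parameters (the exact rule set is an assumed policy a real crawler would tune).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin
import posixpath

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}  # assumed list

def normalize_url(raw: str, base: str | None = None) -> str:
    """Return a canonical form of `raw`, resolving it against `base` if it is relative."""
    url = urljoin(base, raw) if base else raw
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Strip default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in {("http", "80"), ("https", "443")}:
        netloc = netloc.rsplit(":", 1)[0]
    # Collapse dot-segments and ensure a non-empty path.
    path = posixpath.normpath(path) if path else "/"
    if path == ".":
        path = "/"
    # Sort query parameters and drop known tracking noise.
    pairs = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    query = urlencode(sorted(pairs))
    return urlunsplit((scheme, netloc, path, query, ""))

# normalize_url("../img/Logo.png?b=2&a=1#frag", base="HTTP://Example.com:80/gallery/page.html")
# -> "http://example.com/img/Logo.png?a=1&b=2"
```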
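Still under point 2, a hedged sketch of per-host politeness: robots.txt checks via the standard library plus a simple per-host minimum-delay gate (the class name, user agent, and default delay are assumptions; a production crawler would also cache and periodically refresh robots.txt and add jittered backoff on errors).

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "image-crawler-bot/1.0"   # assumed user agent string
MIN_DELAY_SECONDS = 1.0                # assumed default per-host delay

class PolitenessGate:
    def __init__(self):
        self._robots: dict[str, urllib.robotparser.RobotFileParser] = {}
        self._next_allowed: dict[str, float] = {}   # host -> earliest next-fetch time

    def _robots_for(self, host: str, scheme: str) -> urllib.robotparser.RobotFileParser:
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{scheme}://{host}/robots.txt")
            rp.read()   # network call; cache and refresh periodically in practice
            self._robots[host] = rp
        return self._robots[host]

    def may_fetch(self, url: str) -> bool:
        """True if robots.txt allows the URL and the host's crawl delay has elapsed."""
        parts = urlsplit(url)
        rp = self._robots_for(parts.netloc, parts.scheme)
        if not rp.can_fetch(USER_AGENT, url):
            return False   # record as a robots denial metric
        delay = rp.crawl_delay(USER_AGENT) or MIN_DELAY_SECONDS
        now = time.monotonic()
        if now < self._next_allowed.get(parts.netloc, 0.0):
            return False   # too soon; requeue the URL instead of fetching
        self._next_allowed[parts.netloc] = now + delay
        return True
```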
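For point 3, a minimal extraction sketch built on the standard-library HTMLParser, which accepts input incrementally through feed() and so fits a streaming fetcher. It collects <img src>, srcset candidates from <img>/<source>, and background-image URLs in inline style attributes, resolving each against the page URL; the streaming HTTP client in the usage comment is hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BG_IMAGE_RE = re.compile(r"background(?:-image)?\s*:\s*url\(\s*['\"]?([^'\")]+)", re.I)

class ImageURLExtractor(HTMLParser):
    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.image_urls: set[str] = set()

    def _add(self, candidate: str) -> None:
        candidate = candidate.strip()
        if candidate and not candidate.startswith("data:"):
            self.image_urls.add(urljoin(self.page_url, candidate))

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"])
        if tag in ("img", "source") and attrs.get("srcset"):
            # srcset is a comma-separated list of "URL [descriptor]" candidates.
            for candidate in attrs["srcset"].split(","):
                parts = candidate.split()
                if parts:
                    self._add(parts[0])
        if attrs.get("style"):
            for match in BG_IMAGE_RE.findall(attrs["style"]):
                self._add(match)

# extractor = ImageURLExtractor("https://example.com/gallery")
# for chunk in response.iter_text():   # hypothetical streaming HTTP client
#     extractor.feed(chunk)
# extractor.close(); image_urls = extractor.image_urls
```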
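For point 4, one assumed dedupe scheme: the primary key is a hash of the normalized image URL, and an optional secondary key hashes cheap response headers (ETag, Content-Length, Content-Type) from a HEAD request, which can catch identical bytes served under different URLs without downloading them.

```python
import hashlib

def url_dedupe_key(normalized_url: str) -> str:
    """Primary dedupe key: stable hash of the normalized image URL."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()

def header_dedupe_key(etag: str | None, content_length: str | None,
                      content_type: str | None) -> str | None:
    """Optional secondary key from HEAD-request headers; None when there is too little signal."""
    if not etag and not content_length:
        return None
    material = f"{etag or ''}|{content_length or ''}|{content_type or ''}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```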
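For point 5, illustrative record shapes only (every field name here is an assumption, not a fixed schema): one row per unique image URL keyed by its hash, plus an edge table linking pages to the images they reference; indexing these tables on host, crawl time, and MIME type supports the queries in point 6.

```python
from dataclasses import dataclass

@dataclass
class ImageRecord:
    image_key: str                  # sha256 of the normalized image URL (partition key)
    image_url: str
    host: str                       # extracted for per-domain queries
    mime_type: str | None           # from Content-Type, if verified
    first_seen_at: int              # epoch seconds
    last_seen_at: int
    width_hint: int | None = None   # from srcset descriptors, if present

@dataclass
class PageImageEdge:
    page_url_key: str               # sha256 of the normalized page URL
    image_key: str
    crawled_at: int
    source: str = "img_src"         # "img_src" | "srcset" | "inline_style"
```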
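For point 6, a hedged sketch of the listing endpoint's shape; the path, parameter names, and cursor format are assumptions. Cursor-based pagination keeps deep pages cheap because the cursor encodes the last (crawled_at, image_key) pair seen rather than an offset.

```python
import base64
import json

def encode_cursor(crawled_at: int, image_key: str) -> str:
    """Opaque, URL-safe pagination cursor."""
    return base64.urlsafe_b64encode(json.dumps([crawled_at, image_key]).encode()).decode()

def decode_cursor(cursor: str) -> tuple[int, str]:
    crawled_at, image_key = json.loads(base64.urlsafe_b64decode(cursor))
    return crawled_at, image_key

# GET /v1/images?domain=example.com&mime_type=image/webp
#     &crawled_after=2024-01-01T00:00:00Z&limit=100&cursor=<opaque>
# -> {"items": [...], "next_cursor": "<opaque or null>"}
```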
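For point 7, a minimal consistent-hash ring for assigning hosts to fetcher shards, so every URL of a host lands on the same per-host queue (preserving politeness) and adding or removing a fetcher moves only a fraction of hosts; the virtual-node count is an assumed tuning knob.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, shards: list[str], vnodes: int = 64):
        self._ring: list[tuple[int, str]] = []
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def shard_for_host(self, host: str) -> str:
        idx = bisect.bisect(self._points, self._hash(host)) % len(self._ring)
        return self._ring[idx][1]

# ring = HashRing(["fetcher-0", "fetcher-1", "fetcher-2"])
# ring.shard_for_host("images.example.com")   # stable shard assignment
```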
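For point 8, because fetching is at-least-once, writes must tolerate replays. A sketch of dedupe-on-write against any store offering an atomic put-if-absent (modeled here with a plain dict; in practice a conditional write such as Cassandra lightweight transactions or a DynamoDB condition expression):

```python
def record_image(store: dict, image_key: str, record: dict, crawled_at: int) -> bool:
    """Insert once; on replay only bump last_seen_at. Returns True if the key was new."""
    existing = store.get(image_key)
    if existing is None:
        store[image_key] = {**record, "first_seen_at": crawled_at, "last_seen_at": crawled_at}
        return True
    # Replays and re-observations commute: taking the max timestamp converges
    # to the same state no matter how many times or in what order they arrive.
    existing["last_seen_at"] = max(existing["last_seen_at"], crawled_at)
    return False
```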
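For point 10, a back-of-envelope sizing with assumed inputs (every number below is an assumption to be replaced with the interviewer's figures):

```python
pages_per_day = 100_000_000          # assumed crawl throughput
avg_images_per_page = 20             # assumed
unique_ratio = 0.25                  # assumed fraction of image URLs not seen before
bytes_per_metadata_row = 500         # URL + hashes + timestamps + index overhead, assumed

image_refs_per_day = pages_per_day * avg_images_per_page                       # 2.0e9 edges/day
new_image_urls_per_day = int(image_refs_per_day * unique_ratio)                # 5.0e8 rows/day
metadata_gb_per_day = new_image_urls_per_day * bytes_per_metadata_row / 1e9    # ~250 GB/day
page_fetches_per_sec = pages_per_day / 86_400                                  # ~1,160 pages/sec

print(f"{image_refs_per_day:.1e} image refs/day, {new_image_urls_per_day:.1e} new URLs/day")
print(f"~{metadata_gb_per_day:.0f} GB/day of metadata, ~{page_fetches_per_sec:.0f} page fetches/sec")
```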