Design a crawler that stores only image URLs
Company: Atlassian
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Onsite
Design a web crawler that extracts and stores only image URLs from HTML pages (e.g., <img src>, <source srcset>, CSS background-image within inline styles) but does not store full HTML bodies. Cover:
1) High-level architecture (URL frontier, fetchers, parsers, deduplication, storage, indexing, and a control plane).
2) Crawl politeness and compliance (robots.txt, per-host rate limiting, retries/backoff, user-agent, canonicalization, URL normalization, avoiding traps).
3) Parsing at scale (streaming parsers, charset handling, content-type verification, managing redirects).
4) Deduplication strategies (normalized URL keys, hash-based dedupe of image content or headers, handling srcset and relative URLs).
5) Storage design and schema for images and page-image relationships; propose DB choices (e.g., key-value store for the frontier, document/column store for metadata, object store for image bytes if you later choose to fetch binaries for validation).
6) Query and API design: endpoints to list images by domain, by crawl time, by MIME type; pagination and filters.
7) Sharding and scaling (per-host queues, consistent hashing, horizontal scaling of fetchers/parsers).
8) Fault tolerance and idempotency (at-least-once fetching, dedupe on write, replay safety).
9) Monitoring, metrics, and alerts (crawl rate, error codes, robots denials, queue depth, unique image URL rate).
10) Capacity planning assumptions and rough sizing; discuss data retention and privacy considerations.
Quick Answer: This question evaluates the ability to design a scalable, fault-tolerant web crawler, together with related competencies in HTML parsing, URL normalization, deduplication, storage and indexing, and query/API design; it belongs to the System Design domain.
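The sketches below illustrate several of the points above in Python; all identifiers, endpoint shapes, and numbers are assumptions for illustration, not part of the question. For point 2, a minimal URL-normalization sketch: lowercase the scheme and host, strip default ports and fragments, resolve relative references and dot-segments, sort query parameters, and drop common tracking parameters (the exact rule set is an assumed policy a real crawler would tune).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin
import posixpath

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}  # assumed list

def normalize_url(raw: str, base: str | None = None) -> str:
    """Return a canonical form of `raw`, resolving it against `base` if it is relative."""
    url = urljoin(base, raw) if base else raw
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Strip default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in {("http", "80"), ("https", "443")}:
        netloc = netloc.rsplit(":", 1)[0]
    # Collapse dot-segments and ensure a non-empty path.
    path = posixpath.normpath(path) if path else "/"
    if path == ".":
        path = "/"
    # Sort query parameters and drop known tracking noise.
    pairs = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    query = urlencode(sorted(pairs))
    return urlunsplit((scheme, netloc, path, query, ""))

# normalize_url("../img/Logo.png?b=2&a=1#frag", base="HTTP://Example.com:80/gallery/page.html")
# -> "http://example.com/img/Logo.png?a=1&b=2"
```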
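Still under point 2, a hedged sketch of per-host politeness: robots.txt checks via the standard library plus a simple per-host minimum-delay gate (the class name, user agent, and default delay are assumptions; a production crawler would also cache and periodically refresh robots.txt and add jittered backoff on errors).

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "image-crawler-bot/1.0"   # assumed user agent string
MIN_DELAY_SECONDS = 1.0                # assumed default per-host delay

class PolitenessGate:
    def __init__(self):
        self._robots: dict[str, urllib.robotparser.RobotFileParser] = {}
        self._next_allowed: dict[str, float] = {}   # host -> earliest next-fetch time

    def _robots_for(self, host: str, scheme: str) -> urllib.robotparser.RobotFileParser:
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{scheme}://{host}/robots.txt")
            rp.read()   # network call; cache and refresh periodically in practice
            self._robots[host] = rp
        return self._robots[host]

    def may_fetch(self, url: str) -> bool:
        """True if robots.txt allows the URL and the host's crawl delay has elapsed."""
        parts = urlsplit(url)
        rp = self._robots_for(parts.netloc, parts.scheme)
        if not rp.can_fetch(USER_AGENT, url):
            return False   # record as a robots denial metric
        delay = rp.crawl_delay(USER_AGENT) or MIN_DELAY_SECONDS
        now = time.monotonic()
        if now < self._next_allowed.get(parts.netloc, 0.0):
            return False   # too soon; requeue the URL instead of fetching
        self._next_allowed[parts.netloc] = now + delay
        return True
```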
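For point 3, a minimal extraction sketch built on the standard-library HTMLParser, which accepts input incrementally through feed() and so fits a streaming fetcher. It collects <img src>, srcset candidates from <img>/<source>, and background-image URLs in inline style attributes, resolving each against the page URL; the streaming HTTP client in the usage comment is hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BG_IMAGE_RE = re.compile(r"background(?:-image)?\s*:\s*url\(\s*['\"]?([^'\")]+)", re.I)

class ImageURLExtractor(HTMLParser):
    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.image_urls: set[str] = set()

    def _add(self, candidate: str) -> None:
        candidate = candidate.strip()
        if candidate and not candidate.startswith("data:"):
            self.image_urls.add(urljoin(self.page_url, candidate))

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"])
        if tag in ("img", "source") and attrs.get("srcset"):
            # srcset is a comma-separated list of "URL [descriptor]" candidates.
            for candidate in attrs["srcset"].split(","):
                parts = candidate.split()
                if parts:
                    self._add(parts[0])
        if attrs.get("style"):
            for match in BG_IMAGE_RE.findall(attrs["style"]):
                self._add(match)

# extractor = ImageURLExtractor("https://example.com/gallery")
# for chunk in response.iter_text():   # hypothetical streaming HTTP client
#     extractor.feed(chunk)
# extractor.close(); image_urls = extractor.image_urls
```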
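For point 4, one assumed dedupe scheme: the primary key is a hash of the normalized image URL, and an optional secondary key hashes cheap response headers (ETag, Content-Length, Content-Type) from a HEAD request, which can catch identical bytes served under different URLs without downloading them.

```python
import hashlib

def url_dedupe_key(normalized_url: str) -> str:
    """Primary dedupe key: stable hash of the normalized image URL."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()

def header_dedupe_key(etag: str | None, content_length: str | None,
                      content_type: str | None) -> str | None:
    """Optional secondary key from HEAD-request headers; None when there is too little signal."""
    if not etag and not content_length:
        return None
    material = f"{etag or ''}|{content_length or ''}|{content_type or ''}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```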
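For point 5, illustrative record shapes only (every field name here is an assumption, not a fixed schema): one row per unique image URL keyed by its hash, plus an edge table linking pages to the images they reference; indexing these tables on host, crawl time, and MIME type supports the queries in point 6.

```python
from dataclasses import dataclass

@dataclass
class ImageRecord:
    image_key: str                  # sha256 of the normalized image URL (partition key)
    image_url: str
    host: str                       # extracted for per-domain queries
    mime_type: str | None           # from Content-Type, if verified
    first_seen_at: int              # epoch seconds
    last_seen_at: int
    width_hint: int | None = None   # from srcset descriptors, if present

@dataclass
class PageImageEdge:
    page_url_key: str               # sha256 of the normalized page URL
    image_key: str
    crawled_at: int
    source: str = "img_src"         # "img_src" | "srcset" | "inline_style"
```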
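For point 6, a hedged sketch of the listing endpoint's shape; the path, parameter names, and cursor format are assumptions. Cursor-based pagination keeps deep pages cheap because the cursor encodes the last (crawled_at, image_key) pair seen rather than an offset.

```python
import base64
import json

def encode_cursor(crawled_at: int, image_key: str) -> str:
    """Opaque, URL-safe pagination cursor."""
    return base64.urlsafe_b64encode(json.dumps([crawled_at, image_key]).encode()).decode()

def decode_cursor(cursor: str) -> tuple[int, str]:
    crawled_at, image_key = json.loads(base64.urlsafe_b64decode(cursor))
    return crawled_at, image_key

# GET /v1/images?domain=example.com&mime_type=image/webp
#     &crawled_after=2024-01-01T00:00:00Z&limit=100&cursor=<opaque>
# -> {"items": [...], "next_cursor": "<opaque or null>"}
```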
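For point 7, a minimal consistent-hash ring for assigning hosts to fetcher shards, so every URL of a host lands on the same per-host queue (preserving politeness) and adding or removing a fetcher moves only a fraction of hosts; the virtual-node count is an assumed tuning knob.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, shards: list[str], vnodes: int = 64):
        self._ring: list[tuple[int, str]] = []
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def shard_for_host(self, host: str) -> str:
        idx = bisect.bisect(self._points, self._hash(host)) % len(self._ring)
        return self._ring[idx][1]

# ring = HashRing(["fetcher-0", "fetcher-1", "fetcher-2"])
# ring.shard_for_host("images.example.com")   # stable shard assignment
```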
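For point 8, because fetching is at-least-once, writes must tolerate replays. A sketch of dedupe-on-write against any store offering an atomic put-if-absent (modeled here with a plain dict; in practice a conditional write such as Cassandra lightweight transactions or a DynamoDB condition expression):

```python
def record_image(store: dict, image_key: str, record: dict, crawled_at: int) -> bool:
    """Insert once; on replay only bump last_seen_at. Returns True if the key was new."""
    existing = store.get(image_key)
    if existing is None:
        store[image_key] = {**record, "first_seen_at": crawled_at, "last_seen_at": crawled_at}
        return True
    # Replays and re-observations commute: taking the max timestamp converges
    # to the same state no matter how many times or in what order they arrive.
    existing["last_seen_at"] = max(existing["last_seen_at"], crawled_at)
    return False
```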
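For point 10, a back-of-envelope sizing with assumed inputs (every number below is an assumption to be replaced with the interviewer's figures):

```python
pages_per_day = 100_000_000          # assumed crawl throughput
avg_images_per_page = 20             # assumed
unique_ratio = 0.25                  # assumed fraction of image URLs not seen before
bytes_per_metadata_row = 500         # URL + hashes + timestamps + index overhead, assumed

image_refs_per_day = pages_per_day * avg_images_per_page                       # 2.0e9 edges/day
new_image_urls_per_day = int(image_refs_per_day * unique_ratio)                # 5.0e8 rows/day
metadata_gb_per_day = new_image_urls_per_day * bytes_per_metadata_row / 1e9    # ~250 GB/day
page_fetches_per_sec = pages_per_day / 86_400                                  # ~1,160 pages/sec

print(f"{image_refs_per_day:.1e} image refs/day, {new_image_urls_per_day:.1e} new URLs/day")
print(f"~{metadata_gb_per_day:.0f} GB/day of metadata, ~{page_fetches_per_sec:.0f} page fetches/sec")
```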