Design an image crawler for unlimited URLs
Company: Atlassian
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design a service that crawls images starting from a set of root URLs.
Requirements:
- Input: one or more root URLs.
- Crawl pages, discover links, and download image resources.
- Support an **unlimited number of root URLs** and **unlimited crawl depth**.
- Must handle failures (network errors, timeouts, crashes) and avoid re-crawling the same URL excessively.
- Discuss storage for downloaded images and metadata.
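A candidate could start from a single-process baseline before distributing it. The sketch below is illustrative only (the `fetch` callback and all names are assumptions, not part of the question): a BFS frontier, a visited set for deduplication, and bounded retries for failed fetches.

```python
import collections

def crawl(roots, fetch, max_retries=3):
    """Breadth-first crawl from root URLs.

    `fetch(url)` is assumed to return (links, image_urls) and may raise
    on network errors; a failed URL is retried up to `max_retries` total
    attempts. The visited set prevents re-crawling the same URL.
    """
    frontier = collections.deque((url, 0) for url in roots)  # (url, attempts)
    visited, images, failed = set(), set(), set()
    while frontier:
        url, attempts = frontier.popleft()
        if url in visited:
            continue  # duplicates may sit in the frontier; skip them here
        try:
            links, image_urls = fetch(url)
        except Exception:
            if attempts + 1 < max_retries:
                frontier.append((url, attempts + 1))  # re-enqueue for retry
            else:
                failed.add(url)  # give up after max_retries attempts
            continue
        visited.add(url)
        images.update(image_urls)
        for link in links:
            if link not in visited:
                frontier.append((link, 0))
    return visited, images, failed
```

In the distributed version, the deque becomes a durable queue, the visited set becomes a shared store (e.g. keyed by URL hash), and retries get backoff, but the control flow stays the same.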
Deliverables:
- High-level architecture (components, data flow).
- Queue/scheduler design and politeness (per-host rate limiting).
- Deduplication strategy.
- DB schema for crawl state and results.
- Failure/retry model and monitoring.
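For the schema deliverable, one possible minimal design tracks per-URL crawl state (for retries and crash recovery) and stores image metadata with a pointer into object storage. Sketched with SQLite for illustration; all table and column names are assumptions:

```python
import sqlite3

SCHEMA = """
CREATE TABLE crawl_state (
    url         TEXT PRIMARY KEY,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending | in_progress | done | failed
    attempts    INTEGER NOT NULL DEFAULT 0,
    last_error  TEXT,
    updated_at  TEXT
);
CREATE TABLE images (
    url_hash     TEXT PRIMARY KEY,  -- hash of image URL for URL-level dedup
    source_page  TEXT NOT NULL REFERENCES crawl_state(url),
    blob_key     TEXT NOT NULL,     -- pointer into object storage (blobs live outside the DB)
    content_hash TEXT,              -- hash of image bytes for content-level dedup
    fetched_at   TEXT
);
CREATE INDEX idx_state_status ON crawl_state(status);  -- workers poll by status
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Keeping image bytes in object storage and only metadata in the database is the usual split; the `status` + `attempts` columns let a scheduler reclaim work left `in_progress` by a crashed worker.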
Quick Answer: This question evaluates a candidate's ability to design a scalable, fault-tolerant distributed web crawler, covering concurrency, queueing and scheduling, deduplication, storage architecture, and observability.
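On the politeness point, interviewers often look for per-host rate limiting in the scheduler. A minimal sketch, assuming a simple minimum-delay policy per host (the class and method names are illustrative; the injectable `clock` exists only to make the logic testable):

```python
import time

class HostPoliteness:
    """Enforce a minimum delay between requests to the same host.

    The scheduler calls wait_time() before dispatching a URL and
    record_request() after dispatching it. `clock` defaults to
    time.monotonic and is injectable for deterministic tests.
    """
    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time the next request may go out

    def wait_time(self, host):
        """Seconds to wait before `host` may be fetched again (0.0 if ready)."""
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_request(self, host):
        """Record a request to `host`; the next one is allowed after min_interval."""
        self.next_allowed[host] = self.clock() + self.min_interval
```

In a distributed crawler the same idea is usually realized by sharding the frontier so that each host maps to one worker queue, which makes the per-host delay enforceable without shared locks.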