PracHub

Design crawler storing only image URLs

Last updated: Mar 29, 2026

Quick Overview

This System Design question tests your ability to design a scalable, fault-tolerant web crawler, including HTML parsing, URL normalization, deduplication, storage and indexing, and query API design.

Company: Atlassian

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Date: Sep 6, 2025

System Design: Image-URL Crawler (URLs only, no HTML storage)

Context

Design a production web crawler that fetches HTML pages and extracts only image URLs. Do not store full HTML bodies. Sources of image URLs include:

  • <img src="...">
  • <source srcset="..."> within <picture>
  • Inline CSS styles (e.g., style="background-image: url('...')")

Assume this crawler will run continuously at scale and must support query APIs.
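
To make the three URL sources above concrete, here is a minimal sketch of an extractor built on Python's standard-library html.parser. The names ImageURLExtractor and parse_srcset, the example.com URLs, and the decision to skip data: URIs are assumptions for illustration and are not part of the question. The srcset split on commas is deliberately naive and would mis-split URLs that themselves contain commas.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(...) inside an inline style value; quotes are optional.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


def parse_srcset(value):
    """Return the URL part of each comma-separated srcset candidate (naive split)."""
    urls = []
    for candidate in value.split(","):
        parts = candidate.strip().split()
        if parts:
            urls.append(parts[0])  # drop the width/density descriptor
    return urls


class ImageURLExtractor(HTMLParser):
    """Collects absolute image URLs; page text is never stored."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.image_urls = []

    def _add(self, raw):
        raw = raw.strip()
        # Assumption: inline data: URIs are not worth recording.
        if raw and not raw.startswith("data:"):
            self.image_urls.append(urljoin(self.base_url, raw))

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"])
        if tag in ("img", "source") and attrs.get("srcset"):
            for url in parse_srcset(attrs["srcset"]):
                self._add(url)
        style = attrs.get("style")
        if style and "background" in style.lower():
            for match in CSS_URL_RE.findall(style):
                self._add(match)


html = (
    '<picture><source srcset="a-1x.webp 1x, a-2x.webp 2x">'
    '<img src="/img/a.png"></picture>'
    "<div style=\"background-image: url('/bg/hero.jpg')\"></div>"
)
extractor = ImageURLExtractor("https://example.com/gallery/")
# feed() accepts partial input, so chunks can be passed as they arrive.
for i in range(0, len(html), 16):
    extractor.feed(html[i:i + 16])
extractor.close()
print(extractor.image_urls)
```

Because the parser is fed chunk by chunk and keeps only the extracted URLs, the full HTML body never needs to be held or stored, which matches the constraint in the question.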

Requirements

  1. High-level architecture
    • URL frontier/scheduler
    • Fetchers
    • Parsers
    • Deduplication
    • Storage and indexing
    • Control plane
  2. Crawl politeness and compliance
    • robots.txt handling
    • Per-host rate limiting
    • Retries/backoff
    • User-agent identification
    • Canonicalization and URL normalization
    • Avoiding traps
  3. Parsing at scale
    • Streaming parsers
    • Charset handling
    • Content-type verification
    • Managing redirects
  4. Deduplication strategies
    • Normalized URL keys
    • Hash-based dedupe of image content or headers
    • Handling srcset and relative URLs
  5. Storage design and schema
    • For images and page–image relationships
    • Propose DB choices: key-value for frontier, document/column store for metadata, object store if you later fetch binaries for validation
  6. Query and API design
    • Endpoints to list images by domain, by crawl time, by MIME type
    • Pagination and filters
  7. Sharding and scaling
    • Per-host queues
    • Consistent hashing
    • Horizontal scaling of fetchers/parsers
  8. Fault tolerance and idempotency
    • At-least-once fetching
    • De-dup on write
    • Replay safety
  9. Monitoring, metrics, and alerts
    • Crawl rate, error codes, robots denials, queue depth, unique image URL rate
  10. Capacity planning
    • State assumptions and rough sizing
    • Data retention and privacy considerations
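
Requirements 4 and 8 (normalized URL keys, de-dup on write, replay safety) can be sketched together. The following is one possible approach, not a reference answer: the normalization policy (lowercasing scheme and host, dropping default ports, fragments, and a small hypothetical set of tracking parameters, sorting the query) is an assumption, and ImageStore is a hypothetical in-memory stand-in for a real store with put-if-absent semantics.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: an illustrative list of parameters that do not change the image.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of the URL for use as a dedupe key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # also drops any userinfo
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped


def url_key(url):
    """Fixed-width key derived from the normalized URL."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


class ImageStore:
    """Hypothetical in-memory store; writes are idempotent by construction."""

    def __init__(self):
        self.images = {}         # image key -> normalized image URL
        self.page_edges = set()  # (page key, image key) relationships

    def record(self, page_url, image_url):
        """Record one page-image edge; return True if the image URL is new."""
        image_key = url_key(image_url)
        is_new = image_key not in self.images
        self.images.setdefault(image_key, normalize_url(image_url))
        self.page_edges.add((url_key(page_url), image_key))
        return is_new


store = ImageStore()
page = "https://example.com/gallery/"
first = store.record(page, "HTTPS://Example.COM:443/img/a.png?w=200&utm_source=x#top")
# Replaying the same page (at-least-once delivery) must not create duplicates.
second = store.record(page, "https://example.com/img/a.png?w=200")
print(first, second, len(store.images), len(store.page_edges))
```

Keying both tables on a hash of the normalized URL means a retried or replayed fetch rewrites the same rows instead of adding new ones, which is what makes at-least-once processing safe here.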
