
Implement a high-throughput web crawler safely

Last updated: Mar 29, 2026

Quick Overview

This question tests concurrent-systems design, scalable data ingestion, and algorithmic trade-offs for a high-throughput web crawler: URL normalization, deduplication, per-host rate limiting, fault-tolerant checkpointing, and bounded-memory queueing.


Implement a high-throughput web crawler safely

Company: Amazon

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Design and code (pseudocode acceptable) a multi-threaded web crawler that favors breadth-first discovery while continuously running analysis tasks on fetched pages.

Constraints:

  1) Respect robots.txt and per-host rate limits.
  2) Dedupe URLs (including normalization and canonical redirects) with at-most-once processing.
  3) Bounded memory: spill queues to disk when needed.
  4) Back-pressure so analysis cannot starve crawling and vice versa.
  5) Graceful shutdown and exactly-once checkpointing on restart.

Answer details:

  a) Define your core data structures (frontier, seen-set, per-host token buckets) and their big-O behavior.
  b) Show how you avoid deadlocks and priority inversion (e.g., work-stealing, fine-grained locks, or lock-free queues) and how you detect and handle thread-safety bugs.
  c) Explain how BFS ordering degrades with many hosts and how you would approximate it.
  d) Specify the metrics you would track to verify a 50%+ throughput gain, and how you'd run a controlled benchmark to prove it.
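For part (a), and for the round-robin approximation of BFS asked about in part (c), here is a minimal single-threaded Python sketch of the core structures: a per-host token bucket, a frontier built as one FIFO per host, and a hash-based seen-set. The rate and burst defaults are illustrative assumptions; a production version would add locking and the disk spill from constraint 3.

import time
from collections import defaultdict, deque

class TokenBucket:
    """Per-host rate limiter; each try_acquire() is O(1)."""
    def __init__(self, rate_per_sec=1.0, burst=5):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class Frontier:
    """One FIFO per host plus a round-robin ring of ready hosts.

    FIFO order preserves BFS within a host; rotating across hosts
    approximates global BFS once many hosts interleave. push/pop are
    O(1) amortized, and the seen-set check is O(1) expected.
    """
    def __init__(self):
        self.queues = defaultdict(deque)         # host -> FIFO of URLs
        self.ring = deque()                      # hosts with pending URLs
        self.buckets = defaultdict(TokenBucket)  # host -> rate limiter
        self.seen = set()                        # normalized URLs ever enqueued

    def push(self, host, url):
        if url in self.seen:
            return                               # at-most-once enqueue
        self.seen.add(url)
        if not self.queues[host]:
            self.ring.append(host)
        self.queues[host].append(url)

    def pop(self):
        # At most one full rotation; skip hosts that are throttled.
        for _ in range(len(self.ring)):
            host = self.ring[0]
            self.ring.rotate(-1)
            if self.buckets[host].try_acquire():
                url = self.queues[host].popleft()
                if not self.queues[host]:
                    self.ring.remove(host)       # drained; O(hosts) but rare
                return url
        return None  # every host with work is rate-limited; caller retries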

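For constraint 2, deduplication is only as strong as URL canonicalization. Below is a sketch using only the standard library's urllib.parse; the exact rules chosen here (lower-casing, default-port stripping, fragment removal, query sorting) are assumptions to tune per corpus, and canonical redirects still need a second seen-set check keyed on the final resolved URL after the fetch.

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize(url):
    """Canonical form used as the seen-set key."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme, netloc = scheme.lower(), netloc.lower()
    host, _, port = netloc.rpartition(":")
    if (scheme, port) in (("http", "80"), ("https", "443")):
        netloc = host                            # strip the default port
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collide.
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path or "/", query, ""))

# normalize("HTTP://Example.COM:80/a?b=2&a=1#top") -> "http://example.com/a?a=1&b=2"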
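For constraint 4, bounded blocking queues between the fetcher and analyzer pools give back-pressure in both directions: a full queue pauses producers instead of letting memory grow without limit, and separate pools keep either stage from starving the other. In this sketch, download and analyze are hypothetical callables standing in for real fetch and analysis code, and with N workers per stage you would enqueue N sentinels at shutdown.

import queue

fetch_q = queue.Queue(maxsize=10_000)    # URLs awaiting download
analyze_q = queue.Queue(maxsize=1_000)   # fetched pages awaiting analysis
STOP = object()                          # graceful-shutdown sentinel

def fetcher(download):                   # download: hypothetical fetch callable
    while True:
        url = fetch_q.get()
        if url is STOP:
            analyze_q.put(STOP)          # propagate shutdown downstream
            return
        # put() blocks when analyze_q is full, so slow analysis
        # throttles fetching instead of exhausting memory.
        analyze_q.put(download(url))

def analyzer(analyze):                   # analyze: hypothetical analysis callable
    while True:
        page = analyze_q.get()
        if page is STOP:
            return
        analyze(page)

# Start each stage in its own pool of threads, e.g.
# threading.Thread(target=fetcher, args=(my_download,)).start()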

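For constraint 5, a common crash-safety pattern for checkpoints is write-temp-then-rename: because the rename is atomic, a crash leaves either the old checkpoint or the new one on disk, never a torn file. The JSON layout below is an illustrative assumption; exactly-once restart then means reloading this state and marking a URL done only after its analysis output is durably written.

import json, os, tempfile

def write_checkpoint(path, frontier_state, seen):
    """Atomically replace the checkpoint at `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"frontier": frontier_state, "seen": sorted(seen)}, f)
            f.flush()
            os.fsync(f.fileno())         # make the bytes durable before the swap
        os.replace(tmp, path)            # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise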
Related Interview Questions

  • Find Unique Target-Sum Pairs - Amazon (easy)
  • Find Valid IP Addresses in Files - Amazon (medium)
  • Implement Optimal Bucket Batching - Amazon (hard)
  • Implement Cache and Rotate Matrix - Amazon (medium)
  • Find Longest Activatable Server Streak - Amazon (hard)