PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of breadth-first traversal, URL normalization, domain scoping, duplicate detection, and retry/error handling, measuring algorithmic design and robustness in networked code.

  • medium
  • Amazon
  • Coding & Algorithms
  • Software Engineer

Build a BFS Web Crawler

Company: Amazon

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a web crawler for a single website. You are given: - A starting URL `startUrl` - A function `fetch(url)` that returns the HTML content of a page or raises an error - A function `extractLinks(html)` that returns all links found on the page Write a crawler that: 1. Traverses pages in breadth-first order starting from `startUrl`. 2. Only visits pages in the same domain as `startUrl`. 3. Normalizes URLs by resolving relative paths, removing fragments, and treating equivalent URLs as the same page. 4. Retries transient fetch failures up to 3 times before giving up. 5. Skips permanently failed pages. 6. Never visits the same normalized URL more than once. Return the list of visited URLs in BFS order. In your implementation, be prepared to explain your queue and visited-set design, how you detect duplicate URLs, and how retry logic interacts with the crawl loop.

Quick Answer: This question evaluates understanding of breadth-first traversal, URL normalization, domain scoping, duplicate detection, and retry/error handling, measuring algorithmic design and robustness in networked code.

Implement a breadth-first web crawler for a single website. To keep the problem self-contained, real network calls are simulated. You are given: - `startUrl`: the URL where the crawl begins - `pages`: a dictionary where `pages[normalized_url]` is the list of raw links found on that page, in document order - `failures`: a dictionary describing fetch behavior for each normalized URL Your crawler must: 1. Traverse pages in breadth-first order starting from `startUrl`. 2. Only crawl pages whose hostname exactly matches the hostname of `startUrl` (case-insensitive). 3. Normalize every discovered URL by: - resolving relative links against the current page - removing fragments (`#...`) - lowercasing scheme and host - removing default ports (`:80` for `http`, `:443` for `https`) - normalizing `.` and `..` path segments - treating an empty path as `/` 4. Retry transient fetch failures up to 3 times before giving up. That means a URL can be attempted at most 4 times total: 1 initial attempt + 3 retries. 5. Skip permanently failed pages. 6. Never process the same normalized URL more than once, even if it is discovered multiple times. Fetch simulation rules: - If `failures[url]` is missing or `0`, fetching `url` succeeds immediately. - If `failures[url] = k` and `k > 0`, the first `k` fetch attempts fail transiently, and the next attempt succeeds. - If `failures[url] = -1`, the page fails permanently and never succeeds. - If a URL is not present in `pages`, treat it as an unfetchable page and skip it. Return the list of successfully visited normalized URLs in BFS order.

Constraints

  • `0 <= len(pages) <= 10^4`
  • The total number of raw links across all pages is at most `2 * 10^5`
  • Each URL length is at most 200 characters
  • `failures[url]` is either `-1` or a non-negative integer
  • All keys in `pages` and `failures` refer to normalized absolute URLs

Examples

Input: ('https://Example.com:443/a/index.html#top', {'https://example.com/a/index.html': ['../about', '/contact#team', 'https://other.com/home', '../about#bio', '/a/./products'], 'https://example.com/about': ['team', '/contact', 'https://example.com:443/a/index.html'], 'https://example.com/contact': ['/missing', '/about#again'], 'https://example.com/a/products': [], 'https://example.com/team': []}, {'https://example.com/contact': 1, 'https://example.com/missing': -1})

Expected Output: ['https://example.com/a/index.html', 'https://example.com/about', 'https://example.com/contact', 'https://example.com/a/products', 'https://example.com/team']

Explanation: The crawler normalizes the start URL, ignores the external domain, deduplicates repeated links to /about, retries /contact once before succeeding, and skips /missing because it permanently fails.

Input: ('http://site.com', {'http://site.com/': ['/slow', '/ok', '/ok#frag'], 'http://site.com/slow': [], 'http://site.com/ok': ['/child'], 'http://site.com/child': []}, {'http://site.com/slow': 4})

Expected Output: ['http://site.com/', 'http://site.com/ok', 'http://site.com/child']

Explanation: The root page is visited first. /slow is discovered before /ok, but it needs 4 transient failures before success, which exceeds the retry limit of 3 retries, so it is skipped. /ok and then /child are visited.

Input: ('https://a.com/start', {'https://a.com/start': ['/x'], 'https://a.com/x': []}, {'https://a.com/start': -1})

Expected Output: []

Explanation: The start page permanently fails, so nothing is successfully visited.

Input: ('https://docs.site.com/guide/', {'https://docs.site.com/guide/': ['./intro', '../guide/./faq#top', 'https://docs.site.com:443/guide/intro', 'https://blog.site.com/post', '../guide/intro/../api'], 'https://docs.site.com/guide/intro': [], 'https://docs.site.com/guide/faq': [], 'https://docs.site.com/guide/api': []}, {})

Expected Output: ['https://docs.site.com/guide/', 'https://docs.site.com/guide/intro', 'https://docs.site.com/guide/faq', 'https://docs.site.com/guide/api']

Explanation: The crawler preserves BFS order, resolves relative links from a directory URL, removes fragments and default ports, ignores the different hostname blog.site.com, and visits each normalized URL once.

Hints

  1. Use a queue for BFS, and add a URL to the seen set when you enqueue it, not when you dequeue it.
  2. Normalize every link before both hostname checking and duplicate detection; otherwise equivalent URLs may be crawled more than once.
Last updated: Apr 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Datacenter Router Commands - Amazon (hard)
  • Replace Delimited Tokens in a String - Amazon (medium)
  • Minimize Circular Redistribution Cost - Amazon (medium)
  • Find the Most Common Visit Pattern - Amazon (hard)
  • Maximize Value Under a Budget - Amazon (medium)