How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Amazon.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Amazon during technical interviews.

Build a BFS Web Crawler | Amazon Coding Question

Build a BFS Web Crawler

Company: Amazon

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a web crawler for a single website. You are given: - A starting URL `startUrl` - A function `fetch(url)` that returns the HTML content of a page or raises an error - A function `extractLinks(html)` that returns all links found on the page Write a crawler that: 1. Traverses pages in breadth-first order starting from `startUrl`. 2. Only visits pages in the same domain as `startUrl`. 3. Normalizes URLs by resolving relative paths, removing fragments, and treating equivalent URLs as the same page. 4. Retries transient fetch failures up to 3 times before giving up. 5. Skips permanently failed pages. 6. Never visits the same normalized URL more than once. Return the list of visited URLs in BFS order. In your implementation, be prepared to explain your queue and visited-set design, how you detect duplicate URLs, and how retry logic interacts with the crawl loop.

Quick Answer: This question evaluates understanding of breadth-first traversal, URL normalization, domain scoping, duplicate detection, and retry/error handling, measuring algorithmic design and robustness in networked code.

Implement a breadth-first web crawler for a single website. To keep the problem self-contained, real network calls are simulated. You are given: - `startUrl`: the URL where the crawl begins - `pages`: a dictionary where `pages[normalized_url]` is the list of raw links found on that page, in document order - `failures`: a dictionary describing fetch behavior for each normalized URL Your crawler must: 1. Traverse pages in breadth-first order starting from `startUrl`. 2. Only crawl pages whose hostname exactly matches the hostname of `startUrl` (case-insensitive). 3. Normalize every discovered URL by: - resolving relative links against the current page - removing fragments (`#...`) - lowercasing scheme and host - removing default ports (`:80` for `http`, `:443` for `https`) - normalizing `.` and `..` path segments - treating an empty path as `/` 4. Retry transient fetch failures up to 3 times before giving up. That means a URL can be attempted at most 4 times total: 1 initial attempt + 3 retries. 5. Skip permanently failed pages. 6. Never process the same normalized URL more than once, even if it is discovered multiple times. Fetch simulation rules: - If `failures[url]` is missing or `0`, fetching `url` succeeds immediately. - If `failures[url] = k` and `k > 0`, the first `k` fetch attempts fail transiently, and the next attempt succeeds. - If `failures[url] = -1`, the page fails permanently and never succeeds. - If a URL is not present in `pages`, treat it as an unfetchable page and skip it. Return the list of successfully visited normalized URLs in BFS order.

Constraints

`0 <= len(pages) <= 10^4`
The total number of raw links across all pages is at most `2 * 10^5`
Each URL length is at most 200 characters
`failures[url]` is either `-1` or a non-negative integer
All keys in `pages` and `failures` refer to normalized absolute URLs

Examples

Input: ('https://Example.com:443/a/index.html#top', {'https://example.com/a/index.html': ['../about', '/contact#team', 'https://other.com/home', '../about#bio', '/a/./products'], 'https://example.com/about': ['team', '/contact', 'https://example.com:443/a/index.html'], 'https://example.com/contact': ['/missing', '/about#again'], 'https://example.com/a/products': [], 'https://example.com/team': []}, {'https://example.com/contact': 1, 'https://example.com/missing': -1})

Expected Output: ['https://example.com/a/index.html', 'https://example.com/about', 'https://example.com/contact', 'https://example.com/a/products', 'https://example.com/team']

Explanation: The crawler normalizes the start URL, ignores the external domain, deduplicates repeated links to /about, retries /contact once before succeeding, and skips /missing because it permanently fails.

Input: ('http://site.com', {'http://site.com/': ['/slow', '/ok', '/ok#frag'], 'http://site.com/slow': [], 'http://site.com/ok': ['/child'], 'http://site.com/child': []}, {'http://site.com/slow': 4})

Expected Output: ['http://site.com/', 'http://site.com/ok', 'http://site.com/child']

Explanation: The root page is visited first. /slow is discovered before /ok, but it needs 4 transient failures before success, which exceeds the retry limit of 3 retries, so it is skipped. /ok and then /child are visited.

Input: ('https://a.com/start', {'https://a.com/start': ['/x'], 'https://a.com/x': []}, {'https://a.com/start': -1})

Expected Output: []

Explanation: The start page permanently fails, so nothing is successfully visited.

Input: ('https://docs.site.com/guide/', {'https://docs.site.com/guide/': ['./intro', '../guide/./faq#top', 'https://docs.site.com:443/guide/intro', 'https://blog.site.com/post', '../guide/intro/../api'], 'https://docs.site.com/guide/intro': [], 'https://docs.site.com/guide/faq': [], 'https://docs.site.com/guide/api': []}, {})

Expected Output: ['https://docs.site.com/guide/', 'https://docs.site.com/guide/intro', 'https://docs.site.com/guide/faq', 'https://docs.site.com/guide/api']

Explanation: The crawler preserves BFS order, resolves relative links from a directory URL, removes fragments and default ports, ignores the different hostname blog.site.com, and visits each normalized URL once.

Hints

Use a queue for BFS, and add a URL to the seen set when you enqueue it, not when you dequeue it.
Normalize every link before both hostname checking and duplicate detection; otherwise equivalent URLs may be crawled more than once.

Quick Overview

This question evaluates understanding of breadth-first traversal, URL normalization, domain scoping, duplicate detection, and retry/error handling, measuring algorithmic design and robustness in networked code.