Implement a BFS web crawler
Company: Nooks
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
A web crawler starts from a URL and explores reachable pages level by level.
You are given:
- `start_url: string`
- An interface `fetch_links(url: string) -> List[string]`, which returns all hyperlinks found on that page
Write a function that returns all pages reachable from `start_url` that belong to the same hostname as `start_url`.
Requirements:
- Use breadth-first traversal.
- Visit each URL at most once.
- Ignore links whose hostname differs from the hostname of `start_url`.
- `fetch_links` may return duplicate links.
- You may return the result in BFS visitation order or in any order, as long as all valid reachable pages are included exactly once.
Possible follow-up discussion:
- How do you avoid revisiting the same page?
- What are the time and space complexities?
Quick Answer: This question evaluates proficiency in breadth-first search and graph traversal concepts applied to web crawling, including handling duplicate links, hostname filtering, and tracking visited URLs.
A web crawler starts from a URL and explores reachable pages level by level.
In this problem, the external interface `fetch_links(url)` is simulated by a dictionary named `pages`, where `pages[url]` is the list of hyperlinks found on that page. If a URL does not appear in `pages`, treat it as a page with no outgoing links.
Write a function that returns every page reachable from `start_url` that has the same hostname as `start_url`.
Requirements:
- Use breadth-first traversal.
- Visit each URL at most once.
- Ignore links whose hostname differs from the hostname of `start_url`.
- `pages[url]` may contain duplicate links.
- For deterministic grading, return the URLs in BFS visitation order, processing links in the order they appear in `pages[url]`.
Constraints
- 1 <= len(start_url) <= 300
- 0 <= len(pages) <= 10^4
- The total number of links across all lists in `pages` is at most 10^5
- All URLs are well-formed absolute URLs starting with `http://` or `https://`
Hints
- A queue is the standard way to process nodes level by level in breadth-first search.
- Use a set to mark URLs as visited when you enqueue them, and compare each link's hostname to the hostname of `start_url`.