PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates proficiency in breadth-first search and graph traversal concepts applied to web crawling, including handling duplicate links, hostname filtering, and tracking visited URLs.

  • medium
  • Nooks
  • Coding & Algorithms
  • Software Engineer

Implement a BFS web crawler

Company: Nooks

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

A web crawler starts from a URL and explores reachable pages level by level. You are given: - `start_url: string` - An interface `fetch_links(url: string) -> List[string]`, which returns all hyperlinks found on that page Write a function that returns all pages reachable from `start_url` that belong to the same hostname as `start_url`. Requirements: - Use breadth-first traversal. - Visit each URL at most once. - Ignore links whose hostname differs from the hostname of `start_url`. - `fetch_links` may return duplicate links. - You may return the result in BFS visitation order or in any order, as long as all valid reachable pages are included exactly once. Possible follow-up discussion: - How do you avoid revisiting the same page? - What are the time and space complexities?

Quick Answer: This question evaluates proficiency in breadth-first search and graph traversal concepts applied to web crawling, including handling duplicate links, hostname filtering, and tracking visited URLs.

A web crawler starts from a URL and explores reachable pages level by level. In this problem, the external interface `fetch_links(url)` is simulated by a dictionary named `pages`, where `pages[url]` is the list of hyperlinks found on that page. If a URL does not appear in `pages`, treat it as a page with no outgoing links. Write a function that returns every page reachable from `start_url` that has the same hostname as `start_url`. Requirements: - Use breadth-first traversal. - Visit each URL at most once. - Ignore links whose hostname differs from the hostname of `start_url`. - `pages[url]` may contain duplicate links. - For deterministic grading, return the URLs in BFS visitation order, processing links in the order they appear in `pages[url]`.

Constraints

  • 1 <= len(start_url) <= 300
  • 0 <= len(pages) <= 10^4
  • The total number of links across all lists in `pages` is at most 10^5
  • All URLs are well-formed absolute URLs starting with `http://` or `https://`

Hints

  1. A queue is the standard way to process nodes level by level in breadth-first search.
  2. Use a set to mark URLs as visited when you enqueue them, and compare each link's hostname to the hostname of `start_url`.
Last updated: Jun 12, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.