PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's ability to implement graph traversal algorithms (BFS and DFS), manage visited-state to prevent cycles and duplicates, and reason about recursion limits, immutable versus mutable state, error handling, and concurrency concerns in a web-crawling context.

  • medium
  • Google
  • Coding & Algorithms
  • Machine Learning Engineer

Implement a Web Crawler with BFS and DFS

Company: Google

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a simple web crawler in Python. You are given: - A starting URL. - A function `get_links(url) -> list[str]` that returns all outgoing links from a URL. - A limit `max_pages`, the maximum number of pages to visit. Implement both: 1. A depth-first-search crawler. 2. A breadth-first-search crawler. Your crawler should: - Visit each URL at most once. - Stop after visiting `max_pages` pages. - Avoid infinite loops caused by cycles in the link graph. - Be robust to duplicate links. - Clearly define the order in which URLs are visited. Follow-up discussion topics: - What bugs can occur if the crawler uses recursion for DFS? - How can Python's recursion stack limit affect the implementation? - Why should the visited set store immutable, hashable URL strings rather than mutable objects? - How would you avoid bugs caused by shared mutable instance state across multiple crawler runs? - How would you extend the design to handle timeouts, failed requests, and concurrent crawling?

Quick Answer: This question evaluates a candidate's ability to implement graph traversal algorithms (BFS and DFS), manage visited-state to prevent cycles and duplicates, and reason about recursion limits, immutable versus mutable state, error handling, and concurrency concerns in a web-crawling context.

Part 1: Implement a Depth-First Web Crawler

You are given a directed graph of web pages, where each key is a URL string and its value is a list of outgoing links from that page. Also given are a starting URL and an integer max_pages. Write a function that simulates a depth-first web crawler and returns the URLs in the exact order they are visited. Rules: - Visit each URL at most once. - Stop as soon as max_pages URLs have been visited. - Handle cycles without looping forever. - Ignore duplicate links to the same URL. - If a URL does not appear as a key in the graph, treat it as a page with no outgoing links. - The start_url counts as the first visited page if max_pages > 0. Traversal order must be deterministic: - Use DFS with an explicit stack. - When expanding a page, its outgoing links should be visited in the same order they appear in the list. - Since a stack is LIFO, this means you should push neighbors onto the stack in reverse order.

Constraints

  • 0 <= max_pages <= 100000
  • graph maps URL strings to lists of URL strings
  • Links may contain duplicates
  • The graph may contain cycles and self-loops
  • A URL missing from graph should be treated as having no outgoing links

Examples

Input: ({'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D'], 'D': []}, 'A', 10)

Expected Output: ['A', 'B', 'D', 'C']

Explanation: DFS starts at A. With neighbor order preserved, it visits B before C, then D from B, and finally C. The cycle back to A is ignored.

Input: ({'A': ['B', 'B', 'C'], 'B': ['D', 'D'], 'C': [], 'D': []}, 'A', 3)

Expected Output: ['A', 'B', 'D']

Explanation: Duplicate links are ignored, and the crawl stops after 3 visited pages.

Input: ({'A': ['B'], 'B': ['A']}, 'A', 0)

Expected Output: []

Explanation: Edge case: when max_pages is 0, nothing is visited.

Input: ({'A': ['B']}, 'X', 2)

Expected Output: ['X']

Explanation: Edge case: start_url is not in the graph, so it is still visited once and has no outgoing links.

Hints

  1. Use a stack, not recursion, so you do not depend on Python's recursion limit.
  2. To make DFS follow the listed neighbor order, push neighbors onto the stack from right to left.

Part 2: Implement a Breadth-First Web Crawler

You are given a directed graph of web pages, where each key is a URL string and its value is a list of outgoing links from that page. Also given are a starting URL and an integer max_pages. Write a function that simulates a breadth-first web crawler and returns the URLs in the exact order they are visited. Rules: - Visit each URL at most once. - Stop as soon as max_pages URLs have been visited. - Handle cycles without looping forever. - Ignore duplicate links to the same URL. - If a URL does not appear as a key in the graph, treat it as a page with no outgoing links. - The start_url counts as the first visited page if max_pages > 0. Traversal order must be deterministic: - Use BFS with a queue. - When expanding a page, enqueue outgoing links in the same order they appear in the list.

Constraints

  • 0 <= max_pages <= 100000
  • graph maps URL strings to lists of URL strings
  • Links may contain duplicates
  • The graph may contain cycles and self-loops
  • A URL missing from graph should be treated as having no outgoing links

Examples

Input: ({'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D', 'E'], 'D': [], 'E': []}, 'A', 10)

Expected Output: ['A', 'B', 'C', 'D', 'E']

Explanation: BFS visits all nodes one level at a time: first A, then B and C, then D and E.

Input: ({'A': ['B', 'B', 'C'], 'B': ['D'], 'C': ['D', 'E'], 'D': [], 'E': []}, 'A', 4)

Expected Output: ['A', 'B', 'C', 'D']

Explanation: Duplicate links are ignored, and the crawl stops after 4 visited pages.

Input: ({'A': ['B'], 'B': ['A']}, 'A', 0)

Expected Output: []

Explanation: Edge case: when max_pages is 0, nothing is visited.

Input: ({}, 'X', 3)

Expected Output: ['X']

Explanation: Edge case: an empty graph still allows the starting URL to be visited once.

Hints

  1. Use a queue to process pages level by level from the starting URL.
  2. Mark a URL as seen when you enqueue it, not after you dequeue it, to avoid duplicates entering the queue multiple times.
Last updated: May 23, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Solve Three Array and Matrix Path Problems - Google (medium)
  • Solve Flower Placement and Directory Deletion - Google (medium)
  • Simulate In-Place Cellular State Updates - Google (hard)
  • Compute Turnstile Crossing Times - Google (hard)
  • Determine Whether a Word Exists in a Graph - Google (medium)