How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Google.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Google during technical interviews.

Implement a Web Crawler with BFS and DFS

Quick Overview

This question evaluates a candidate's ability to implement graph traversal algorithms (BFS and DFS), manage visited-state to prevent cycles and duplicates, and reason about recursion limits, immutable versus mutable state, error handling, and concurrency concerns in a web-crawling context.

Company: Google

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a simple web crawler in Python. You are given: - A starting URL. - A function `get_links(url) -> list[str]` that returns all outgoing links from a URL. - A limit `max_pages`, the maximum number of pages to visit. Implement both: 1. A depth-first-search crawler. 2. A breadth-first-search crawler. Your crawler should: - Visit each URL at most once. - Stop after visiting `max_pages` pages. - Avoid infinite loops caused by cycles in the link graph. - Be robust to duplicate links. - Clearly define the order in which URLs are visited. Follow-up discussion topics: - What bugs can occur if the crawler uses recursion for DFS? - How can Python's recursion stack limit affect the implementation? - Why should the visited set store immutable, hashable URL strings rather than mutable objects? - How would you avoid bugs caused by shared mutable instance state across multiple crawler runs? - How would you extend the design to handle timeouts, failed requests, and concurrent crawling?

Quick Answer: This question evaluates a candidate's ability to implement graph traversal algorithms (BFS and DFS), manage visited-state to prevent cycles and duplicates, and reason about recursion limits, immutable versus mutable state, error handling, and concurrency concerns in a web-crawling context.

Part 1: Implement a Depth-First Web Crawler

You are given a directed graph of web pages, where each key is a URL string and its value is a list of outgoing links from that page. Also given are a starting URL and an integer max_pages. Write a function that simulates a depth-first web crawler and returns the URLs in the exact order they are visited. Rules: - Visit each URL at most once. - Stop as soon as max_pages URLs have been visited. - Handle cycles without looping forever. - Ignore duplicate links to the same URL. - If a URL does not appear as a key in the graph, treat it as a page with no outgoing links. - The start_url counts as the first visited page if max_pages > 0. Traversal order must be deterministic: - Use DFS with an explicit stack. - When expanding a page, its outgoing links should be visited in the same order they appear in the list. - Since a stack is LIFO, this means you should push neighbors onto the stack in reverse order.

Constraints

0 <= max_pages <= 100000
graph maps URL strings to lists of URL strings
Links may contain duplicates
The graph may contain cycles and self-loops
A URL missing from graph should be treated as having no outgoing links

Examples

Input: ({'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D'], 'D': []}, 'A', 10)

Expected Output: ['A', 'B', 'D', 'C']

Explanation: DFS starts at A. With neighbor order preserved, it visits B before C, then D from B, and finally C. The cycle back to A is ignored.

Input: ({'A': ['B', 'B', 'C'], 'B': ['D', 'D'], 'C': [], 'D': []}, 'A', 3)

Expected Output: ['A', 'B', 'D']

Explanation: Duplicate links are ignored, and the crawl stops after 3 visited pages.

Hints

Use a stack, not recursion, so you do not depend on Python's recursion limit.
To make DFS follow the listed neighbor order, push neighbors onto the stack from right to left.

Part 2: Implement a Breadth-First Web Crawler

You are given a directed graph of web pages, where each key is a URL string and its value is a list of outgoing links from that page. Also given are a starting URL and an integer max_pages. Write a function that simulates a breadth-first web crawler and returns the URLs in the exact order they are visited. Rules: - Visit each URL at most once. - Stop as soon as max_pages URLs have been visited. - Handle cycles without looping forever. - Ignore duplicate links to the same URL. - If a URL does not appear as a key in the graph, treat it as a page with no outgoing links. - The start_url counts as the first visited page if max_pages > 0. Traversal order must be deterministic: - Use BFS with a queue. - When expanding a page, enqueue outgoing links in the same order they appear in the list.

Constraints

0 <= max_pages <= 100000
graph maps URL strings to lists of URL strings
Links may contain duplicates
The graph may contain cycles and self-loops
A URL missing from graph should be treated as having no outgoing links

Examples

Input: ({'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D', 'E'], 'D': [], 'E': []}, 'A', 10)

Expected Output: ['A', 'B', 'C', 'D', 'E']

Explanation: BFS visits all nodes one level at a time: first A, then B and C, then D and E.

Input: ({'A': ['B', 'B', 'C'], 'B': ['D'], 'C': ['D', 'E'], 'D': [], 'E': []}, 'A', 4)

Expected Output: ['A', 'B', 'C', 'D']

Explanation: Duplicate links are ignored, and the crawl stops after 4 visited pages.

Hints

Use a queue to process pages level by level from the starting URL.
Mark a URL as seen when you enqueue it, not after you dequeue it, to avoid duplicates entering the queue multiple times.

Quick Overview