Implement a Web Crawler with BFS and DFS
Company: Google
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's ability to implement graph traversal algorithms (BFS and DFS), manage visited-state to prevent cycles and duplicates, and reason about recursion limits, immutable versus mutable state, error handling, and concurrency concerns in a web-crawling context.
Part 1: Implement a Depth-First Web Crawler
Constraints
- 0 <= max_pages <= 100000
- graph maps URL strings to lists of URL strings
- Links may contain duplicates
- The graph may contain cycles and self-loops
- A URL missing from graph should be treated as having no outgoing links
Examples
Input: ({'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D'], 'D': []}, 'A', 10)
Expected Output: ['A', 'B', 'D', 'C']
Explanation: DFS starts at A. With neighbor order preserved, it visits B before C, then D from B, and finally C. The cycle back to A is ignored.
Input: ({'A': ['B', 'B', 'C'], 'B': ['D', 'D'], 'C': [], 'D': []}, 'A', 3)
Expected Output: ['A', 'B', 'D']
Explanation: Duplicate links are ignored, and the crawl stops after 3 visited pages.
Input: ({'A': ['B'], 'B': ['A']}, 'A', 0)
Expected Output: []
Explanation: Edge case: when max_pages is 0, nothing is visited.
Input: ({'A': ['B']}, 'X', 2)
Expected Output: ['X']
Explanation: Edge case: start_url is not in the graph, so it is still visited once and has no outgoing links.
Hints
- Use a stack, not recursion, so you do not depend on Python's recursion limit.
- To make DFS follow the listed neighbor order, push neighbors onto the stack from right to left.
Part 2: Implement a Breadth-First Web Crawler
Constraints
- 0 <= max_pages <= 100000
- graph maps URL strings to lists of URL strings
- Links may contain duplicates
- The graph may contain cycles and self-loops
- A URL missing from graph should be treated as having no outgoing links
Examples
Input: ({'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D', 'E'], 'D': [], 'E': []}, 'A', 10)
Expected Output: ['A', 'B', 'C', 'D', 'E']
Explanation: BFS visits all nodes one level at a time: first A, then B and C, then D and E.
Input: ({'A': ['B', 'B', 'C'], 'B': ['D'], 'C': ['D', 'E'], 'D': [], 'E': []}, 'A', 4)
Expected Output: ['A', 'B', 'C', 'D']
Explanation: Duplicate links are ignored, and the crawl stops after 4 visited pages.
Input: ({'A': ['B'], 'B': ['A']}, 'A', 0)
Expected Output: []
Explanation: Edge case: when max_pages is 0, nothing is visited.
Input: ({}, 'X', 3)
Expected Output: ['X']
Explanation: Edge case: an empty graph still allows the starting URL to be visited once.
Hints
- Use a queue to process pages level by level from the starting URL.
- Mark a URL as seen when you enqueue it, not after you dequeue it, to avoid duplicates entering the queue multiple times.