Implement Rate-Limited Wikipedia Crawler
Company: Glean
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's ability to implement a rate-limited web crawler, testing skills in crawl frontier management, prioritization by page-title initial, deduplication, rate limiting, and robust error handling.
Constraints
- 0 <= max_pages <= 2000
- 1 <= rate_limit <= 10^6
- Each page title is a non-empty string
- The total number of outgoing links across all entries in `pages` is at most 10000
- Linked titles may be missing from `pages` and should be treated as failed fetches
- Outgoing link lists may contain duplicates
Examples
Input: ({'Google': ['gmail', 'Maps', 'Google', 'Amazon', 'gmail'], 'Amazon': ['Zoom', 'apple'], 'Maps': ['Amazon', 'Zoom'], 'Zoom': [], 'apple': []}, 'Google', 6, 2, 0)
Expected Output: {'order': ['Google', 'Amazon', 'Maps', 'Zoom', 'apple', 'gmail'], 'request_times': [0.0, 0.5, 1.0, 1.5, 2.0, 2.5], 'seen_letters': ['A', 'G', 'M', 'Z']}
Explanation: Start with Google, so G is seen. From its links, Amazon and Maps are prioritized over gmail because A and M are unseen while G is already seen. Duplicate links and the self-link to Google are ignored. After Zoom, only apple and gmail remain, and neither introduces a new letter, so lexicographic order picks apple before gmail. The title gmail is missing from `pages`, so it is crawled successfully as an attempt but yields no outgoing links.
Input: ({'Alpha': ['Beta'], 'Beta': []}, 'Ghost', 3, 1, 7)
Expected Output: {'order': ['Ghost'], 'request_times': [0.0], 'seen_letters': ['G']}
Explanation: Ghost is crawled first even though it is missing from `pages`. Missing pages produce no outgoing links, so the frontier becomes empty immediately.
Input: ({'1Start': ['!Bang', 'alpha', 'Alpha', '!Bang'], '!Bang': ['beta'], 'Alpha': ['beta'], 'beta': []}, '1Start', 4, 3, 99)
Expected Output: {'order': ['1Start', 'Alpha', 'beta', '!Bang'], 'request_times': [0.0, 0.333333, 0.666667, 1.0], 'seen_letters': ['A', 'B']}
Explanation: The title 1Start does not contribute a letter. Alpha is chosen before alpha because both introduce A but Alpha is lexicographically smaller. Then beta is prioritized because B is unseen. After that, neither !Bang nor alpha introduces a new letter, so the lexicographically smaller !Bang is chosen.
Input: ({}, 'Anything', 0, 5, 0)
Expected Output: {'order': [], 'request_times': [], 'seen_letters': []}
Explanation: If `max_pages` is 0, nothing is crawled.
Hints
- Keep a frontier of discovered-but-not-yet-crawled titles separate from the visited set so you can deduplicate efficiently.
- Write a helper that maps the first character of a title to `A`-`Z` or returns `None`; this makes prioritization and seen-letter tracking much simpler.