How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Glean.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Glean during technical interviews.

Implement Rate-Limited Wikipedia Crawler

Company: Glean

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Implement a rate-limited Wikipedia crawler. You may use an AI coding assistant during the interview, such as Cursor or Claude Code. You are expected to understand, modify, and explain the generated code. Requirements: - Start from a given Wikipedia page title, for example `Google`. - Fetch the page and extract outgoing Wikipedia links. - Maintain a set of visited page titles. - Track the first character of each visited page title, considering the English letters `A` through `Z` case-insensitively. - When choosing the next page to crawl, prioritize pages whose first character has not yet appeared among visited pages. - For example, if the start page is `Google`, then `G` has already appeared, so links whose titles start with `G` should not be prioritized while other unseen starting letters remain available. - Once all 26 starting letters have appeared, choose additional pages to crawl randomly from the remaining frontier. - Respect a configurable rate limit, such as at most `R` requests per second. - Avoid crawling the same page twice. - Handle common failures such as invalid pages, duplicate links, HTTP errors, and empty frontiers. Design and implement the crawler interface. You may assume helper functions for fetching page HTML or API responses if necessary, but the crawling logic, prioritization, deduplication, and rate limiting should be implemented clearly.

Quick Answer: This question evaluates a candidate's ability to implement a rate-limited web crawler, testing skills in crawl frontier management, prioritization by page-title initial, deduplication, rate limiting, and robust error handling.

Implement a function that simulates the core logic of a Wikipedia crawler over a provided page graph. The input `pages` maps a valid page title to the list of outgoing page titles found on that page. If a crawled title is missing from `pages`, treat it as a failed fetch (invalid page, HTTP error, deleted page, etc.) that produces no outgoing links. Start from `start`, crawl at most `max_pages` titles, never crawl the same title twice, and ignore duplicate links. Track the first character of every crawled title: if it is an English letter, convert it to uppercase and add it to the set of seen letters. When choosing the next title from the frontier, first prefer titles whose starting letter has not been seen yet. If multiple such titles exist, choose the lexicographically smallest one. If no such title exists and fewer than 26 letters have been seen, choose the lexicographically smallest title in the frontier. Once all 26 letters A-Z have been seen, choose from the lexicographically sorted frontier using `index = seed % len(frontier)`, then update `seed = (seed * 1103515245 + 12345) % 2**31` for the next such choice. Respect the rate limit by scheduling the i-th request at time `i / rate_limit` seconds (0-indexed), rounded to 6 decimal places. Return the crawl order, request times, and sorted seen letters.

Constraints

0 <= max_pages <= 2000
1 <= rate_limit <= 10^6
Each page title is a non-empty string
The total number of outgoing links across all entries in `pages` is at most 10000
Linked titles may be missing from `pages` and should be treated as failed fetches
Outgoing link lists may contain duplicates

Examples

Input: ({'Google': ['gmail', 'Maps', 'Google', 'Amazon', 'gmail'], 'Amazon': ['Zoom', 'apple'], 'Maps': ['Amazon', 'Zoom'], 'Zoom': [], 'apple': []}, 'Google', 6, 2, 0)

Expected Output: {'order': ['Google', 'Amazon', 'Maps', 'Zoom', 'apple', 'gmail'], 'request_times': [0.0, 0.5, 1.0, 1.5, 2.0, 2.5], 'seen_letters': ['A', 'G', 'M', 'Z']}

Explanation: Start with Google, so G is seen. From its links, Amazon and Maps are prioritized over gmail because A and M are unseen while G is already seen. Duplicate links and the self-link to Google are ignored. After Zoom, only apple and gmail remain, and neither introduces a new letter, so lexicographic order picks apple before gmail. The title gmail is missing from `pages`, so it is crawled successfully as an attempt but yields no outgoing links.

Input: ({'Alpha': ['Beta'], 'Beta': []}, 'Ghost', 3, 1, 7)

Expected Output: {'order': ['Ghost'], 'request_times': [0.0], 'seen_letters': ['G']}

Explanation: Ghost is crawled first even though it is missing from `pages`. Missing pages produce no outgoing links, so the frontier becomes empty immediately.

Input: ({'1Start': ['!Bang', 'alpha', 'Alpha', '!Bang'], '!Bang': ['beta'], 'Alpha': ['beta'], 'beta': []}, '1Start', 4, 3, 99)

Expected Output: {'order': ['1Start', 'Alpha', 'beta', '!Bang'], 'request_times': [0.0, 0.333333, 0.666667, 1.0], 'seen_letters': ['A', 'B']}

Explanation: The title 1Start does not contribute a letter. Alpha is chosen before alpha because both introduce A but Alpha is lexicographically smaller. Then beta is prioritized because B is unseen. After that, neither !Bang nor alpha introduces a new letter, so the lexicographically smaller !Bang is chosen.

Input: ({}, 'Anything', 0, 5, 0)

Expected Output: {'order': [], 'request_times': [], 'seen_letters': []}

Explanation: If `max_pages` is 0, nothing is crawled.

Hints

Keep a frontier of discovered-but-not-yet-crawled titles separate from the visited set so you can deduplicate efficiently.
Write a helper that maps the first character of a title to `A`-`Z` or returns `None`; this makes prioritization and seen-letter tracking much simpler.

Quick Overview

This question evaluates a candidate's ability to implement a rate-limited web crawler, testing skills in crawl frontier management, prioritization by page-title initial, deduplication, rate limiting, and robust error handling.