PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's ability to implement a rate-limited web crawler, testing skills in crawl frontier management, prioritization by page-title initial, deduplication, rate limiting, and robust error handling.

  • medium
  • Glean
  • Coding & Algorithms
  • Machine Learning Engineer

Implement Rate-Limited Wikipedia Crawler

Company: Glean

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Implement a rate-limited Wikipedia crawler. You may use an AI coding assistant during the interview, such as Cursor or Claude Code. You are expected to understand, modify, and explain the generated code. Requirements: - Start from a given Wikipedia page title, for example `Google`. - Fetch the page and extract outgoing Wikipedia links. - Maintain a set of visited page titles. - Track the first character of each visited page title, considering the English letters `A` through `Z` case-insensitively. - When choosing the next page to crawl, prioritize pages whose first character has not yet appeared among visited pages. - For example, if the start page is `Google`, then `G` has already appeared, so links whose titles start with `G` should not be prioritized while other unseen starting letters remain available. - Once all 26 starting letters have appeared, choose additional pages to crawl randomly from the remaining frontier. - Respect a configurable rate limit, such as at most `R` requests per second. - Avoid crawling the same page twice. - Handle common failures such as invalid pages, duplicate links, HTTP errors, and empty frontiers. Design and implement the crawler interface. You may assume helper functions for fetching page HTML or API responses if necessary, but the crawling logic, prioritization, deduplication, and rate limiting should be implemented clearly.

Quick Answer: This question evaluates a candidate's ability to implement a rate-limited web crawler, testing skills in crawl frontier management, prioritization by page-title initial, deduplication, rate limiting, and robust error handling.

Implement a function that simulates the core logic of a Wikipedia crawler over a provided page graph. The input `pages` maps a valid page title to the list of outgoing page titles found on that page. If a crawled title is missing from `pages`, treat it as a failed fetch (invalid page, HTTP error, deleted page, etc.) that produces no outgoing links. Start from `start`, crawl at most `max_pages` titles, never crawl the same title twice, and ignore duplicate links. Track the first character of every crawled title: if it is an English letter, convert it to uppercase and add it to the set of seen letters. When choosing the next title from the frontier, first prefer titles whose starting letter has not been seen yet. If multiple such titles exist, choose the lexicographically smallest one. If no such title exists and fewer than 26 letters have been seen, choose the lexicographically smallest title in the frontier. Once all 26 letters A-Z have been seen, choose from the lexicographically sorted frontier using `index = seed % len(frontier)`, then update `seed = (seed * 1103515245 + 12345) % 2**31` for the next such choice. Respect the rate limit by scheduling the i-th request at time `i / rate_limit` seconds (0-indexed), rounded to 6 decimal places. Return the crawl order, request times, and sorted seen letters.

Constraints

  • 0 <= max_pages <= 2000
  • 1 <= rate_limit <= 10^6
  • Each page title is a non-empty string
  • The total number of outgoing links across all entries in `pages` is at most 10000
  • Linked titles may be missing from `pages` and should be treated as failed fetches
  • Outgoing link lists may contain duplicates

Examples

Input: ({'Google': ['gmail', 'Maps', 'Google', 'Amazon', 'gmail'], 'Amazon': ['Zoom', 'apple'], 'Maps': ['Amazon', 'Zoom'], 'Zoom': [], 'apple': []}, 'Google', 6, 2, 0)

Expected Output: {'order': ['Google', 'Amazon', 'Maps', 'Zoom', 'apple', 'gmail'], 'request_times': [0.0, 0.5, 1.0, 1.5, 2.0, 2.5], 'seen_letters': ['A', 'G', 'M', 'Z']}

Explanation: Start with Google, so G is seen. From its links, Amazon and Maps are prioritized over gmail because A and M are unseen while G is already seen. Duplicate links and the self-link to Google are ignored. After Zoom, only apple and gmail remain, and neither introduces a new letter, so lexicographic order picks apple before gmail. The title gmail is missing from `pages`, so it is crawled successfully as an attempt but yields no outgoing links.

Input: ({'Alpha': ['Beta'], 'Beta': []}, 'Ghost', 3, 1, 7)

Expected Output: {'order': ['Ghost'], 'request_times': [0.0], 'seen_letters': ['G']}

Explanation: Ghost is crawled first even though it is missing from `pages`. Missing pages produce no outgoing links, so the frontier becomes empty immediately.

Input: ({'1Start': ['!Bang', 'alpha', 'Alpha', '!Bang'], '!Bang': ['beta'], 'Alpha': ['beta'], 'beta': []}, '1Start', 4, 3, 99)

Expected Output: {'order': ['1Start', 'Alpha', 'beta', '!Bang'], 'request_times': [0.0, 0.333333, 0.666667, 1.0], 'seen_letters': ['A', 'B']}

Explanation: The title 1Start does not contribute a letter. Alpha is chosen before alpha because both introduce A but Alpha is lexicographically smaller. Then beta is prioritized because B is unseen. After that, neither !Bang nor alpha introduces a new letter, so the lexicographically smaller !Bang is chosen.

Input: ({}, 'Anything', 0, 5, 0)

Expected Output: {'order': [], 'request_times': [], 'seen_letters': []}

Explanation: If `max_pages` is 0, nothing is crawled.

Hints

  1. Keep a frontier of discovered-but-not-yet-crawled titles separate from the visited set so you can deduplicate efficiently.
  2. Write a helper that maps the first character of a title to `A`-`Z` or returns `None`; this makes prioritization and seen-letter tracking much simpler.
Last updated: May 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Return Top Department Suggestions - Glean (medium)
  • Find Earliest Train Route - Glean (medium)
  • Search Words in a Character Grid - Glean (hard)
  • Find the Kth Largest in Two Sorted Arrays - Glean (medium)
  • Implement 2048 Game Logic - Glean (medium)