PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's ability to design numerically stable, streaming weighted-sampling methods and tests competencies in randomized algorithms, probability, numerical analysis, and memory/time trade-offs.

  • Medium
  • Uber
  • Coding & Algorithms
  • Data Scientist

Implement weighted sampling without replacement

Company: Uber

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Implement in Python a function sample_k(items, weights, k) that returns k items without replacement with probability proportional to weight. Requirements: (a) single pass over N up to 1e8 (streaming), (b) handles extreme weights up to 1e300 without overflow/underflow, (c) reproducible with a seed, (d) O(k) memory and near-linear time, and (e) rejects nonpositive weights gracefully. Explain and justify your algorithm (e.g., Efraimidis–Spirakis weighted reservoir using keys w_i^(1/u_i)), prove unbiasedness, discuss numerical stability, and outline tests (goodness-of-fit, seed determinism, edge cases).

Quick Answer: This question evaluates a candidate's ability to design numerically stable, streaming weighted-sampling methods and tests competencies in randomized algorithms, probability, numerical analysis, and memory/time trade-offs.

Implement `solution(items, weights, k, seed)`, which returns `k` sampled items without replacement, where each item's chance is proportional to its positive weight. The function must work in a single pass over the data and use only `O(k)` extra memory, so it should also handle very large streams. To remain numerically stable for weights as large as `1e300`, do not compute `u ** (1 / w)`. Instead, use the Efraimidis-Spirakis / exponential-race score `log(u) / w` for a random `u` in `(0, 1)`, and keep the `k` largest scores in a heap. This is unbiased because `t_i = -log(u_i) / w_i` are independent exponential clocks with rate `w_i`; the next selected item is therefore chosen with probability `w_i / sum(remaining weights)`, and the memoryless property extends this to repeated draws without replacement. Return sampled items in their original input order. If `k > N`, return all items. Raise `ValueError` for negative `k`, mismatched lengths, or nonpositive/non-finite weights. In discussion, be ready to describe goodness-of-fit tests, seed determinism, and edge-case tests.

Constraints

  • 0 <= k
  • The number of elements N can be as large as 1e8, so the algorithm must be one-pass and use O(k) extra memory
  • Each weight must be a positive finite number; weights may be as large as 1e300
  • If k > N, return all items; duplicates in `items` are allowed and are treated as distinct positions

Examples

Input: (['a', 'b', 'c'], [1.0, 1.0, 1.0], 2, 0)

Expected Output: ['a', 'b']

Explanation: With seed 0, the generated priorities for equal weights rank `a` and `b` above `c`, so those two are selected.

Input: (['huge'], [1e300], 1, 123)

Expected Output: ['huge']

Explanation: A single valid item must always be returned, even with an extremely large weight.

Input: (['a', 'b', 'c'], [1.0, 2.0, 3.0], 2, 1)

Expected Output: ['b', 'c']

Explanation: Using seed 1, the scores `log(u)/w` for `b` and `c` are larger than the score for `a`, so `b` and `c` are kept. They are returned in original order.

Input: ([], [], 3, 9)

Expected Output: []

Explanation: Sampling from an empty input returns an empty list.

Input: (['x', 'y', 'z'], [5.0, 1.0, 2.0], 0, 99)

Expected Output: []

Explanation: When `k` is 0, the result is always empty after validating the weights.

Input: (['dup', 'dup', 'last'], [1.0, 4.0, 2.0], 10, 5)

Expected Output: ['dup', 'dup', 'last']

Explanation: If `k > N`, return all items in original order. Duplicate item values are preserved as separate positions.

Solution

def solution(items, weights, k, seed):
    import heapq
    import math
    import random

    if k < 0:
        raise ValueError('k must be non-negative')
    if len(items) != len(weights):
        raise ValueError('items and weights must have the same length')

    n = len(items)
    rng = random.Random(seed)
    heap = []

    for idx, (item, w) in enumerate(zip(items, weights)):
        try:
            if not math.isfinite(w) or w <= 0:
                raise ValueError('weights must be positive and finite')
        except TypeError:
            raise ValueError('weights must be positive and finite')

        if k == 0 or k >= n:
            continue

        u = rng.random()
        while u <= 0.0:
            u = rng.random()

        score = math.log(u) / w
        entry = (score, idx, item)

        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)

    if k == 0:
        return []
    if k >= n:
        return list(items)

    heap.sort(key=lambda entry: entry[1])
    return [item for _, _, item in heap]

Time complexity: O(N log min(k, N)). Space complexity: O(min(k, N)) extra space.

Hints

  1. Avoid computing `u ** (1.0 / w)` directly; comparing `log(u) / w` gives the same ordering and is much more stable for huge weights.
  2. Use a size-k min-heap to keep only the current best k candidates while scanning the stream once.
Last updated: May 9, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Maximize Throughput and Count Trigger Components - Uber (medium)
  • Replace Dashes With Nearest Letters - Uber (medium)
  • Find Earliest Column With One - Uber (easy)
  • Solve Wonderful Strings and Grid Queries - Uber (hard)
  • Count Islands After Land Additions - Uber (medium)