How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Uber.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Uber during technical interviews.

Implement weighted sampling without replacement

Company: Uber

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Implement in Python a function sample_k(items, weights, k) that returns k items without replacement with probability proportional to weight. Requirements: (a) single pass over N up to 1e8 (streaming), (b) handles extreme weights up to 1e300 without overflow/underflow, (c) reproducible with a seed, (d) O(k) memory and near-linear time, and (e) rejects nonpositive weights gracefully. Explain and justify your algorithm (e.g., Efraimidis–Spirakis weighted reservoir using keys w_i^(1/u_i)), prove unbiasedness, discuss numerical stability, and outline tests (goodness-of-fit, seed determinism, edge cases).

Quick Answer: This question evaluates a candidate's ability to design numerically stable, streaming weighted-sampling methods and tests competencies in randomized algorithms, probability, numerical analysis, and memory/time trade-offs.

Implement `solution(items, weights, k, seed)`, which returns `k` sampled items without replacement, where each item's chance is proportional to its positive weight. The function must work in a single pass over the data and use only `O(k)` extra memory, so it should also handle very large streams. To remain numerically stable for weights as large as `1e300`, do not compute `u ** (1 / w)`. Instead, use the Efraimidis-Spirakis / exponential-race score `log(u) / w` for a random `u` in `(0, 1)`, and keep the `k` largest scores in a heap. This is unbiased because `t_i = -log(u_i) / w_i` are independent exponential clocks with rate `w_i`; the next selected item is therefore chosen with probability `w_i / sum(remaining weights)`, and the memoryless property extends this to repeated draws without replacement. Return sampled items in their original input order. If `k > N`, return all items. Raise `ValueError` for negative `k`, mismatched lengths, or nonpositive/non-finite weights. In discussion, be ready to describe goodness-of-fit tests, seed determinism, and edge-case tests.

Constraints

0 <= k
The number of elements N can be as large as 1e8, so the algorithm must be one-pass and use O(k) extra memory
Each weight must be a positive finite number; weights may be as large as 1e300
If k > N, return all items; duplicates in `items` are allowed and are treated as distinct positions

Examples

Input: (['a', 'b', 'c'], [1.0, 1.0, 1.0], 2, 0)

Expected Output: ['a', 'b']

Explanation: With seed 0, the generated priorities for equal weights rank `a` and `b` above `c`, so those two are selected.

Input: (['huge'], [1e300], 1, 123)

Expected Output: ['huge']

Explanation: A single valid item must always be returned, even with an extremely large weight.

Input: (['a', 'b', 'c'], [1.0, 2.0, 3.0], 2, 1)

Expected Output: ['b', 'c']

Explanation: Using seed 1, the scores `log(u)/w` for `b` and `c` are larger than the score for `a`, so `b` and `c` are kept. They are returned in original order.

Input: ([], [], 3, 9)

Expected Output: []

Explanation: Sampling from an empty input returns an empty list.

Input: (['x', 'y', 'z'], [5.0, 1.0, 2.0], 0, 99)

Expected Output: []

Explanation: When `k` is 0, the result is always empty after validating the weights.

Input: (['dup', 'dup', 'last'], [1.0, 4.0, 2.0], 10, 5)

Expected Output: ['dup', 'dup', 'last']

Explanation: If `k > N`, return all items in original order. Duplicate item values are preserved as separate positions.

Solution

def solution(items, weights, k, seed):
    import heapq
    import math
    import random

    if k < 0:
        raise ValueError('k must be non-negative')
    if len(items) != len(weights):
        raise ValueError('items and weights must have the same length')

    n = len(items)
    rng = random.Random(seed)
    heap = []

    for idx, (item, w) in enumerate(zip(items, weights)):
        try:
            if not math.isfinite(w) or w <= 0:
                raise ValueError('weights must be positive and finite')
        except TypeError:
            raise ValueError('weights must be positive and finite')

        if k == 0 or k >= n:
            continue

        u = rng.random()
        while u <= 0.0:
            u = rng.random()

        score = math.log(u) / w
        entry = (score, idx, item)

        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)

    if k == 0:
        return []
    if k >= n:
        return list(items)

    heap.sort(key=lambda entry: entry[1])
    return [item for _, _, item in heap]

Time complexity: O(N log min(k, N)). Space complexity: O(min(k, N)) extra space.

Hints

Avoid computing `u ** (1.0 / w)` directly; comparing `log(u) / w` gives the same ordering and is much more stable for huge weights.
Use a size-k min-heap to keep only the current best k candidates while scanning the stream once.

Quick Overview

This question evaluates a candidate's ability to design numerically stable, streaming weighted-sampling methods and tests competencies in randomized algorithms, probability, numerical analysis, and memory/time trade-offs.