Implement weighted sampling without replacement
Company: Uber
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's ability to design numerically stable, streaming weighted-sampling methods and tests competencies in randomized algorithms, probability, numerical analysis, and memory/time trade-offs.
Constraints
- 0 <= k
- The number of elements N can be as large as 1e8, so the algorithm must be one-pass and use O(k) extra memory
- Each weight must be a positive finite number; weights may be as large as 1e300
- If k > N, return all items; duplicates in `items` are allowed and are treated as distinct positions
Examples
Input: (['a', 'b', 'c'], [1.0, 1.0, 1.0], 2, 0)
Expected Output: ['a', 'b']
Explanation: With seed 0, the generated priorities for equal weights rank `a` and `b` above `c`, so those two are selected.
Input: (['huge'], [1e300], 1, 123)
Expected Output: ['huge']
Explanation: A single valid item must always be returned, even with an extremely large weight.
Input: (['a', 'b', 'c'], [1.0, 2.0, 3.0], 2, 1)
Expected Output: ['b', 'c']
Explanation: Using seed 1, the scores `log(u)/w` for `b` and `c` are larger than the score for `a`, so `b` and `c` are kept. They are returned in original order.
Input: ([], [], 3, 9)
Expected Output: []
Explanation: Sampling from an empty input returns an empty list.
Input: (['x', 'y', 'z'], [5.0, 1.0, 2.0], 0, 99)
Expected Output: []
Explanation: When `k` is 0, the result is always empty after validating the weights.
Input: (['dup', 'dup', 'last'], [1.0, 4.0, 2.0], 10, 5)
Expected Output: ['dup', 'dup', 'last']
Explanation: If `k > N`, return all items in original order. Duplicate item values are preserved as separate positions.
Solution
def solution(items, weights, k, seed):
import heapq
import math
import random
if k < 0:
raise ValueError('k must be non-negative')
if len(items) != len(weights):
raise ValueError('items and weights must have the same length')
n = len(items)
rng = random.Random(seed)
heap = []
for idx, (item, w) in enumerate(zip(items, weights)):
try:
if not math.isfinite(w) or w <= 0:
raise ValueError('weights must be positive and finite')
except TypeError:
raise ValueError('weights must be positive and finite')
if k == 0 or k >= n:
continue
u = rng.random()
while u <= 0.0:
u = rng.random()
score = math.log(u) / w
entry = (score, idx, item)
if len(heap) < k:
heapq.heappush(heap, entry)
elif score > heap[0][0]:
heapq.heapreplace(heap, entry)
if k == 0:
return []
if k >= n:
return list(items)
heap.sort(key=lambda entry: entry[1])
return [item for _, _, item in heap]Time complexity: O(N log min(k, N)). Space complexity: O(min(k, N)) extra space.
Hints
- Avoid computing `u ** (1.0 / w)` directly; comparing `log(u) / w` gives the same ordering and is much more stable for huge weights.
- Use a size-k min-heap to keep only the current best k candidates while scanning the stream once.