Sample index from weighted probability distribution
Company: LinkedIn
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates understanding of sampling from discrete probability distributions, randomized algorithm design, handling unnormalized weights and expected-value analysis for rejection-style approaches, along with trade-offs for scalable implementations.
Part 1: Deterministic weighted sample by ticket threshold
Constraints
- 1 <= len(weights) <= 200000
- 0 <= weights[i] <= 10^9
- sum(weights) > 0
- 0 <= t < sum(weights)
Examples
Input: ([1, 3, 2], 0)
Expected Output: 0
Explanation: The ticket ranges are: index 0 -> [0], index 1 -> [1, 2, 3], index 2 -> [4, 5]. Ticket 0 belongs to index 0.
Input: ([1, 3, 2], 3)
Expected Output: 1
Explanation: Index 1 owns tickets 1 through 3, so ticket 3 belongs to index 1.
Input: ([1, 3, 2], 4)
Expected Output: 2
Explanation: Index 2 owns tickets 4 and 5, so ticket 4 belongs to index 2.
Input: ([0, 2, 0, 3], 2)
Expected Output: 3
Explanation: Index 0 and 2 own no tickets. Index 1 owns tickets 0 and 1, and index 3 owns tickets 2, 3, and 4. So ticket 2 belongs to index 3.
Input: ([5], 4)
Expected Output: 0
Explanation: There is only one index, and it owns tickets 0 through 4.
Input: ([], 0)
Expected Output: -1
Explanation: There are no weights and no valid tickets, so return -1.
Input: ([0, 0, 0], 0)
Expected Output: -1
Explanation: The total number of tickets is 0, so no ticket is valid.
Input: ([2, 1], 3)
Expected Output: -1
Explanation: The valid ticket range is 0 through 2. Ticket 3 is out of range.
Hints
- Build prefix sums so each index corresponds to an interval of ticket values.
- You need the first prefix sum that is strictly greater than `t`.
Part 2: Two equivalent ways to sample from unnormalized weights
Constraints
- 1 <= len(weights) <= 200000
- 0 <= weights[i] <= 10^9
- sum(weights) > 0
- 1 <= len(queries) <= 200000
- For each query `(a, b)`: 0 <= a < b <= 10^9
Examples
Input: ([2, 2, 2], [(0, 1), (1, 4), (1, 3), (1, 2), (2, 3), (5, 6)])
Expected Output: ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
Explanation: The total weight is 6, so the intervals are [0, 1/3), [1/3, 2/3), and [2/3, 1). Boundary values like 1/3 and 2/3 move to the next index because the cumulative value must be strictly greater than the query.
Input: ([7], [(0, 1), (1, 2), (999, 1000)])
Expected Output: ([0, 0, 0], [0, 0, 0])
Explanation: There is only one index, so every valid query samples index 0.
Input: ([0, 5, 0, 5], [(0, 1), (49, 100), (1, 2), (99, 100)])
Expected Output: ([1, 1, 3, 3], [1, 1, 3, 3])
Explanation: Zero weights create empty intervals. Query 1/2 lands exactly on the boundary after the second positive block, so the answer becomes index 3.
Input: ([1, 3, 2, 4], [(0, 1), (1, 10), (39, 100), (3, 5), (9, 10)])
Expected Output: ([0, 1, 1, 3, 3], [0, 1, 1, 3, 3])
Explanation: The prefix sums are [1, 4, 6, 10]. For example, 1/10 corresponds to threshold 1, so the first prefix strictly greater than 1 is 4 at index 1.
Input: ([3, 1], [])
Expected Output: ([], [])
Explanation: With no queries, both result lists are empty.
Hints
- Method A compares `u` against cumulative probabilities `prefix[i] / total`.
- Method B compares `u * total` against cumulative raw weights, so you do not need to normalize explicitly.
Part 3: Expected number of rejection-sampling trials
Constraints
- 1 <= len(weights) <= 200000
- 0 <= weights[i] <= 10^9
- 1 <= scale <= 10^18
- 0 < sum(weights) <= scale
Examples
Input: ([1, 2], 10)
Expected Output: (10, 3)
Explanation: The total acceptance probability is (1 + 2) / 10 = 3/10. The expected number of trials is 1 / (3/10) = 10/3.
Input: ([2, 3, 5], 10)
Expected Output: (1, 1)
Explanation: The total acceptance probability is 10/10 = 1, so the first trial always succeeds. The expectation is 1.
Input: ([0, 4, 0], 8)
Expected Output: (2, 1)
Explanation: The total acceptance probability is 4/8 = 1/2. The expected number of trials is 1 / (1/2) = 2.
Input: ([6], 9)
Expected Output: (3, 2)
Explanation: With a single outcome, the acceptance probability is 6/9 = 2/3. The expected number of trials is 1 / (2/3) = 3/2.
Input: ([7, 7], 28)
Expected Output: (2, 1)
Explanation: The total acceptance probability is 14/28 = 1/2. The expected number of trials is 2.
Hints
- A single trial succeeds with probability `S = sum(weights) / scale`.
- The number of trials until first success follows a geometric distribution with expectation `1 / S`.
Part 4: Compressed sampler for large concentrated distributions
Constraints
- 1 <= len(weights) <= 200000
- 0 <= weights[i] <= 10^9
- sum(weights) > 0
- 1 <= len(queries) <= 200000
- For every query `t`: 0 <= t < sum(weights)
- The distribution is static across all queries
Examples
Input: ([0, 0, 3, 0, 2], [0, 2, 3, 4])
Expected Output: ([2, 4], [2, 2, 4, 4])
Explanation: Only indices 2 and 4 have positive weight, so the compressed support is [2, 4]. Their cumulative weights are [3, 5]. Tickets 0 and 2 fall in index 2's range, while 3 and 4 fall in index 4's range.
Input: ([7, 0, 0, 0], [0])
Expected Output: ([0], [0])
Explanation: Index 0 is the only positive-weight index, so every valid ticket maps to 0.
Input: ([], [])
Expected Output: ([], [])
Explanation: With no weights and no queries, both the support and sampled outputs are empty.
Input: ([1, 0, 1, 0, 1], [0, 1, 2])
Expected Output: ([0, 2, 4], [0, 2, 4])
Explanation: Each positive index has weight 1, so the three ticket positions map directly to indices 0, 2, and 4.
Input: ([0, 5, 0, 2], [4, 5, 6])
Expected Output: ([1, 3], [1, 3, 3])
Explanation: Compressed support is [1, 3] with cumulative weights [5, 7]. Ticket 4 still belongs to index 1, while tickets 5 and 6 belong to index 3.
Hints
- Indices with weight 0 can never be sampled, so they do not need to appear in the search structure.
- Build prefix sums only over the positive-weight support, then binary-search each ticket there.