What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at LinkedIn.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at LinkedIn during technical interviews.

Sample index from weighted probability distribution

Quick Overview

This question tests sampling from discrete probability distributions and randomized algorithm design: handling unnormalized weights, analyzing the expected number of trials for rejection-style approaches, and weighing trade-offs for scalable implementations.

Company: LinkedIn

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Given an array `weights[0..M-1]` representing a discrete distribution over `M` outcomes, implement a function `sampleIndex(weights)` that returns an index `i` with probability proportional to `weights[i]`. Assume all `weights[i] >= 0` and at least one weight is positive.

Follow-ups (handle in the same discussion/solution):

1) What if the values in `weights` do **not** sum to `1` (they are unnormalized)? Provide at least two ways to handle this.
2) If you use a rejection-sampling-style approach when the sum is not `1`, what is the **expected number of trials** as a function of the total sum?
3) If `M` is very large and the probability mass is highly concentrated on a small number of indices, what engineering/algorithmic optimizations would you consider? Discuss trade-offs (time, memory, preprocessing cost, update cost).
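
One possible shape for `sampleIndex` is sketched below, assuming Python and assuming cumulative sums with binary search as the approach (the question does not prescribe one). The name `sample_index` and the optional `rng` parameter are illustrative choices, not part of the original statement. Because the random draw is scaled by the total weight, the same code also covers unnormalized weights.

```python
import bisect
import itertools
import random


def sample_index(weights, rng=random):
    """Return index i with probability weights[i] / sum(weights)."""
    # prefix[i] is the total weight of indices 0..i.
    prefix = list(itertools.accumulate(weights))
    total = prefix[-1]
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Draw a point in [0, total). Redraw in the rare case that floating-point
    # rounding pushes the product up to total itself.
    target = rng.random() * total
    while target >= total:
        target = rng.random() * total

    # The first prefix sum strictly greater than target owns that point.
    # Zero-weight indices are never chosen because they add no width.
    return bisect.bisect_right(prefix, target)
```

This version rebuilds the prefix sums on every call, which costs `O(M)` per sample. If many samples are drawn from the same weights, building the prefix array once and reusing it would bring each draw down to `O(log M)`.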

Part 1: Deterministic weighted sample by ticket threshold

You are given a list of nonnegative integer `weights`. Imagine outcome `i` owns exactly `weights[i]` consecutive tickets in the range `[0, sum(weights) - 1]`. Given an integer ticket `t`, return the index that owns ticket `t`. If `t` were chosen uniformly at random, this would sample index `i` with probability proportional to `weights[i]`.

Constraints

1 <= len(weights) <= 200000
0 <= weights[i] <= 10^9
sum(weights) > 0
0 <= t < sum(weights)

Examples

Input: ([1, 3, 2], 0)

Expected Output: 0

Explanation: The ticket ranges are: index 0 -> [0], index 1 -> [1, 2, 3], index 2 -> [4, 5]. Ticket 0 belongs to index 0.

Input: ([1, 3, 2], 3)

Expected Output: 1

Explanation: Index 1 owns tickets 1 through 3, so ticket 3 belongs to index 1.

Hints

Build prefix sums so each index corresponds to an interval of ticket values.
You need the first prefix sum that is strictly greater than `t`.
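
The two hints above can be sketched as follows, assuming Python; the function name `owner_of_ticket` is hypothetical. `bisect_right` returns the position of the first prefix sum strictly greater than `t`, which is exactly the owning index.

```python
import bisect
import itertools


def owner_of_ticket(weights, t):
    """Return the index whose ticket interval contains ticket t."""
    # prefix[i] is one past the last ticket owned by index i.
    prefix = list(itertools.accumulate(weights))
    if not 0 <= t < prefix[-1]:
        raise ValueError("ticket must satisfy 0 <= t < sum(weights)")
    # First prefix sum strictly greater than t; zero-weight indices are skipped
    # because their prefix sum equals the previous one.
    return bisect.bisect_right(prefix, t)
```

For `weights = [1, 3, 2]` the prefix sums are `[1, 4, 6]`, so ticket `0` maps to index `0` and ticket `3` maps to index `1`, matching the examples.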

Part 2: Two equivalent ways to sample from unnormalized weights

The values in `weights` are nonnegative integers, but they do not necessarily sum to 1. For each rational query `u = numerator / denominator` in `[0, 1)`, compute the sampled index in two ways: (A) normalize the weights into probabilities and sample using cumulative probabilities, and (B) leave the weights unnormalized, scale the query by the total sum, and sample using cumulative weights. Return both result lists. The two methods should always agree.

Constraints

1 <= len(weights) <= 200000
0 <= weights[i] <= 10^9
sum(weights) > 0
1 <= len(queries) <= 200000
For each query `(a, b)`, representing `u = a / b`: 0 <= a < b <= 10^9

Examples

Input: ([2, 2, 2], [(0, 1), (1, 4), (1, 3), (1, 2), (2, 3), (5, 6)])

Expected Output: ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])

Explanation: The total weight is 6, so the intervals are [0, 1/3), [1/3, 2/3), and [2/3, 1). Boundary values like 1/3 and 2/3 move to the next index because the cumulative value must be strictly greater than the query.

Input: ([7], [(0, 1), (1, 2), (999, 1000)])

Expected Output: ([0, 0, 0], [0, 0, 0])

Explanation: There is only one index, so every valid query samples index 0.

Hints

Method A compares `u` against cumulative probabilities `prefix[i] / total`.
Method B compares `u * total` against cumulative raw weights, so you do not need to normalize explicitly.
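
A minimal sketch of both methods, assuming Python and using `fractions.Fraction` so that boundary queries such as `1/3` compare exactly instead of being subject to floating-point rounding. The function name `sample_two_ways` is hypothetical, and exact rational arithmetic is an assumption made here for clarity rather than speed.

```python
import bisect
import itertools
from fractions import Fraction


def sample_two_ways(weights, queries):
    """Return (method A results, method B results) for queries (a, b) = a / b."""
    prefix = list(itertools.accumulate(weights))
    total = prefix[-1]

    # Method A: normalize first, then search cumulative probabilities.
    cumulative_probs = [Fraction(p, total) for p in prefix]
    by_probability = [
        bisect.bisect_right(cumulative_probs, Fraction(a, b)) for a, b in queries
    ]

    # Method B: keep raw cumulative weights and scale the query by the total.
    by_weight = [
        bisect.bisect_right(prefix, Fraction(a, b) * total) for a, b in queries
    ]
    return by_probability, by_weight
```

Both searches ask for the first cumulative value strictly greater than the query, and multiplying both sides of that comparison by the positive `total` does not change its outcome, which is why the two result lists agree.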

Part 3: Expected number of rejection-sampling trials

Suppose outcome `i` has probability `weights[i] / scale`, and the total probability mass `S = sum(weights) / scale` satisfies `0 < S <= 1`. In one trial, you either output some index with total probability `S`, or reject with probability `1 - S` and try again. Compute the expected number of trials until the first accepted output, and return it as a reduced fraction `(numerator, denominator)`.
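
The trial process described above could be simulated as in the sketch below, assuming Python, integer `weights` and `scale`, and one uniform integer ticket in `[0, scale)` per trial. The function name `rejection_sample` and the `rng` parameter are illustrative and not part of the original statement.

```python
import bisect
import itertools
import random


def rejection_sample(weights, scale, rng=random):
    """Return (index, trials) using one uniform ticket in [0, scale) per trial."""
    prefix = list(itertools.accumulate(weights))
    total = prefix[-1]
    if not 0 < total <= scale:
        raise ValueError("need 0 < sum(weights) <= scale")

    trials = 0
    while True:
        trials += 1
        ticket = rng.randrange(scale)  # uniform integer in [0, scale)
        if ticket < total:
            # Accepted: every ticket below total is owned by some index.
            return bisect.bisect_right(prefix, ticket), trials
        # Otherwise the ticket is unowned (probability 1 - total / scale);
        # reject and draw again.
```

When `sum(weights) == scale` every ticket is owned, so the loop always returns on the first trial.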

Constraints

1 <= len(weights) <= 200000
0 <= weights[i] <= 10^9
1 <= scale <= 10^18
0 < sum(weights) <= scale

Examples

Input: ([1, 2], 10)

Expected Output: (10, 3)

Explanation: The total acceptance probability is (1 + 2) / 10 = 3/10. The expected number of trials is 1 / (3/10) = 10/3.

Input: ([2, 3, 5], 10)

Expected Output: (1, 1)

Explanation: The total acceptance probability is 10/10 = 1, so the first trial always succeeds. The expectation is 1.

Hints

A single trial succeeds with probability `S = sum(weights) / scale`.
The number of trials until first success follows a geometric distribution with expectation `1 / S`.
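
Since `1 / S = scale / sum(weights)`, the answer is that ratio reduced to lowest terms. A short sketch, assuming Python and integer inputs; the function name `expected_trials` is hypothetical.

```python
import math


def expected_trials(weights, scale):
    """Return the expected trial count 1 / S as a reduced (numerator, denominator)."""
    total = sum(weights)
    if not 0 < total <= scale:
        raise ValueError("need 0 < sum(weights) <= scale")
    # 1 / S = scale / total; divide out the common factor to reduce it.
    g = math.gcd(scale, total)
    return scale // g, total // g
```

Python integers are arbitrary precision, so a `scale` as large as `10^18` needs no special handling here; in a fixed-width language the same computation would need 64-bit integers.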

Part 4: Compressed sampler for large concentrated distributions

When `M` is very large but only a small support carries positive probability mass, a practical optimization is to compress the distribution to just the positive-weight indices. Given `weights` and many ticket queries `t`, preprocess the positive support and answer each query using only that compressed representation. Return both the support indices and the sampled original indices. This optimization is especially useful for many queries on a static distribution; the trade-off is an `O(M)` preprocessing pass, plus a full rebuild whenever the weights change.

Constraints

1 <= len(weights) <= 200000
0 <= weights[i] <= 10^9
sum(weights) > 0
1 <= len(queries) <= 200000
For every query `t`: 0 <= t < sum(weights)
The distribution is static across all queries

Examples

Input: ([0, 0, 3, 0, 2], [0, 2, 3, 4])

Expected Output: ([2, 4], [2, 2, 4, 4])

Explanation: Only indices 2 and 4 have positive weight, so the compressed support is [2, 4]. Their cumulative weights are [3, 5]. Tickets 0 and 2 fall in index 2's range, while 3 and 4 fall in index 4's range.

Input: ([7, 0, 0, 0], [0])

Expected Output: ([0], [0])

Explanation: Index 0 is the only positive-weight index, so every valid ticket maps to 0.

Hints

Indices with weight 0 can never be sampled, so they do not need to appear in the search structure.
Build prefix sums only over the positive-weight support, then binary-search each ticket there.
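
A minimal sketch of the compressed sampler, assuming Python; the function name `compressed_sample` is hypothetical. With `K` positive-weight indices, preprocessing is a single `O(M)` pass and each query is an `O(log K)` binary search over the compressed prefix sums.

```python
import bisect
import itertools


def compressed_sample(weights, queries):
    """Return (support indices, original index sampled by each ticket)."""
    # Keep only the indices that can ever be sampled.
    support = [i for i, w in enumerate(weights) if w > 0]
    # Cumulative weights over the support only.
    prefix = list(itertools.accumulate(weights[i] for i in support))
    # Search the compressed prefix sums, then map back to the original index.
    sampled = [support[bisect.bisect_right(prefix, t)] for t in queries]
    return support, sampled
```

Dropping zero-weight entries does not shift any ticket range, because those entries contribute no tickets; the compressed structure answers every query exactly as the full prefix array would.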
