PracHub

Quick Overview

This question evaluates understanding of sampling from discrete probability distributions, randomized algorithm design, handling unnormalized weights, expected-value analysis for rejection-style approaches, and trade-offs in scalable implementations.


Sample index from weighted probability distribution

Company: LinkedIn

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Given an array `weights[0..M-1]` representing a discrete distribution over `M` outcomes, implement a function `sampleIndex(weights)` that returns an index `i` with probability proportional to `weights[i]`. Assume all `weights[i] >= 0` and at least one weight is positive. Follow-ups (handle in the same discussion/solution):

  1. What if the values in `weights` do **not** sum to `1` (they are unnormalized)? Provide at least two ways to handle this.
  2. If you use a rejection-sampling-style approach when the sum is not `1`, what is the **expected number of trials** as a function of the total sum?
  3. If `M` is very large and the probability mass is highly concentrated on a small number of indices, what engineering/algorithmic optimizations would you consider? Discuss trade-offs (time, memory, preprocessing cost, update cost).
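One common implementation is the cumulative-sum method, sketched below in Python. The snake_case name `sample_index` is my rendering of the prompt's `sampleIndex`; the sketch handles unnormalized nonnegative weights directly by drawing uniformly in `[0, total)` rather than normalizing first.

```python
import bisect
import random

def sample_index(weights):
    """Return index i with probability weights[i] / sum(weights).

    Handles unnormalized weights (follow-up 1) without dividing:
    draw u uniformly in [0, total) instead of [0, 1).
    """
    prefix = []
    total = 0.0
    for w in weights:
        total += w
        prefix.append(total)
    if total <= 0:
        raise ValueError("need at least one positive weight")
    u = random.random() * total
    # The sampled index is the first prefix sum strictly greater than u;
    # zero-weight indices get empty intervals and are never chosen.
    # The min() clamp guards against float rounding pushing u to total.
    return min(bisect.bisect_right(prefix, u), len(prefix) - 1)
```

The preprocessing is O(M) and each draw is O(log M) via binary search; the deterministic parts below isolate each piece of this routine.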


Part 1: Deterministic weighted sample by ticket threshold

You are given a list of nonnegative integer `weights`. Imagine outcome `i` owns exactly `weights[i]` consecutive tickets in the range `[0, sum(weights) - 1]`. Given an integer ticket `t`, return the index that owns ticket `t`. If `t` were chosen uniformly at random, this would sample index `i` with probability proportional to `weights[i]`.

Constraints

  • 1 <= len(weights) <= 200000
  • 0 <= weights[i] <= 10^9
  • sum(weights) > 0
  • 0 <= t < sum(weights)

Examples

Input: ([1, 3, 2], 0)

Expected Output: 0

Explanation: The ticket ranges are: index 0 -> [0], index 1 -> [1, 2, 3], index 2 -> [4, 5]. Ticket 0 belongs to index 0.

Input: ([1, 3, 2], 3)

Expected Output: 1

Explanation: Index 1 owns tickets 1 through 3, so ticket 3 belongs to index 1.

Input: ([1, 3, 2], 4)

Expected Output: 2

Explanation: Index 2 owns tickets 4 and 5, so ticket 4 belongs to index 2.

Input: ([0, 2, 0, 3], 2)

Expected Output: 3

Explanation: Index 0 and 2 own no tickets. Index 1 owns tickets 0 and 1, and index 3 owns tickets 2, 3, and 4. So ticket 2 belongs to index 3.

Input: ([5], 4)

Expected Output: 0

Explanation: There is only one index, and it owns tickets 0 through 4.

Input: ([], 0)

Expected Output: -1

Explanation: There are no weights and no valid tickets, so return -1.

Input: ([0, 0, 0], 0)

Expected Output: -1

Explanation: The total number of tickets is 0, so no ticket is valid.

Input: ([2, 1], 3)

Expected Output: -1

Explanation: The valid ticket range is 0 through 2. Ticket 3 is out of range.

Hints

  1. Build prefix sums so each index corresponds to an interval of ticket values.
  2. You need the first prefix sum that is strictly greater than `t`.
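The hints above translate directly into a short routine; here is one sketch in Python (the name `ticket_owner` is mine, not part of the prompt), using `bisect_right` to find the first prefix sum strictly greater than `t`.

```python
import bisect

def ticket_owner(weights, t):
    """Return the index owning ticket t, or -1 if t is out of range."""
    # prefix[i] = total tickets owned by indices 0..i, so index i owns
    # the half-open ticket range [prefix[i-1], prefix[i]).
    prefix = []
    total = 0
    for w in weights:
        total += w
        prefix.append(total)
    if t < 0 or t >= total:
        return -1  # covers empty weights, all-zero weights, and t too large
    # First prefix sum strictly greater than t identifies the owner;
    # zero-weight indices have empty ranges and are skipped automatically.
    return bisect.bisect_right(prefix, t)
```

Preprocessing is O(M); each ticket lookup is O(log M).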

Part 2: Two equivalent ways to sample from unnormalized weights

The values in `weights` are nonnegative integers, but they do not necessarily sum to 1. For each rational query `u = numerator / denominator` in `[0, 1)`, compute the sampled index in two ways: (A) normalize the weights into probabilities and sample using cumulative probabilities, and (B) leave the weights unnormalized, scale the query by the total sum, and sample using cumulative weights. Return both result lists. The two methods should always agree.

Constraints

  • 1 <= len(weights) <= 200000
  • 0 <= weights[i] <= 10^9
  • sum(weights) > 0
  • 1 <= len(queries) <= 200000
  • For each query `(a, b)`: 0 <= a < b <= 10^9

Examples

Input: ([2, 2, 2], [(0, 1), (1, 4), (1, 3), (1, 2), (2, 3), (5, 6)])

Expected Output: ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])

Explanation: The total weight is 6, so the intervals are [0, 1/3), [1/3, 2/3), and [2/3, 1). Boundary values like 1/3 and 2/3 move to the next index because the cumulative value must be strictly greater than the query.

Input: ([7], [(0, 1), (1, 2), (999, 1000)])

Expected Output: ([0, 0, 0], [0, 0, 0])

Explanation: There is only one index, so every valid query samples index 0.

Input: ([0, 5, 0, 5], [(0, 1), (49, 100), (1, 2), (99, 100)])

Expected Output: ([1, 1, 3, 3], [1, 1, 3, 3])

Explanation: Zero weights create empty intervals. Query 1/2 scales to cumulative value 5, which is exactly the end of index 1's interval, so the strictly-greater rule moves the answer to index 3.

Input: ([1, 3, 2, 4], [(0, 1), (1, 10), (39, 100), (3, 5), (9, 10)])

Expected Output: ([0, 1, 1, 3, 3], [0, 1, 1, 3, 3])

Explanation: The prefix sums are [1, 4, 6, 10]. For example, 1/10 corresponds to threshold 1, so the first prefix strictly greater than 1 is 4 at index 1.

Input: ([3, 1], [])

Expected Output: ([], [])

Explanation: With no queries, both result lists are empty.

Hints

  1. Method A compares `u` against cumulative probabilities `prefix[i] / total`.
  2. Method B compares `u * total` against cumulative raw weights, so you do not need to normalize explicitly.
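Both methods can be implemented exactly with `fractions.Fraction`, which avoids floating-point boundary errors on queries like 1/3. A sketch (the name `sample_both_ways` is mine):

```python
import bisect
from fractions import Fraction

def sample_both_ways(weights, queries):
    """Return (method A results, method B results) for rational queries (a, b)."""
    prefix = []
    total = 0
    for w in weights:
        total += w
        prefix.append(total)
    # Method A: normalized cumulative probabilities prefix[i] / total.
    probs = [Fraction(p, total) for p in prefix]
    res_a, res_b = [], []
    for a, b in queries:
        u = Fraction(a, b)  # exact rational query in [0, 1)
        res_a.append(bisect.bisect_right(probs, u))
        # Method B: scale the query by the total instead of normalizing.
        res_b.append(bisect.bisect_right(prefix, u * total))
    return res_a, res_b
```

Both searches use `bisect_right`, so a query landing exactly on a cumulative boundary advances to the next index, matching the strictly-greater rule; the two result lists therefore always agree.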

Part 3: Expected number of rejection-sampling trials

Suppose outcome `i` has probability `weights[i] / scale`, and the total probability mass `S = sum(weights) / scale` satisfies `0 < S <= 1`. In one trial, you either output some index with total probability `S`, or reject with probability `1 - S` and try again. Compute the expected number of trials until the first accepted output, and return it as a reduced fraction `(numerator, denominator)`.

Constraints

  • 1 <= len(weights) <= 200000
  • 0 <= weights[i] <= 10^9
  • 1 <= scale <= 10^18
  • 0 < sum(weights) <= scale

Examples

Input: ([1, 2], 10)

Expected Output: (10, 3)

Explanation: The total acceptance probability is (1 + 2) / 10 = 3/10. The expected number of trials is 1 / (3/10) = 10/3.

Input: ([2, 3, 5], 10)

Expected Output: (1, 1)

Explanation: The total acceptance probability is 10/10 = 1, so the first trial always succeeds. The expectation is 1.

Input: ([0, 4, 0], 8)

Expected Output: (2, 1)

Explanation: The total acceptance probability is 4/8 = 1/2. The expected number of trials is 1 / (1/2) = 2.

Input: ([6], 9)

Expected Output: (3, 2)

Explanation: With a single outcome, the acceptance probability is 6/9 = 2/3. The expected number of trials is 1 / (2/3) = 3/2.

Input: ([7, 7], 28)

Expected Output: (2, 1)

Explanation: The total acceptance probability is 14/28 = 1/2. The expected number of trials is 2.

Hints

  1. A single trial succeeds with probability `S = sum(weights) / scale`.
  2. The number of trials until first success follows a geometric distribution with expectation `1 / S`.
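Since the trial count is geometric with success probability `S = sum(weights) / scale`, the expectation is `1 / S = scale / sum(weights)`, and the whole computation reduces to one exact division. A sketch using `fractions.Fraction`, which reduces to lowest terms automatically (the name `expected_trials` is mine):

```python
from fractions import Fraction

def expected_trials(weights, scale):
    """Expected rejection-sampling trials as a reduced fraction (num, den)."""
    # E[trials] = 1 / S = scale / sum(weights); Fraction reduces by gcd.
    e = Fraction(scale, sum(weights))
    return (e.numerator, e.denominator)
```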

Part 4: Compressed sampler for large concentrated distributions

When `M` is very large but only a small support carries positive probability mass, a practical optimization is to compress the distribution to just the positive-weight indices. Given `weights` and many ticket queries `t`, preprocess the positive support and answer each query using only that compressed representation. Return both the support indices and the sampled original indices. This optimization is especially useful for many queries on a static distribution; the trade-off is an `O(M)` preprocessing pass and rebuild cost if weights change frequently.

Constraints

  • 1 <= len(weights) <= 200000
  • 0 <= weights[i] <= 10^9
  • sum(weights) > 0
  • 1 <= len(queries) <= 200000
  • For every query `t`: 0 <= t < sum(weights)
  • The distribution is static across all queries

Examples

Input: ([0, 0, 3, 0, 2], [0, 2, 3, 4])

Expected Output: ([2, 4], [2, 2, 4, 4])

Explanation: Only indices 2 and 4 have positive weight, so the compressed support is [2, 4]. Their cumulative weights are [3, 5]. Tickets 0 and 2 fall in index 2's range, while 3 and 4 fall in index 4's range.

Input: ([7, 0, 0, 0], [0])

Expected Output: ([0], [0])

Explanation: Index 0 is the only positive-weight index, so every valid ticket maps to 0.

Input: ([], [])

Expected Output: ([], [])

Explanation: With no weights and no queries, both the support and sampled outputs are empty.

Input: ([1, 0, 1, 0, 1], [0, 1, 2])

Expected Output: ([0, 2, 4], [0, 2, 4])

Explanation: Each positive index has weight 1, so the three ticket positions map directly to indices 0, 2, and 4.

Input: ([0, 5, 0, 2], [4, 5, 6])

Expected Output: ([1, 3], [1, 3, 3])

Explanation: Compressed support is [1, 3] with cumulative weights [5, 7]. Ticket 4 still belongs to index 1, while tickets 5 and 6 belong to index 3.

Hints

  1. Indices with weight 0 can never be sampled, so they do not need to appear in the search structure.
  2. Build prefix sums only over the positive-weight support, then binary-search each ticket there.
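A sketch of the compressed sampler (the name `compressed_sampler` is mine): preprocess once over the positive support, then answer each ticket with a binary search over the compressed prefix sums and map back to the original index.

```python
import bisect

def compressed_sampler(weights, queries):
    """Return (support indices, sampled original index per ticket query)."""
    # Zero-weight indices can never be sampled, so keep only the support.
    support = [i for i, w in enumerate(weights) if w > 0]
    prefix = []
    total = 0
    for i in support:
        total += weights[i]
        prefix.append(total)
    # Binary-search each ticket over the compressed prefix sums (length K),
    # then translate the compressed position back to the original index.
    sampled = [support[bisect.bisect_right(prefix, t)] for t in queries]
    return support, sampled
```

With support size `K`, each query costs O(log K) instead of O(log M) and the search structure uses O(K) memory; the costs are the O(M) pass to find the support and a full rebuild whenever weights change.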
Last updated: May 6, 2026
