Calculate 95% Bootstrap Confidence Interval for Order Values
Company: Pinterest
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
##### Scenario
An e-commerce firm wants a 95% confidence interval for the average order value but only has a single historical sample of order amounts.
##### Question
Given an array of past order values, write efficient Python code to return the 95% bootstrap confidence interval using 10,000 resamples. Explain your approach and any performance optimizations.
##### Hints
Use vectorized resampling (np.random.choice, or the choice method of a NumPy Generator) and percentile bounds; avoid a Python loop over individual resamples.
Quick Answer: This question evaluates a data scientist's competence in statistical inference using bootstrap resampling, proficiency with numerical computing for large sample operations, and attention to performance optimization.
Given a non-empty list of historical order values (floats), compute a two-sided 95% bootstrap confidence interval for the mean using exactly 10,000 resamples with replacement. For reproducibility, draw the resamples with NumPy's Generator-based RNG, i.e. the choice method of numpy.random.default_rng(seed). Return the 2.5th and 97.5th percentiles of the bootstrap sample means as a list [low, high], rounded to 6 decimal places. If every value in the list is identical (which includes a single-element list), both bounds equal that value.
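The procedure described above can be written compactly as follows. The notation ($x_i$, $x^{*}_{b,i}$, $\bar{x}^{*}_{b}$, $\hat{q}_{p}$) is introduced here for illustration and does not appear in the original problem statement.

```latex
% Original sample: x_1, ..., x_n.  Number of resamples: B = 10{,}000.
% For each b = 1, ..., B, draw x^*_{b,1}, ..., x^*_{b,n} uniformly
% with replacement from {x_1, ..., x_n} and take the resample mean:
\bar{x}^{*}_{b} = \frac{1}{n} \sum_{i=1}^{n} x^{*}_{b,i}, \qquad b = 1, \dots, B.

% The percentile interval uses the empirical quantiles \hat{q}_{p}
% of the B resample means:
\mathrm{CI}_{95\%} = \left[\, \hat{q}_{0.025},\; \hat{q}_{0.975} \,\right].
```

When all $x_i$ are equal, every $\bar{x}^{*}_{b}$ equals that common value, so both quantiles coincide with it, which matches the edge case stated above.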
##### Constraints
- 1 <= len(order_values) <= 5000
- Order values are finite, non-negative floats (zero is allowed)
- Use exactly B = 10,000 bootstrap resamples with replacement
- RNG must be numpy.random.default_rng(seed) for determinism
- Percentile bounds are [2.5, 97.5]
- Return a list of two floats rounded to 6 decimals
##### Solution
import numpy as np

def bootstrap_ci_95(order_values: list[float], seed: int = 42) -> list[float]:
    arr = np.asarray(order_values, dtype=float)
    if arr.size == 0:
        raise ValueError("order_values must be non-empty")
    B = 10000
    n = arr.size
    rng = np.random.default_rng(seed)
    # Choose a batch size to balance memory and speed
    # Ensures batch * n is bounded to keep memory reasonable
    max_draws = 5_000_000  # adjust as needed for environment
    batch = max(1, min(B, int(max_draws // max(1, n))))
    means = np.empty(B, dtype=float)
    start = 0
    while start < B:
        bs = min(batch, B - start)
        samples = rng.choice(arr, size=(bs, n), replace=True)
        means[start:start + bs] = samples.mean(axis=1)
        start += bs
    low, high = np.percentile(means, [2.5, 97.5])
    return [round(float(low), 6), round(float(high), 6)]
##### Explanation
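Convert the input to a NumPy float array and reject empty input. Create the generator with default_rng(seed) so the sampling is reproducible. Draw the bootstrap resamples with replacement in vectorized batches: each batch is a (bs, n) matrix whose rows are individual resamples, and the batch size is capped so that bs * n stays at or below max_draws (5,000,000 values, about 40 MB as float64). For each batch, compute the row-wise means and write them into a preallocated array of length 10,000. Once all 10,000 bootstrap means are collected, take their 2.5th and 97.5th percentiles to form the two-sided 95% confidence interval, and round both bounds to 6 decimals before returning. The only Python-level loop runs over batches, not over individual resamples.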
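Time complexity: O(B * n) for drawing the resamples and averaging them; the percentile step over B values is small by comparison. Space complexity: O(B + batch * n), where B is the array of means and batch * n is the largest sample matrix held at once.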
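To make the batch * n term concrete, the sketch below reproduces the batch-size rule from the solution and reports how many batches it yields and how large one sample matrix is. The helper name batch_plan is hypothetical, and the byte count covers only the float64 sample matrix, not any temporaries NumPy may allocate internally.

```python
def batch_plan(n, n_resamples=10_000, max_draws=5_000_000):
    """Return (batch, n_batches, sample_bytes) for the solution's batching rule.

    Hypothetical helper for illustration; it mirrors the expression
    max(1, min(B, max_draws // max(1, n))) used in the solution.
    """
    batch = max(1, min(n_resamples, max_draws // max(1, n)))
    n_batches = -(-n_resamples // batch)  # ceiling division
    sample_bytes = batch * n * 8          # float64 uses 8 bytes per value
    return batch, n_batches, sample_bytes


if __name__ == "__main__":
    for n in (1, 500, 5000):
        print(n, batch_plan(n))
```

Under these defaults, any input with n <= 500 fits in a single batch, while the largest allowed input (n = 5000) is split into 10 batches of 1,000 resamples each.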
##### Hints
- Use numpy.random.default_rng(seed).choice to generate resamples in a vectorized way.
- Compute means along axis=1 and then np.percentile at [2.5, 97.5].
- To limit memory, generate resamples in batches (e.g., 1000 at a time) while keeping vectorization within each batch.
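For small inputs, the first two hints can be followed without any batching. The sketch below is a simplified variant for illustration, not the reference solution: the name bootstrap_ci_95_unbatched is hypothetical, and it holds the full (10,000, n) sample matrix in memory at once, so it is only reasonable when n is small.

```python
import numpy as np


def bootstrap_ci_95_unbatched(order_values, seed=42, n_resamples=10_000):
    """Percentile bootstrap CI for the mean, drawn in one vectorized call.

    Hypothetical simplified variant: no batching, so memory grows as
    n_resamples * len(order_values).
    """
    arr = np.asarray(order_values, dtype=float)
    if arr.size == 0:
        raise ValueError("order_values must be non-empty")
    rng = np.random.default_rng(seed)
    # Each row of `samples` is one resample of size n, drawn with replacement.
    samples = rng.choice(arr, size=(n_resamples, arr.size), replace=True)
    means = samples.mean(axis=1)
    low, high = np.percentile(means, [2.5, 97.5])
    return [round(float(low), 6), round(float(high), 6)]


if __name__ == "__main__":
    print(bootstrap_ci_95_unbatched([10.0, 20.0, 30.0, 40.0, 50.0]))
    print(bootstrap_ci_95_unbatched([19.99] * 5))  # all values identical
```

Calling it twice with the same seed returns the same interval, and an input of identical values collapses to a zero-width interval at that value.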