Randomized Sampling Algorithms

What's being tested

Interviewers are probing randomized sampling correctness: whether your sample is unbiased, representative, and efficient under constraints like class imbalance, streaming data, or weighted probabilities. For a Data Scientist, the key is connecting algorithmic sampling choices to downstream model bias, metric validity, and generalization.

Patterns & templates

Reservoir sampling for unknown-length streams — keep k items; replace index randint(0, i) when < k; O(n) time, O(k) space.
Weighted categorical sampling with prefix sums — build cumulative weights, draw u ~ Uniform(0, total), binary search via bisect; O(log k) per sample.
Alias method for repeated weighted draws — preprocess probabilities into prob and alias tables; O(k) setup, O(1) sampling; watch normalization errors.
Stratified sampling for imbalanced labels — sample within each class or segment, then reweight losses/metrics using population prevalence to avoid biased estimates.
Sample-to-population validation — compare feature distributions using KS tests, standardized mean differences, label prevalence, and segment-level metric drift.
Model evaluation under imbalance — prefer PR-AUC, recall@k, calibration, cost-weighted loss; accuracy often hides minority-class failure.
Tree overfitting control — tune max_depth, min_samples_leaf, subsample, colsample_bytree, and validation splits; resampling can amplify leakage.

Common pitfalls

Pitfall: Treating an oversampled training set as if it reflects production prevalence; correct with class weights, prior correction, or population-weighted evaluation.

Pitfall: Implementing weighted sampling with repeated list expansion; it is memory-heavy and fails when weights are large or non-integer.

Pitfall: Forgetting that reservoir sampling replacement probability changes with stream index; every item must end with probability k / n.

Practice these

The practice cards below cover the canonical variants — solve all of them and time yourself.

What's being tested

Patterns & templates

Common pitfalls

Practice these

Featured in interview prep guides

Practice questions

Related concepts