Randomized Sampling Algorithms
Asked of: Data Scientist
Last updated
What's being tested
Interviewers are probing randomized sampling correctness: whether your sample is unbiased, representative, and efficient under constraints like class imbalance, streaming data, or weighted probabilities. For a Data Scientist, the key is connecting algorithmic sampling choices to downstream model bias, metric validity, and generalization.
Patterns & templates
-
Reservoir sampling for unknown-length streams — keep
kitems; replace indexrandint(0, i)when< k;O(n)time,O(k)space. -
Weighted categorical sampling with prefix sums — build cumulative weights, draw
u ~ Uniform(0, total), binary search viabisect;O(log k)per sample. -
Alias method for repeated weighted draws — preprocess probabilities into
probandaliastables;O(k)setup,O(1)sampling; watch normalization errors. -
Stratified sampling for imbalanced labels — sample within each class or segment, then reweight losses/metrics using population prevalence to avoid biased estimates.
-
Sample-to-population validation — compare feature distributions using
KStests, standardized mean differences, label prevalence, and segment-level metric drift. -
Model evaluation under imbalance — prefer
PR-AUC, recall@k, calibration, cost-weighted loss; accuracy often hides minority-class failure. -
Tree overfitting control — tune
max_depth,min_samples_leaf,subsample,colsample_bytree, and validation splits; resampling can amplify leakage.
Common pitfalls
Pitfall: Treating an oversampled training set as if it reflects production prevalence; correct with class weights, prior correction, or population-weighted evaluation.
Pitfall: Implementing weighted sampling with repeated list expansion; it is memory-heavy and fails when weights are large or non-integer.
Pitfall: Forgetting that reservoir sampling replacement probability changes with stream index; every item must end with probability
k / n.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Featured in interview prep guides
Practice questions
- Implement fast sampling for weighted k-sided dieLinkedIn · Data Scientist · Technical Screen · Medium
- Implement stream random sampling in PythonLinkedIn · Data Scientist · Technical Screen · medium
- Train with imbalanced sampled dataLinkedIn · Data Scientist · Technical Screen · medium
- Handle imbalance, sampling, and overfittingLinkedIn · Data Scientist · Technical Screen · medium
Related concepts
- Heaps, Top-K, And Streaming SelectionCoding & Algorithms
- Top-K Selection, Heaps, And RankingCoding & Algorithms
- Real-Time Top-K And Streaming AnalyticsSystem Design
- Top-K, Heaps, Quickselect, And Frequency AnalysisCoding & Algorithms
- Streaming, Large Inputs, And External MemorySoftware Engineering Fundamentals
- Heaps, Priority Queues, and Top-K SelectionCoding & Algorithms