This question evaluates a data scientist's competencies in experimental design and analytics: stratified sampling and Neyman allocation, importance-weighted estimators, counterfactual click-based evaluation with propensity-aware methods, variance estimation and power analysis, and active-learning and drift-monitoring strategies for long-tail query evaluation. Commonly asked in the Analytics & Experimentation domain, it verifies the ability to build statistically rigorous, budget-constrained evaluation pipelines that combine limited human labels with logged click data. It tests both conceptual understanding of sampling, causal identification, and statistical guarantees, and practical skill in variance computation, power calculations, and operational monitoring.
You serve ~100M queries/day. Query frequencies follow a Pareto distribution with α = 1.1. Empirically, the top 1% of queries drive ~70% of traffic and the remaining 99% form a long tail. You can collect at most 2,000 human relevance judgments/week for query–document pairs (graded 0–3).
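Given the head/tail traffic split and the fixed labeling budget, one natural first step is Neyman allocation of the 2,000 weekly judgments across strata. A minimal sketch, assuming two strata with the stated traffic weights and placeholder per-stratum standard deviations of NDCG@10 (these sigmas would come from pilot labels, not from the prompt):

```python
# Neyman allocation: n_h proportional to W_h * sigma_h, where W_h is the
# stratum's traffic weight and sigma_h its assumed NDCG@10 standard deviation.
weekly_budget = 2000

strata = {
    "head": {"weight": 0.70, "sigma": 0.15},  # top 1% of queries, ~70% of traffic
    "tail": {"weight": 0.30, "sigma": 0.30},  # remaining 99%; assumed noisier
}

denom = sum(s["weight"] * s["sigma"] for s in strata.values())
allocation = {
    name: round(weekly_budget * s["weight"] * s["sigma"] / denom)
    for name, s in strata.items()
}
print(allocation)  # {'head': 1077, 'tail': 923}
```

Note that even with a smaller traffic share, the tail's higher assumed variance pulls a near-equal share of the budget, which is the point of Neyman over proportional allocation.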
Goal: estimate the change in NDCG@10 between a baseline and a new ranker with ±0.01 absolute precision at 95% confidence for:
Also enforce a guardrail: long-tail NDCG@10 must not degrade by 0.02 or more (absolute).
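A quick normal-approximation sizing sketch covers both requirements: the ±0.01 confidence-interval goal and the 0.02 tail guardrail framed as a one-sided non-inferiority test. The standard deviations of paired per-query ΔNDCG@10 below are assumed placeholders, not figures from the prompt:

```python
import math

# (1) Two-sided 95% CI with ±0.01 half-width: n = (z * sigma / E)^2.
z = 1.96              # two-sided 95% normal quantile
sigma = 0.15          # assumed SD of paired per-query ΔNDCG@10
half_width = 0.01
n_ci = math.ceil((z * sigma / half_width) ** 2)

# (2) One-sided non-inferiority test for the tail guardrail:
# H0: tail ΔNDCG@10 <= -0.02 (degradation); reject to clear the guardrail.
z_alpha, z_beta = 1.645, 0.84   # one-sided alpha = 0.05, 80% power
sigma_tail = 0.30               # assumed SD in the tail (noisier stratum)
margin = 0.02
n_tail = math.ceil(((z_alpha + z_beta) * sigma_tail / margin) ** 2)

print(n_ci, n_tail)  # 865 1390
```

Since NDCG@10 needs up to ~10 judged documents per query, 2,000 judgments/week covers on the order of 200 queries, so query counts of this size imply pooling labels over several weeks or leaning on logged clicks; that tension is part of what the tasks probe.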
Tasks: