PracHub

Design long-tail search evaluation under label budget

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competencies in experimental design and analytics: stratified sampling with Neyman allocation, importance-weighted estimators, counterfactual click-based evaluation with propensity-aware methods, variance estimation and power analysis, and active-learning and drift-monitoring strategies for long-tail query evaluation. It is commonly asked in the Analytics & Experimentation domain to verify the ability to build statistically rigorous, budget-constrained evaluation pipelines that combine limited human labels with logged click data. It tests both conceptual understanding (sampling, causal identification, statistical guarantees) and practical application (variance computation, power calculations, and operational monitoring).

  • hard
  • Google
  • Analytics & Experimentation
  • Data Scientist

Design long-tail search evaluation under label budget

Company: Google

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: hard

Interview Round: Technical Screen

Your search engine serves 100M queries/day. Query frequencies follow Pareto(α = 1.1); the top 1% of queries drive 70% of traffic, and the remaining 99% form the long tail. You can obtain at most 2,000 human relevance judgments/week for query–document pairs (graded 0–3). You want to estimate the change in overall and long-tail NDCG@10 between a baseline and a new ranker with ±0.01 absolute precision (95% CI) and ensure the long-tail (bottom 50% frequency bucket) does not degrade by ≥0.02.

  • Propose a sampling design (stratification buckets and Neyman allocation) and an importance-weighted estimator that yields unbiased metrics for both overall and long-tail; provide formulas for the estimator and its variance.
  • Show how you would combine small-scale judged data with counterfactual evaluation on clicks via IPS/DR; specify the propensity requirements and how you'd collect them (e.g., randomized interleaving or a logging policy with known propensities).
  • Perform a back-of-envelope power analysis for detecting a 0.01 NDCG@10 difference, stating any variance assumptions you need.
  • Describe an active-learning loop to target the long tail and reduce label cost, and how you'd monitor drift week-over-week.
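As a way into the first part, Neyman allocation assigns each stratum a label count proportional to its traffic weight times its within-stratum standard deviation. A minimal sketch, assuming three hypothetical frequency buckets with illustrative traffic shares and per-query ΔNDCG@10 standard deviations (none of these numbers come from the question itself):

```python
def neyman_allocation(n_total, traffic_weights, stdevs):
    """Neyman allocation: n_h proportional to W_h * S_h, where W_h is a
    stratum's traffic share and S_h the estimated stdev of per-query
    ΔNDCG@10 within that stratum."""
    products = [w * s for w, s in zip(traffic_weights, stdevs)]
    total = sum(products)
    return [round(n_total * p / total) for p in products]

# Hypothetical frequency buckets: head, torso, long tail.
weights = [0.70, 0.20, 0.10]   # assumed share of traffic per bucket
stdevs  = [0.05, 0.10, 0.20]   # assumed per-query ΔNDCG@10 stdev per bucket
alloc = neyman_allocation(2000, weights, stdevs)  # 2,000 judgments/week
print(alloc)  # [933, 533, 533]
```

Because the long-tail strata are assumed noisier, Neyman shifts labels toward them: the head gets 933 labels here versus the 1,400 a traffic-proportional design would give it. The importance weights W_h / (n_h / N_h) then undo the over-sampling when the strata are recombined into an overall estimate.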


Related Interview Questions

  • Design an A/B test for search ranking - Google (easy)
  • Design an Unbiased Upgrade Experiment - Google (hard)
  • Design a Causal Upgrade Experiment - Google (hard)
  • Design an experiment to measure latency impact - Google (medium)
  • How would you use propensity score matching here - Google (medium)
Posted: Oct 13, 2025, 9:49 PM

Estimating ΔNDCG@10 With Limited Labels Under a Heavy-Tailed Query Mix

You serve ~100M queries/day. Query frequencies follow a Pareto distribution with α = 1.1. Empirically, the top 1% of queries drive ~70% of traffic and the remaining 99% form a long tail. You can collect at most 2,000 human relevance judgments/week for query–document pairs (graded 0–3).
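As a sanity check on these numbers: for an idealized continuous Pareto(α) population, the share of total mass contributed by the top fraction p follows from the Lorenz curve, L(u) = 1 − (1 − u)^((α−1)/α), so the top-p share is p^((α−1)/α). A quick sketch (real query-frequency data is discrete, so this is only an approximation):

```python
def top_share(p, alpha):
    """Share of total mass from the top fraction p of a Pareto(alpha)
    population (alpha > 1), via the Lorenz curve
    L(u) = 1 - (1 - u)**((alpha - 1) / alpha)."""
    return p ** ((alpha - 1) / alpha)

share = top_share(0.01, 1.1)  # top 1% of queries under alpha = 1.1
print(f"{share:.0%}")  # 66%
```

So α = 1.1 puts the top 1% at roughly two-thirds of traffic, consistent with the stated ~70%.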

Goal: estimate the change in NDCG@10 between a baseline and a new ranker with ±0.01 absolute precision at 95% confidence for:

  • Overall traffic
  • Long tail (defined as the bottom 50% of queries by frequency)

Also ensure a guardrail: the long-tail NDCG@10 must not degrade by ≥0.02.
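The guardrail amounts to a non-inferiority check: the lower confidence bound on long-tail ΔNDCG@10 must stay above −0.02. A minimal sketch of the weekly check, using hypothetical point estimates and standard errors (a one-sided z test is one reasonable choice, not the only one):

```python
def guardrail_passes(delta_hat, se, margin=-0.02, z=1.645):
    """One-sided 95% non-inferiority check: pass if the lower confidence
    bound on long-tail ΔNDCG@10 exceeds the degradation margin."""
    lower = delta_hat - z * se
    return lower > margin

# Hypothetical weekly estimates for the long-tail bucket.
print(guardrail_passes(delta_hat=-0.004, se=0.006))  # lower bound ~ -0.014 -> True
print(guardrail_passes(delta_hat=-0.010, se=0.008))  # lower bound ~ -0.023 -> False
```

Note the second case fails even though the point estimate (−0.010) is within the margin: with a standard error that large, a ≥0.02 degradation cannot be ruled out.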

Tasks:

  1. Propose a stratified sampling design over query frequency buckets, use Neyman allocation for the judged sample, and give an importance-weighted estimator that is unbiased for both the overall metric and the long-tail metric. Include the variance formula.
  2. Explain how to combine the small judged sample with counterfactual evaluation on clicks via IPS/DR. Specify propensity requirements and how to collect them (e.g., randomized interleaving or a logging policy with known propensities).
  3. Provide a back-of-the-envelope power analysis for detecting a 0.01 difference in NDCG@10, including any variance assumptions you make.
  4. Describe an active-learning loop that targets the long tail to reduce label cost and how you would monitor drift week-over-week.
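For task 3, the standard paired-design calculation gives n ≈ ((z_{α/2} + z_β) · σ / δ)² judged queries, where σ is the standard deviation of per-query NDCG@10 differences. A back-of-envelope sketch at 95% confidence and 80% power, with σ = 0.15 as a purely illustrative assumption:

```python
import math

def queries_needed(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Judged queries needed to detect a paired ΔNDCG@10 of `delta`
    at 95% confidence / 80% power, assuming per-query differences
    with standard deviation `sigma`."""
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# sigma = 0.15 is an assumed value, not measured from any real corpus.
n = queries_needed(delta=0.01, sigma=0.15)
print(n)  # 1764
```

At roughly 10 judged documents per query, 1,764 queries would need far more than the 2,000 pair/week budget, which is exactly why the question pushes toward variance reduction via stratification and augmentation with click-based IPS/DR estimates.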



© 2026 PracHub. All rights reserved.