PracHub

Design long-tail search evaluation under label budget

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competencies in experimental design and analytics: stratified sampling with Neyman allocation, importance-weighted estimators, counterfactual click-based evaluation with propensity-aware methods, variance estimation and power analysis, and active-learning and drift-monitoring strategies for long-tail query evaluation. It is commonly asked in the Analytics & Experimentation domain to verify the ability to build statistically rigorous, budget-constrained evaluation pipelines that combine limited human labels with logged click data. It tests both conceptual understanding (sampling, causal identification, statistical guarantees) and practical application (variance computation, power calculations, and operational monitoring).

  • hard
  • Google
  • Analytics & Experimentation
  • Data Scientist

Design long-tail search evaluation under label budget

Company: Google

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: hard

Interview Round: Technical Screen

Your search engine serves 100M queries/day. Query frequencies follow Pareto(α = 1.1); the top 1% of queries drive 70% of traffic, and the remaining 99% form the long tail. You can obtain at most 2,000 human relevance judgments/week for query–document pairs (graded 0–3). You want to estimate the change in overall and long-tail NDCG@10 between a baseline and a new ranker with ±0.01 absolute precision (95% CI) and ensure the long-tail (bottom 50% frequency bucket) does not degrade by ≥0.02.

  • Propose a sampling design (stratification buckets and Neyman allocation) and an importance-weighted estimator that yields unbiased metrics for both overall and long-tail; provide formulas for the estimator and its variance.
  • Show how you would combine small-scale judged data with counterfactual evaluation on clicks via IPS/DR; specify the propensity requirements and how you'd collect them (e.g., randomized interleaving or a logging policy with known propensities).
  • Perform a back-of-envelope power analysis for detecting a 0.01 NDCG@10 difference, stating any variance assumptions you need.
  • Describe an active-learning loop to target the long tail and reduce label cost, and how you'd monitor drift week-over-week.
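As a way into the first part, Neyman allocation assigns each stratum a label count proportional to its traffic weight times its within-stratum standard deviation. A minimal sketch, assuming three hypothetical frequency buckets with illustrative traffic shares and per-query ΔNDCG@10 standard deviations (none of these numbers come from the question itself):

```python
def neyman_allocation(n_total, traffic_weights, stdevs):
    """Neyman allocation: n_h proportional to W_h * S_h, where W_h is a
    stratum's traffic share and S_h the estimated stdev of per-query
    ΔNDCG@10 within that stratum."""
    products = [w * s for w, s in zip(traffic_weights, stdevs)]
    total = sum(products)
    return [round(n_total * p / total) for p in products]

# Hypothetical frequency buckets: head, torso, long tail.
weights = [0.70, 0.20, 0.10]   # assumed share of traffic per bucket
stdevs  = [0.05, 0.10, 0.20]   # assumed per-query ΔNDCG@10 stdev per bucket
alloc = neyman_allocation(2000, weights, stdevs)  # 2,000 judgments/week
print(alloc)  # [933, 533, 533]
```

Because the long-tail strata are assumed noisier, Neyman shifts labels toward them: the head gets 933 labels here versus the 1,400 a traffic-proportional design would give it. The importance weights W_h / (n_h / N_h) then undo the over-sampling when the strata are recombined into an overall estimate.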


Related Interview Questions

  • Design an A/B test for search ranking - Google (easy)
  • Design an Unbiased Upgrade Experiment - Google (hard)
  • Design a Causal Upgrade Experiment - Google (hard)
  • Design an experiment to measure latency impact - Google (medium)
  • How would you use propensity score matching here - Google (medium)
Posted: Oct 13, 2025, 9:49 PM

Estimating ΔNDCG@10 With Limited Labels Under a Heavy-Tailed Query Mix

You serve ~100M queries/day. Query frequencies follow a Pareto distribution with α = 1.1. Empirically, the top 1% of queries drive ~70% of traffic and the remaining 99% form a long tail. You can collect at most 2,000 human relevance judgments/week for query–document pairs (graded 0–3).
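As a sanity check on these numbers: for an idealized continuous Pareto(α) population, the share of total mass contributed by the top fraction p follows from the Lorenz curve, L(u) = 1 − (1 − u)^((α−1)/α), so the top-p share is p^((α−1)/α). A quick sketch (real query-frequency data is discrete, so this is only an approximation):

```python
def top_share(p, alpha):
    """Share of total mass from the top fraction p of a Pareto(alpha)
    population (alpha > 1), via the Lorenz curve
    L(u) = 1 - (1 - u)**((alpha - 1) / alpha)."""
    return p ** ((alpha - 1) / alpha)

share = top_share(0.01, 1.1)  # top 1% of queries under alpha = 1.1
print(f"{share:.0%}")  # 66%
```

So α = 1.1 puts the top 1% at roughly two-thirds of traffic, consistent with the stated ~70%.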

Goal: estimate the change in NDCG@10 between a baseline and a new ranker with ±0.01 absolute precision at 95% confidence for:

  • Overall traffic
  • Long tail (defined as the bottom 50% of queries by frequency)

Also ensure a guardrail: the long-tail NDCG@10 must not degrade by ≥0.02.
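The guardrail amounts to a non-inferiority check: the lower confidence bound on long-tail ΔNDCG@10 must stay above −0.02. A minimal sketch of the weekly check, using hypothetical point estimates and standard errors (a one-sided z test is one reasonable choice, not the only one):

```python
def guardrail_passes(delta_hat, se, margin=-0.02, z=1.645):
    """One-sided 95% non-inferiority check: pass if the lower confidence
    bound on long-tail ΔNDCG@10 exceeds the degradation margin."""
    lower = delta_hat - z * se
    return lower > margin

# Hypothetical weekly estimates for the long-tail bucket.
print(guardrail_passes(delta_hat=-0.004, se=0.006))  # lower bound ~ -0.014 -> True
print(guardrail_passes(delta_hat=-0.010, se=0.008))  # lower bound ~ -0.023 -> False
```

Note the second case fails even though the point estimate (−0.010) is within the margin: with a standard error that large, a ≥0.02 degradation cannot be ruled out.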

Tasks:

  1. Propose a stratified sampling design over query frequency buckets, use Neyman allocation for the judged sample, and give an importance-weighted estimator that is unbiased for both the overall metric and the long-tail metric. Include the variance formula.
  2. Explain how to combine the small judged sample with counterfactual evaluation on clicks via IPS/DR. Specify propensity requirements and how to collect them (e.g., randomized interleaving or a logging policy with known propensities).
  3. Provide a back-of-the-envelope power analysis for detecting a 0.01 difference in NDCG@10, including any variance assumptions you make.
  4. Describe an active-learning loop that targets the long tail to reduce label cost and how you would monitor drift week-over-week.
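For task 3, the standard paired-design calculation gives n ≈ ((z_{α/2} + z_β) · σ / δ)² judged queries, where σ is the standard deviation of per-query NDCG@10 differences. A back-of-envelope sketch at 95% confidence and 80% power, with σ = 0.15 as a purely illustrative assumption:

```python
import math

def queries_needed(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Judged queries needed to detect a paired ΔNDCG@10 of `delta`
    at 95% confidence / 80% power, assuming per-query differences
    with standard deviation `sigma`."""
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# sigma = 0.15 is an assumed value, not measured from any real corpus.
n = queries_needed(delta=0.01, sigma=0.15)
print(n)  # 1764
```

At roughly 10 judged documents per query, 1,764 queries would need far more than the 2,000 pair/week budget, which is exactly why the question pushes toward variance reduction via stratification and augmentation with click-based IPS/DR estimates.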



© 2026 PracHub. All rights reserved.