This question evaluates competency in designing composite success metrics, experiment design, and evaluation pipelines for search features. Within the Analytics & Experimentation domain for Data Scientist roles, it tests metric formulation, calibration to business outcomes, aggregation strategies, handling of missing or correlated labels, statistical power and sample-size reasoning, and label collection and monitoring. It is commonly asked to assess whether a candidate can align measurement with business objectives and reason about trade-offs and biases, covering both conceptual understanding and practical application in running online experiments and building offline evaluation pipelines.

A new search feature is evaluated with two binary labels per query: relevance (1/0) and accuracy (1/0). 1) Propose a composite success metric that combines these two labels. Give the exact scoring rule (e.g., AND, weighted score, or lexicographic), justify the choice under different error costs, and show how you would calibrate weights against business outcomes. 2) Define the aggregation level (query, session, user, or day) and explain how you would handle multiple queries per user, missing labels, and correlated outcomes. 3) Design an online experiment: unit of randomization, primary and guardrail metrics, power and sample-size inputs, pre-registration of the analysis, and a plan to mitigate p-hacking across many slices. 4) Propose an offline evaluation pipeline (label collection, inter-rater agreement, golden sets) and explain how you would monitor label drift and Simpson's paradox when segmenting by intent or locale.
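For part 1, a minimal sketch of the three candidate scoring rules and one simple weight-calibration approach, assuming per-query labels sit in a pandas DataFrame with hypothetical columns relevance, accuracy, and a downstream business-outcome column (e.g., task success):

```python
import numpy as np
import pandas as pd

def composite_scores(df: pd.DataFrame, w: float = 0.7) -> pd.DataFrame:
    """Attach three candidate composite scores to per-query labels.

    Assumes `relevance` and `accuracy` are 0/1 integer columns.
    """
    out = df.copy()
    # AND rule: success only if both labels are positive (strictest; suits
    # settings where either error type is costly).
    out["score_and"] = (out["relevance"] & out["accuracy"]).astype(int)
    # Weighted rule: trades off the two error costs via w in [0, 1].
    out["score_weighted"] = w * out["relevance"] + (1 - w) * out["accuracy"]
    # Lexicographic rule: relevance dominates; accuracy only adds a small
    # secondary bonus among relevant results.
    out["score_lex"] = out["relevance"] + 0.1 * out["relevance"] * out["accuracy"]
    return out

def calibrate_weight(df: pd.DataFrame, outcome: str) -> float:
    """Pick w so the weighted score tracks a business outcome column.

    One simple approach: regress the outcome on the two labels (OLS) and
    normalize the coefficients to sum to 1.
    """
    X = df[["relevance", "accuracy"]].to_numpy(dtype=float)
    y = df[outcome].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(
        np.column_stack([np.ones(len(y)), X]), y, rcond=None
    )
    b_rel, b_acc = coef[1], coef[2]
    return float(b_rel / (b_rel + b_acc))
```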
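For part 2, one sketch of aggregating per-query scores up to the user level so the analysis unit matches a user-level randomization unit; the user_id and score column names are assumptions, and dropping unlabeled queries is only unbiased if labels are missing at random (otherwise the candidate should discuss imputation or reweighting, or keep a query-level ratio metric with cluster-robust standard errors):

```python
import pandas as pd

def user_level_metric(df: pd.DataFrame, score_col: str = "score_and") -> pd.Series:
    """Collapse per-query scores to one value per user.

    Averaging within user first avoids over-weighting heavy searchers and
    removes within-user correlation from a downstream two-sample test, at
    the cost of weighting light and heavy users equally.
    """
    labeled = df.dropna(subset=[score_col])  # assumes labels missing at random
    return labeled.groupby("user_id")[score_col].mean()
```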
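For part 3, an approximate sample-size calculation for a two-sided two-proportion z-test on the composite success rate; the baseline rate and minimum detectable effect are inputs the candidate would justify from historical data:

```python
from scipy.stats import norm

def users_per_arm(p0: float, mde_abs: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm for a two-sided two-proportion z-test.

    p0      : baseline success rate of the composite metric
    mde_abs : minimum detectable absolute lift (e.g., 0.01 for +1 point)
    """
    p1 = p0 + mde_abs
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    pbar = (p0 + p1) / 2
    n = ((z_a * (2 * pbar * (1 - pbar)) ** 0.5
          + z_b * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5) ** 2) / mde_abs ** 2
    return int(round(n))

# Example: users_per_arm(0.60, 0.01) is roughly 37,500 users per arm.
```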
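Also for part 3, one way to keep many slice analyses honest is to pre-register the slices and apply a false-discovery-rate correction rather than reading raw p-values; a sketch assuming a hypothetical dict of slice-level p-values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def significant_slices(slice_pvalues: dict[str, float],
                       alpha: float = 0.05) -> dict[str, bool]:
    """Apply Benjamini-Hochberg FDR control to p-values from many slice
    analyses so exploratory segment cuts do not inflate false positives."""
    names = list(slice_pvalues)
    pvals = np.array([slice_pvalues[k] for k in names])
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return dict(zip(names, map(bool, reject)))
```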
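For part 4, a sketch of rater-agreement and label-drift checks on a golden set, using Cohen's kappa and a simple trailing-z-score drift flag; the weekly positive-rate series is an assumed monitoring input:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
    """Chance-corrected agreement (Cohen's kappa) between two raters
    labeling the same items."""
    return cohen_kappa_score(rater_a, rater_b)

def drift_flag(weekly_pos_rate: np.ndarray, z_thresh: float = 3.0) -> bool:
    """Flag label drift when the latest weekly positive rate deviates from
    the trailing mean by more than z_thresh trailing standard deviations."""
    history, latest = weekly_pos_rate[:-1], weekly_pos_rate[-1]
    mu, sd = history.mean(), history.std(ddof=1)
    return abs(latest - mu) > z_thresh * sd
```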
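Also for part 4, a hypothetical check for Simpson's paradox: compare the pooled treatment-control delta against per-segment deltas (by intent or locale) and look for sign flips driven by unequal traffic mix; the arm, score, and segment column names are assumptions:

```python
import pandas as pd

def simpson_check(df: pd.DataFrame, segment_col: str,
                  arm_col: str = "arm", score_col: str = "score") -> pd.DataFrame:
    """Compare the overall treatment-control delta with per-segment deltas.

    A sign flip between the pooled delta and most segment deltas suggests
    Simpson's paradox, usually from unequal traffic mix across segments.
    """
    def delta(g: pd.DataFrame) -> float:
        means = g.groupby(arm_col)[score_col].mean()
        return means.get("treatment", float("nan")) - means.get("control", float("nan"))

    per_segment = df.groupby(segment_col).apply(delta).rename("delta")
    return per_segment.to_frame().assign(overall_delta=delta(df))
```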