This question evaluates a candidate's competency in statistical model evaluation for rare-event detection: selecting appropriate metrics for imbalanced data, estimating confidence intervals and running hypothesis tests with small samples, and reasoning about calibration, thresholding, paired versus unpaired comparisons, and cost/alert-budget trade-offs. Commonly asked in machine-learning and data-science interviews, it tests both conceptual understanding of statistical assumptions and the practical use of limited aggregated results to quantify uncertainty and support decision-making. It falls under machine learning and statistical inference, spanning both conceptual and practical levels of abstraction.
You are evaluating two models (Model A and Model B) for rare-event detection (e.g., fraud, abuse, medical adverse event). Positives are extremely rare.
You are given only limited evaluation results for each model (e.g., a small number of aggregated counts such as TP/FP/FN/TN, or precision/recall at a chosen threshold). Assume you have enough information to derive confusion-matrix counts, but that the number of positives is small.
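To make the setup concrete, here is a minimal sketch in Python (assuming SciPy is available) of how confusion-matrix counts can be recovered from precision/recall at a fixed threshold, and how an exact Clopper-Pearson interval exposes the uncertainty that comes from having few positives. The metric values, the positive count, and the dataset size below are all hypothetical illustrations, not part of the question.

```python
from scipy.stats import beta

def counts_from_precision_recall(precision, recall, n_pos, n_total):
    """Recover approximate confusion-matrix counts from precision/recall
    at a fixed threshold, given the number of actual positives (n_pos)
    and the total number of evaluated examples (n_total)."""
    tp = round(recall * n_pos)
    fn = n_pos - tp
    fp = round(tp * (1.0 - precision) / precision) if precision > 0 else 0
    tn = n_total - tp - fn - fp
    return tp, fp, fn, tn

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion k/n;
    appropriate when n (e.g., the number of positives) is small."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Hypothetical numbers: precision 0.50, recall 0.60, 25 positives, 100,000 rows.
tp, fp, fn, tn = counts_from_precision_recall(0.50, 0.60, 25, 100_000)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("95% CI for recall:", clopper_pearson(tp, tp + fn))   # wide: few positives
print("95% CI for precision:", clopper_pearson(tp, tp + fp))
```

With only 25 positives, the recall interval is very wide, which is exactly the small-sample uncertainty the question asks the candidate to quantify.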
State your assumptions (paired vs unpaired evaluation, fixed threshold vs full curve, etc.).
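As one illustration of why the paired/unpaired assumption matters: if both models were scored on the same labeled examples (a paired evaluation), the comparison can condition on the positives where the models disagree. Below is a hedged sketch of an exact McNemar-style test; the discordant counts b and c are hypothetical, and the paired-evaluation assumption is exactly the kind of assumption the question asks you to state.

```python
from scipy.stats import binomtest

# Hypothetical paired disagreement counts on the actual positives:
# b = positives Model A catches but Model B misses
# c = positives Model B catches but Model A misses
b, c = 7, 2

# Exact McNemar test: under H0 (equal recall), each discordant positive is
# equally likely to fall in b or c, so b ~ Binomial(b + c, 0.5).
result = binomtest(b, b + c, p=0.5, alternative="two-sided")
print(f"Discordant positives: b={b}, c={c}, p-value={result.pvalue:.3f}")
```

If the evaluation is unpaired (different samples for each model), this conditioning is unavailable and a two-proportion comparison with its own, weaker assumptions would be needed instead.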