This question tests a data scientist's ability to apply McNemar's test for comparing paired classifiers, a core competency in statistical hypothesis testing and model evaluation. It assesses practical knowledge of contingency table analysis, exact vs. asymptotic tests, confidence interval construction, and multiple testing correction in machine learning contexts.
You evaluated two classifiers A and B on the SAME 10,000 labeled examples. The paired outcomes are:
- Both correct n11 = 8,740; both wrong n00 = 740; A correct/B wrong n10 = 300; A wrong/B correct n01 = 220.
Answer:
1) Using McNemar's test with continuity correction, compute the test statistic and p-value for H0: error rates are equal. Show intermediate numbers (b, c, |b−c|, b+c).
2) Compute the exact binomial p-value for the same H0 using b+c trials. Explain when you prefer the exact test.
3) Give a 95% confidence interval for the accuracy difference (A−B) on paired data; state which method you use and why.
4) Discuss assumptions, when McNemar's test is inappropriate, and how you'd adjust if you compare A against 10 models (multiple testing control).
Paired Comparison of Two Classifiers via McNemar's Test
You evaluated two classifiers, A and B, on the same 10,000 labeled examples. Because both models see identical inputs, the natural unit of analysis is the paired outcome per example. The paired results are summarized in a 2×2 contingency table:
             B correct      B wrong
A correct    n11 = 8,740    n10 = 300
A wrong      n01 = 220      n00 = 740
Define the two discordant counts:
b = n10 = 300 (A correct, B wrong)
c = n01 = 220 (A wrong, B correct)
The concordant cells (n11, n00) are pairs where A and B agree; they carry no information about which model is better and are ignored by the test of interest.
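As a quick sanity check on the table, the marginal accuracies and the discordant counts can be recomputed from the four cells. This is a minimal sketch; the variable names are illustrative and the orientation (rows are A, columns are B) is assumed from the table above.

```python
# Paired outcome counts from the problem statement.
n11, n10, n01, n00 = 8740, 300, 220, 740

N = n11 + n10 + n01 + n00      # total number of paired examples
acc_a = (n11 + n10) / N        # A is correct in cells n11 and n10
acc_b = (n11 + n01) / N        # B is correct in cells n11 and n01
b, c = n10, n01                # discordant counts

print(f"N = {N}, Acc(A) = {acc_a:.4f}, Acc(B) = {acc_b:.4f}")
print(f"Acc(A) - Acc(B) = {acc_a - acc_b:.4f}, (b - c) / N = {(b - c) / N:.4f}")
```

The two quantities on the last line agree because the concordant cell n11 cancels in the difference of the marginals, so the accuracy gap depends only on b, c, and N.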
Constraints & Assumptions
N = 10,000 examples, each scored by both classifiers (fully paired design, no missing predictions).
Outcome per example is binary: correct vs. incorrect.
Examples are assumed independent and identically distributed (no clustering / duplication unless stated).
Clarifying Questions
Are the 10,000 examples a held-out test set that was not used to tune either model's threshold or hyperparameters? (Tuning on the same set invalidates the p-values.)
Are examples truly independent, or are there grouped/correlated items (e.g., multiple sentences from one document, repeated users)?
Is the goal to test whether A and B differ, to quantify the difference, or both? (This drives whether we need a CI in addition to a p-value.)
For the multiple-comparison part: is the family of 10 comparisons confirmatory (control FWER) or exploratory/screening (control FDR)?
Part 1 — McNemar's test with continuity correction
Test H0: the two classifiers have equal error rates (equivalently, the discordant cells are symmetric, E[b] = E[c]). Using McNemar's test with continuity correction, report the intermediate quantities (b, c, |b−c|, b+c), the test statistic, the reference distribution, the p-value, and your conclusion.
What This Part Should Cover
Correct identification that only b and c enter the statistic (concordant cells dropped).
Correct continuity-corrected formula, df = 1, and a numerically sound p-value.
A clear reject/fail-to-reject decision tied to α, interpreted in terms of which model is more accurate.
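The Part 1 computation can be sketched with the standard library alone. This is an illustrative helper, not a reference implementation: the function name is made up for this example, and the p-value uses the identity that a chi-square variable with 1 degree of freedom has survival function erfc(sqrt(x/2)).

```python
import math

def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar statistic and its chi-square(1) p-value."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    # Only the discordant counts enter the statistic.
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

b, c = 300, 220
stat, p_value = mcnemar_cc(b, c)
print(f"b = {b}, c = {c}, |b-c| = {abs(b - c)}, b+c = {b + c}")
print(f"chi2 = {stat:.4f} on 1 df, p = {p_value:.2e}")
```

With b = 300 and c = 220 the statistic is (80 − 1)² / 520 = 6241 / 520, about 12.0, and the p-value is on the order of 5 × 10⁻⁴, so H0 is rejected at α = 0.05 in favor of A being more accurate.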
Part 2 — Exact conditional (binomial) test
Compute the exact two-sided p-value for the same H0 by conditioning on the total number of discordant pairs b+c. State the exact null distribution you condition on, then explain when you would prefer the exact test over the asymptotic χ² McNemar test.
Clarifying Questions for this Part
Which two-sided convention is expected: the doubled one-sided tail, 2×P(X≥b), or the "sum of probabilities ≤ the observed point mass" (Fisher) convention? The two can differ for asymmetric discrete nulls, but under the symmetric Binomial(b+c, 0.5) null used here they give the same p-value.
What This Part Should Cover
The conditioning argument that yields Binomial(b+c, 0.5) under H0.
A correct two-sided exact p-value (and awareness of the two-sided convention used).
A principled rule for preferring exact over asymptotic (small b+c, sparse cells, strict type-I control), plus mention of the mid-p variant as a less conservative compromise.
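A minimal sketch of the exact conditional test from Part 2 follows, using exact integer arithmetic so that the tail sum is not affected by floating-point underflow. The function name is illustrative. It uses the doubled one-sided tail convention, and the mid-p variant subtracts the observed point mass once, which is the same as counting half of it in each tail.

```python
from fractions import Fraction
from math import comb

def exact_mcnemar(b: int, c: int) -> tuple[float, float]:
    """Two-sided exact and mid-p values under X ~ Binomial(b + c, 1/2)."""
    n, k = b + c, max(b, c)
    denom = 2 ** n
    # P(X >= k) as an exact rational number.
    upper_tail = Fraction(sum(comb(n, i) for i in range(k, n + 1)), denom)
    # P(X = k), the observed point mass.
    point = Fraction(comb(n, k), denom)
    exact = min(Fraction(1), 2 * upper_tail)
    mid_p = min(Fraction(1), 2 * upper_tail - point)
    return float(exact), float(mid_p)

exact_p, mid_p = exact_mcnemar(300, 220)
print(f"exact two-sided p = {exact_p:.2e}, mid-p = {mid_p:.2e}")
```

With 520 discordant pairs the exact p-value lands close to the continuity-corrected asymptotic one, which is expected because the correction is designed to mimic the discrete binomial tail. The mid-p value is always somewhat smaller than the exact one.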
Part 3 — 95% confidence interval for the paired accuracy difference
A p-value alone does not convey effect size. Provide a 95% confidence interval for the paired accuracy difference δ=Acc(A)−Acc(B). State which method you use and why, and interpret the interval in plain language (percentage points).
What This Part Should Cover
A correct point estimate δ̂ = (b−c)/N and a CI from a named, appropriate method.
A correct variance/derivation for the chosen method (Wald variance formula, or Wilson on the discordant-pair proportion with the back-transform).
Plain-language interpretation in percentage points, and consistency with the Part 1/2 decision (the interval should exclude 0).
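One of the methods named above, the Wald interval for a paired difference, can be sketched as follows. The function name and the hard-coded normal quantile are assumptions of this example. The variance estimate used is Var(δ̂) ≈ [(b + c) − (b − c)² / N] / N², which follows from treating the four cells as multinomial counts.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard normal

def paired_diff_wald_ci(b: int, c: int, n: int, z: float = Z_975):
    """Point estimate and Wald CI for Acc(A) - Acc(B) on paired data."""
    delta = (b - c) / n
    # Standard error from the multinomial variance of (b - c) / n.
    se = math.sqrt((b + c) - (b - c) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se

delta, lo, hi = paired_diff_wald_ci(300, 220, 10_000)
print(f"delta = {100 * delta:.2f} pp, 95% CI = [{100 * lo:.2f}, {100 * hi:.2f}] pp")
```

For these counts the estimate is 0.8 percentage points with an interval of roughly 0.35 to 1.25 percentage points, which excludes 0 and so agrees with the test decisions. The Wald form is reasonable here because b + c is large; with few discordant pairs a score-based interval would be a safer choice.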
Part 4 — Assumptions, failure modes, and multiple testing
Discuss the assumptions behind McNemar's test, situations in which it is inappropriate, and how you would adjust the analysis for multiple testing if comparing A against 10 other models on the same test set.
What This Part Should Cover
An accurate list of McNemar's assumptions (pairing, binary outcome, independence across items, adequate b+c for the χ² approximation) and the failure mode each violation creates.
Concrete alternatives for each failure mode (two-proportion z-test for unpaired sets; Bowker's test for multi-class; clustered permutation or aggregate-to-cluster for correlated items; held-out split for leakage).
A principled multiplicity adjustment for m = 10 correlated comparisons: FWER vs. FDR framing, at least one named method for each (Holm–Bonferroni or Westfall–Young/maxT for FWER; Benjamini–Hochberg for FDR), and awareness that reuse of the same test set introduces correlation among p-values.
A clear mapping of FWER vs. FDR choice to confirmatory vs. exploratory intent.
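The two named step procedures can be sketched in a few lines. The p-values below are hypothetical placeholders for the 10 comparisons, not results from the problem, and the function names are illustrative.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure (controls FWER)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value to alpha / (m - rank).
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first failure; all larger p-values are retained
        reject[i] = True
    return reject

def bh_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank whose p-value is under alpha * rank / m.
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values for A vs. each of 10 other models.
p_vals = [0.0005, 0.004, 0.012, 0.019, 0.03, 0.045, 0.2, 0.4, 0.6, 0.9]
print("Holm rejections:", sum(holm_reject(p_vals)))
print("BH rejections:  ", sum(bh_reject(p_vals)))
```

On these placeholder values Holm rejects 2 hypotheses and Benjamini–Hochberg rejects 4, which illustrates the usual trade-off between the stricter FWER guarantee and the more permissive FDR guarantee. Holm remains valid under arbitrary dependence among the p-values, while Benjamini–Hochberg is guaranteed under independence or positive dependence; for arbitrary dependence the Benjamini–Yekutieli variant is the conservative alternative.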
What a Strong Answer Covers
Across all four parts, a strong answer demonstrates a coherent thread rather than four disconnected calculations:
Recognizes that the same discordant counts b, c drive the test statistic (Part 1), the exact test (Part 2), and the effect size (Part 3), and that all three should agree on the direction and significance.
Shows numerical fluency: correct continuity-corrected χ², a correct exact binomial tail, and an effect-size interval that excludes 0, all reported with appropriate precision.
Connects statistics to ML practice: paired evaluation on a held-out set, the danger of test-set leakage/tuning, and effect size vs. statistical significance for model-selection decisions.
Handles multiplicity correctly, choosing FWER vs. FDR by intent and acknowledging correlation among comparisons that share a dataset.
Follow-up Questions
The corrected χ² and the exact test give nearly identical p-values here. Construct a scenario (specific b, c) where they would diverge meaningfully, and say which you would trust.
Suppose the 10,000 examples are actually 2,000 documents with 5 sentences each. How does this change the validity of the test, and what would you do instead?
How would the analysis change if you cared about a metric other than accuracy (e.g., F1 or AUC), where outcomes are not a simple per-example correct/incorrect bit?
If A is only better by 0.8 percentage points but the cost of switching models is high, how would you frame the decision beyond the hypothesis test?