
Test classifier difference with McNemar's test

Last updated: Jun 25, 2026

Quick Overview

This question tests a data scientist's ability to apply McNemar's test for comparing paired classifiers, a core competency in statistical hypothesis testing and model evaluation. It assesses practical knowledge of contingency table analysis, exact vs. asymptotic tests, confidence interval construction, and multiple testing correction in machine learning contexts.

Company: Microsoft

Role: Data Scientist

Category: Statistics & Math

Difficulty: medium

Interview Round: Onsite

You evaluated two classifiers A and B on the SAME 10,000 labeled examples. The paired outcomes are: both correct n11 = 8,740; both wrong n00 = 740; A correct/B wrong n10 = 300; A wrong/B correct n01 = 220. Answer the following:

1) Using McNemar's test with continuity correction, compute the test statistic and p-value for H0: error rates are equal. Show intermediate numbers (b, c, |b−c|, b+c).
2) Compute the exact binomial p-value for the same H0 using b+c trials. Explain when you prefer the exact test.
3) Give a 95% confidence interval for the accuracy difference (A−B) on paired data; state which method you use and why.
4) Discuss assumptions, when McNemar's test is inappropriate, and how you'd adjust if you compare A against 10 models (multiple testing control).

Oct 13, 2025, 9:49 PM
Paired Comparison of Two Classifiers via McNemar's Test

You evaluated two classifiers, A and B, on the same 10,000 labeled examples. Because both models see identical inputs, the natural unit of analysis is the paired outcome per example. The paired results are summarized in a 2×2 contingency table:

|           | B correct          | B wrong        |
|-----------|--------------------|----------------|
| A correct | $n_{11} = 8{,}740$ | $n_{10} = 300$ |
| A wrong   | $n_{01} = 220$     | $n_{00} = 740$ |

Define the two discordant counts:

  • $b = n_{10} = 300$: A correct, B wrong
  • $c = n_{01} = 220$: A wrong, B correct

The concordant cells ($n_{11}$, $n_{00}$) are pairs where A and B agree; they carry no information about which model is better and are ignored by the test of interest.
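To see why the concordant cells drop out, here is a minimal sketch that recomputes each model's accuracy from the four cells. The variable names are illustrative, not part of the question; the counts are the ones given above.

```python
# Counts from the 2x2 table in the question.
n11, n10, n01, n00 = 8740, 300, 220, 740
N = n11 + n10 + n01 + n00

acc_a = (n11 + n10) / N   # A is correct in the "A correct" row
acc_b = (n11 + n01) / N   # B is correct in the "B correct" column

# The shared n11 term cancels, so the difference depends only on b and c.
diff = acc_a - acc_b
assert abs(diff - (n10 - n01) / N) < 1e-12

print(f"N={N}, Acc(A)={acc_a:.4f}, Acc(B)={acc_b:.4f}, diff={diff:.4f}")
```

Because $n_{11}$ appears in both accuracies, the difference reduces to $(n_{10} - n_{01})/N$, which is why only the discordant counts matter for the comparison.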

Constraints & Assumptions

  • $N = 10{,}000$ examples, each scored by both classifiers (fully paired design, no missing predictions).
  • Outcome per example is binary: correct vs. incorrect.
  • Examples are assumed independent and identically distributed (no clustering / duplication unless stated).
  • Significance level $\alpha = 0.05$ unless otherwise specified.

Clarifying Questions to Ask

  • Are the 10,000 examples a held-out test set that was not used to tune either model's threshold or hyperparameters? (Tuning on the same set invalidates the p-values.)
  • Are examples truly independent, or are there grouped/correlated items (e.g., multiple sentences from one document, repeated users)?
  • Is the goal to test whether A and B differ, to quantify the difference, or both? (This drives whether we need a CI in addition to a p-value.)
  • For the multiple-comparison part: is the family of 10 comparisons confirmatory (control FWER) or exploratory/screening (control FDR)?

Part 1 — McNemar's test with continuity correction

Test $H_0$: the two classifiers have equal error rates (equivalently, the discordant cells are symmetric, $E[b] = E[c]$). Using McNemar's test with continuity correction, report the intermediate quantities ($b$, $c$, $|b-c|$, $b+c$), the test statistic, the reference distribution, the p-value, and your conclusion.

What This Part Should Cover

  • Correct identification that only $b$ and $c$ enter the statistic (concordant cells dropped).
  • Correct continuity-corrected formula, df = 1, and a numerically sound p-value.
  • A clear reject/fail-to-reject decision tied to $\alpha$, interpreted in terms of which model is more accurate.
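One way to check the Part 1 arithmetic is sketched below, using the standard continuity-corrected statistic $\chi^2 = (|b-c|-1)^2/(b+c)$ with 1 degree of freedom. It relies only on the standard library: for 1 df, the chi-square survival function equals `erfc(sqrt(x/2))`. Variable names are illustrative.

```python
import math

b, c = 300, 220                      # discordant counts from the question

abs_diff = abs(b - c)                # |b - c| = 80
n_disc = b + c                       # b + c = 520

# Continuity-corrected McNemar statistic; chi-square with 1 df under H0.
chi2_cc = (abs_diff - 1) ** 2 / n_disc

# For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)), so SciPy is not required.
p_value = math.erfc(math.sqrt(chi2_cc / 2))

print(f"|b-c|={abs_diff}, b+c={n_disc}, chi2={chi2_cc:.4f}, p={p_value:.2e}")
```

The statistic is $79^2/520 \approx 12.0$, and the p-value is on the order of $5 \times 10^{-4}$, so $H_0$ is rejected at $\alpha = 0.05$. Since $b > c$, the direction favors A.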

Part 2 — Exact conditional (binomial) test

Compute the exact two-sided p-value for the same $H_0$ by conditioning on the total number of discordant pairs $b+c$. State the exact null distribution you condition on, then explain when you would prefer the exact test over the asymptotic $\chi^2$ McNemar test.

Clarifying Questions for this Part

  • Which two-sided convention is expected: the doubled one-sided tail ($2 \times P(X \ge b)$) or the "sum of probabilities $\le$ the observed point mass" (Fisher) convention? They can differ slightly for asymmetric discrete distributions, but they coincide under the symmetric $\text{Binomial}(b+c,\,0.5)$ null used here.

What This Part Should Cover

  • The conditioning argument that yields $\text{Binomial}(b+c,\,0.5)$ under $H_0$.
  • A correct two-sided exact p-value (and awareness of the two-sided convention used).
  • A principled rule for preferring exact over asymptotic (small $b+c$, sparse cells, strict type-I control), plus mention of the mid-$p$ variant as a less conservative compromise.
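The exact tail can be computed without any statistics library. The sketch below assumes the doubled one-sided convention and uses exact integer arithmetic (`math.comb` with `Fraction`) so the very small binomial terms do not underflow. The mid-p line is one common definition, which counts only half of the observed point mass in each tail.

```python
from fractions import Fraction
from math import comb

b, c = 300, 220
n = b + c                             # number of discordant pairs
k = max(b, c)                         # the more extreme discordant count

# Under H0, b | (b + c) ~ Binomial(n, 1/2).
upper_tail = Fraction(sum(comb(n, j) for j in range(k, n + 1)), 2 ** n)
point_mass = Fraction(comb(n, k), 2 ** n)

# Doubled one-sided tail, capped at 1.
p_exact = float(min(1, 2 * upper_tail))

# Mid-p: 2 * [P(X > k) + 0.5 * P(X = k)] = 2 * upper_tail - point_mass.
p_mid = float(min(1, 2 * upper_tail - point_mass))

print(f"exact p = {p_exact:.2e}, mid-p = {p_mid:.2e}")
```

With 520 discordant pairs, the exact p-value lands very close to the continuity-corrected chi-square result, and the mid-p value is slightly smaller by construction.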

Part 3 — 95% confidence interval for the paired accuracy difference

A p-value alone does not convey effect size. Provide a 95% confidence interval for the paired accuracy difference $\delta = \text{Acc}(A) - \text{Acc}(B)$. State which method you use and why, and interpret the interval in plain language (percentage points).

What This Part Should Cover

  • A correct point estimate $\hat\delta = (b-c)/N$ and a CI from a named, appropriate method.
  • A correct variance/derivation for the chosen method (Wald variance formula, or Wilson on the discordant-pair proportion with the back-transform).
  • Plain-language interpretation in percentage points, and consistency with the Part 1/2 decision (the interval should exclude 0).
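Both interval methods named above can be sketched in a few lines. The Wald version uses the paired-proportion variance $[(b+c) - (b-c)^2/N]/N^2$. The Wilson version builds an interval for $\theta = b/(b+c)$ and maps it back through $\delta = (2\theta - 1)(b+c)/N$, which treats $b+c$ as fixed (an assumption of this sketch). The constant `z` is the usual 97.5th normal percentile.

```python
import math

b, c, N = 300, 220, 10_000
z = 1.959964                          # 97.5th percentile of the standard normal

delta_hat = (b - c) / N               # point estimate of Acc(A) - Acc(B)

# Wald interval using the paired-proportion variance.
se = math.sqrt((b + c) - (b - c) ** 2 / N) / N
wald = (delta_hat - z * se, delta_hat + z * se)

# Wilson interval for theta = b / (b + c), then back-transform to delta.
n = b + c
theta = b / n
denom = 1 + z**2 / n
centre = (theta + z**2 / (2 * n)) / denom
half = z * math.sqrt(theta * (1 - theta) / n + z**2 / (4 * n**2)) / denom
wilson = tuple((2 * t - 1) * n / N for t in (centre - half, centre + half))

print(f"delta_hat = {delta_hat:.4f}")
print(f"Wald   95% CI: ({wald[0]:.4f}, {wald[1]:.4f})")
print(f"Wilson 95% CI: ({wilson[0]:.4f}, {wilson[1]:.4f})")
```

Both intervals run from roughly 0.35 to about 1.2 percentage points around the 0.8-point estimate, and both exclude 0, which is consistent with rejecting $H_0$ in the hypothesis tests.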

Part 4 — Assumptions, failure modes, and multiple testing

Discuss the assumptions behind McNemar's test, situations in which it is inappropriate, and how you would adjust the analysis for multiple testing if comparing A against 10 other models on the same test set.

What This Part Should Cover

  • An accurate list of McNemar's assumptions (pairing, binary outcome, independence across items, adequate $b+c$ for the $\chi^2$ approximation) and the failure mode each violation creates.
  • Concrete alternatives for each failure mode (two-proportion z-test for unpaired sets; Bowker's test for multi-class; clustered permutation or aggregate-to-cluster for correlated items; held-out split for leakage).
  • A principled multiplicity adjustment for $m = 10$ correlated comparisons: FWER vs. FDR framing, at least one named method for each (Holm–Bonferroni or Westfall–Young/maxT for FWER; Benjamini–Hochberg for FDR), and awareness that reuse of the same test set introduces correlation among p-values.
  • A clear mapping of FWER vs. FDR choice to confirmatory vs. exploratory intent.
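To make the FWER vs. FDR contrast concrete, here is a minimal sketch of Holm's step-down procedure and the Benjamini–Hochberg step-up procedure. The function names and the ten p-values are hypothetical, invented purely for illustration; they are not results from the question.

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: controls FWER under arbitrary dependence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (m - rank):
            break                      # stop at the first non-rejection
        reject[i] = True
    return reject


def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: controls FDR under independence or positive dependence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest 1-based rank k with p_(k) <= k * alpha / m; reject all up to it.
    k_max = max(
        (r + 1 for r, i in enumerate(order) if pvals[i] <= (r + 1) * alpha / m),
        default=0,
    )
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values for A vs. 10 competitors (made up for illustration).
pvals = [0.0005, 0.004, 0.009, 0.012, 0.020, 0.031, 0.048, 0.20, 0.45, 0.81]
print("Holm rejections:", sum(holm(pvals)))
print("BH rejections:  ", sum(benjamini_hochberg(pvals)))
```

On these made-up inputs, Holm rejects fewer hypotheses than BH, which reflects the stricter guarantee that FWER control provides. Neither procedure exploits the correlation induced by sharing one test set; a resampling method such as Westfall–Young/maxT would be needed for that.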

What a Strong Answer Covers

Across all four parts, a strong answer demonstrates a coherent thread rather than four disconnected calculations:

  • Recognizes that the same discordant counts $b, c$ drive the test statistic (Part 1), the exact test (Part 2), and the effect size (Part 3), and that all three should agree on direction and significance.
  • Shows numerical fluency: a correct continuity-corrected $\chi^2$, a correct exact binomial tail, and an effect-size interval that excludes 0, all reported with appropriate precision.
  • Connects statistics to ML practice: paired evaluation on a held-out set, the danger of test-set leakage/tuning, and effect size vs. statistical significance for model-selection decisions.
  • Handles multiplicity correctly, choosing FWER vs. FDR by intent and acknowledging correlation among comparisons that share a dataset.

Follow-up Questions

  • The corrected $\chi^2$ and the exact test give nearly identical p-values here. Construct a scenario (specific $b, c$) where they would diverge meaningfully, and say which you would trust.
  • Suppose the 10,000 examples are actually 2,000 documents with 5 sentences each. How does this change the validity of the test, and what would you do instead?
  • How would the analysis change if you cared about a metric other than accuracy (e.g., F1 or AUC), where outcomes are not a simple per-example correct/incorrect bit?
  • If A is only better by 0.8 percentage points but the cost of switching models is high, how would you frame the decision beyond the hypothesis test?
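For the first follow-up, a small helper makes it easy to explore how the asymptotic and exact p-values behave as the discordant counts shrink. This is a sketch that assumes $b + c > 0$ and the doubled one-sided convention; the small $(b, c)$ pairs are hypothetical and chosen only for illustration.

```python
import math
from math import comb


def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def mcnemar_pvalues(b, c):
    """Return (uncorrected chi2 p, corrected chi2 p, exact two-sided p)."""
    n = b + c
    p_plain = chi2_sf_1df((b - c) ** 2 / n)
    p_cc = chi2_sf_1df(max(abs(b - c) - 1, 0) ** 2 / n)
    tail = sum(comb(n, j) for j in range(max(b, c), n + 1)) / 2 ** n
    return p_plain, p_cc, min(1.0, 2 * tail)


# Hypothetical small-sample counts, plus the counts from the question.
for b, c in [(9, 2), (6, 0), (300, 220)]:
    plain, cc, exact = mcnemar_pvalues(b, c)
    print(f"b={b:3d} c={c:3d}  chi2={plain:.4f}  chi2_cc={cc:.4f}  exact={exact:.4f}")
```

With only a handful of discordant pairs, the uncorrected chi-square p-value can fall below 0.05 while the exact p-value stays above it, and the corrected and exact values no longer match as closely as they do at $b + c = 520$. In that regime the exact test is the safer reference because it does not rely on the large-sample approximation.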