
McNemar's Test: Overview & Extensions

Updated 9 January 2026
  • McNemar's test is a non-parametric matched-pair test used to compare paired binary data by focusing on discordant pairs.
  • The test applies chi-square approximations or exact binomial methods, often incorporating continuity corrections for small discordant counts.
  • Extensions using cross-validation, block regularization, and NMAR adjustments improve stability, power, and error control in practical applications.

McNemar's test is a non-parametric matched-pair test used to evaluate the difference between two correlated proportions, typically in the context of paired binary data or when comparing the error patterns of two classification algorithms on the same test set. The test assesses the null hypothesis that the two classifiers (or paired measurements) have identical marginal distributions, focusing on discordant outcomes where the two methods disagree. Over time, extensions and refinements—most notably addressing the instability due to single hold-out splits and complications from missing or dependent data—have broadened its applicability and rigor.

1. Classical Formulation and Hypotheses

Consider $n$ paired binary outcomes, observed as $(X_{1,i}, X_{2,i})$, $i=1,\dots,n$. The data are summarized in a $2\times2$ matched-pairs contingency table with entries $n_{11}$ (both correct), $n_{10}$ (method 1 correct, method 2 wrong), $n_{01}$ (method 1 wrong, method 2 correct), $n_{00}$ (both wrong). The null hypothesis states that the pairwise marginal probabilities are equal:

$$H_0: p_{10} = p_{01},$$

interpreted (in classifier comparison) as two algorithms having equivalent error rates on the data distribution. The McNemar test exclusively targets the “discordant pairs” ($n_{10}$ and $n_{01}$).
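As a minimal sketch of how the four table entries can be tallied from two classifiers' predictions on a shared test set, the following Python snippet may help. The function name `mcnemar_table` and the toy labels are illustrative assumptions, not part of the cited works.

```python
from collections import Counter

def mcnemar_table(y_true, pred1, pred2):
    """Return (n11, n10, n01, n00) for two methods scored on the same items."""
    if not (len(y_true) == len(pred1) == len(pred2)):
        raise ValueError("all inputs must have the same length")
    counts = Counter()
    for y, a, b in zip(y_true, pred1, pred2):
        # Key is (method 1 correct?, method 2 correct?)
        counts[(a == y, b == y)] += 1
    return (counts[(True, True)], counts[(True, False)],
            counts[(False, True)], counts[(False, False)])

# Toy data (hypothetical labels and predictions)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
pred1 = [1, 0, 1, 0, 0, 1, 1, 0]
pred2 = [1, 1, 1, 1, 0, 0, 1, 0]

n11, n10, n01, n00 = mcnemar_table(y_true, pred1, pred2)
print(n11, n10, n01, n00)  # 4 2 1 1
```

Only `n10` and `n01` enter the test; `n11` and `n00` affect the sample size but not the statistic.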

The classical test statistic (chi-square approximation) is:

$$\chi^2_{\mathrm{McN}} = \frac{(n_{10} - n_{01})^2}{n_{10} + n_{01}}$$

For small sample sizes, a continuity correction is standard:

$$\chi^2_{\mathrm{McN},\mathrm{corr}} = \frac{(|n_{10} - n_{01}| - 1)^2}{n_{10} + n_{01}}$$

Alternatively, the exact binomial (sign) test under the null treats $n_{10} \sim \operatorname{Binomial}(n_{10} + n_{01},\, 0.5)$, yielding a two-sided $p$-value of $2\min\{P(X \leq \min(n_{10},n_{01})),\, P(X \geq \max(n_{10},n_{01}))\}$, capped at 1, where $X$ follows the same binomial distribution (Yang et al., 2023, Mohammadi et al., 2017).
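The three variants above can be sketched in a few lines of standard-library Python. The function names are illustrative assumptions; the chi-square tail with one degree of freedom uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, and the discordant counts 15 and 5 are made up for the example.

```python
import math

def chi2_sf_1df(x):
    """Upper tail P(chi-square with 1 df > x), via the normal-tail identity."""
    return math.erfc(math.sqrt(x / 2.0))

def mcnemar_chi2(n10, n01, correction=False):
    """Chi-square McNemar statistic and p-value, optionally continuity-corrected."""
    discordant = n10 + n01
    if discordant == 0:
        raise ValueError("the test is undefined without discordant pairs")
    diff = abs(n10 - n01) - (1 if correction else 0)
    stat = diff ** 2 / discordant
    return stat, chi2_sf_1df(stat)

def mcnemar_exact(n10, n01):
    """Two-sided exact binomial p-value with success probability 0.5."""
    discordant = n10 + n01
    if discordant == 0:
        raise ValueError("the test is undefined without discordant pairs")
    k = min(n10, n01)
    lower_tail = sum(math.comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return min(1.0, 2.0 * lower_tail)  # cap at 1 when n10 == n01

stat, p = mcnemar_chi2(15, 5)
stat_c, p_c = mcnemar_chi2(15, 5, correction=True)
p_exact = mcnemar_exact(15, 5)
print(stat, stat_c)  # 5.0 4.05
print(p, p_c, p_exact)
```

With only 20 discordant pairs, the uncorrected, corrected, and exact $p$-values differ noticeably, which is the situation where the choice of variant matters.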

2. Statistical Properties and Modes of Application

The classical McNemar test assumes:

  • A fixed and sufficiently large test set, with independent observations given the test set.
  • No missing data, or at minimum, missingness completely at random.

In application to system comparison (e.g., ontology alignment), the contingency table can be constructed by several conventions: Method A (recall-focused, ignoring false positives) and Method B (accounting for false positives, akin to the $F$-measure). The selection affects the interpretation of the discordant counts and the sensitivity of the test to different error types (Mohammadi et al., 2017). For small discordant counts, the exact binomial test or the mid-$p$ correction is recommended because the chi-square approximation breaks down.
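For reference, a mid-$p$ variant can be sketched as follows. It assumes the usual definition, in which the two-sided exact $p$-value is reduced by the probability of the observed count, $2P(X \leq k) - P(X = k)$ with $k = \min(n_{10}, n_{01})$; the function name is a hypothetical choice.

```python
import math

def mcnemar_midp(n10, n01):
    """Two-sided mid-p value for McNemar's test (assumed standard definition)."""
    discordant = n10 + n01
    if discordant == 0:
        raise ValueError("the test is undefined without discordant pairs")
    k = min(n10, n01)

    def pmf(i):
        # Binomial(discordant, 0.5) probability mass at i
        return math.comb(discordant, i) / 2 ** discordant

    lower_tail = sum(pmf(i) for i in range(k + 1))
    return min(1.0, 2.0 * lower_tail - pmf(k))

print(mcnemar_midp(3, 1))  # 0.375
```

For the same counts the exact two-sided $p$-value is 0.625, so the mid-$p$ value is less conservative, as intended.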

3. Extensions via Cross-Validation and Block Regularization

A critical limitation of the hold-out McNemar test is its instability from split-to-split variance in the $n_{10}$ and $n_{01}$ counts. This can yield unreliable $p$-values and poor type I error control. To address this, repeated cross-validation, particularly the $5\times2$ cross-validated (CV) scheme, is recommended (Yang et al., 2023).

Block-Regularized $5\times2$ CV McNemar's Test

The $5\times2$ CV procedure consists of five independent two-fold splits, producing ten $2\times2$ tables. However, these tables are correlated due to shared samples in training sets. Block regularization enforces that training-set overlaps across folds are exactly $n/4$, leading to controlled covariances among error estimates.

An “effective” contingency table $C_e$ is constructed by aggregating the fold-level counts $\bar n_{ij}$ and rescaling by a variance inflation factor $\frac{10}{1+\rho_1+8\rho_2}$, where $\rho_1$ and $\rho_2$ are correlation coefficients. The effective discordant counts $n_{10,e}$, $n_{01,e}$ then form the basis of the McNemar statistic:

$$\chi^2_{\mathrm{5\times2\,BCV}} = \frac{(n_{10,e} - n_{01,e})^2}{n_{10,e} + n_{01,e}}$$

or its continuity-corrected version.
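The rescaling step can be illustrated with a rough sketch. Several things here are assumptions rather than details confirmed by the source: that "aggregating" means averaging the ten fold-level discordant counts, that $\rho_1$ and $\rho_2$ are supplied by the user (the values below are hypothetical), and that the statistic is referred to a chi-square distribution with one degree of freedom. The sketch does not construct the block-regularized splits themselves.

```python
import math

def effective_discordant_counts(tables, rho1, rho2):
    """Rescaled mean discordant counts from ten (n10, n01) fold-level pairs.

    Assumption: the aggregate is the mean over the ten folds.
    """
    if len(tables) != 10:
        raise ValueError("5x2 CV yields exactly ten fold-level tables")
    scale = 10.0 / (1.0 + rho1 + 8.0 * rho2)  # factor quoted in the text
    mean_n10 = sum(t[0] for t in tables) / 10
    mean_n01 = sum(t[1] for t in tables) / 10
    return scale * mean_n10, scale * mean_n01

def bcv_mcnemar(tables, rho1, rho2):
    """McNemar statistic on the effective counts, with an assumed chi2(1) p-value."""
    n10_e, n01_e = effective_discordant_counts(tables, rho1, rho2)
    if n10_e + n01_e == 0:
        raise ValueError("no effective discordant pairs")
    stat = (n10_e - n01_e) ** 2 / (n10_e + n01_e)
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical fold-level discordant counts and correlation values
tables = [(6, 2), (5, 3), (7, 2), (6, 4), (5, 2),
          (8, 3), (6, 3), (7, 4), (5, 1), (6, 2)]
stat, p = bcv_mcnemar(tables, rho1=0.5, rho2=0.25)
print(stat, p)
```

With both correlations set to zero the scale factor is 10, so the effective counts reduce to the plain sums over the ten tables; positive correlations shrink them and make the test less eager to reject.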

Empirical studies demonstrate that the $5\times2$ BCV McNemar test maintains type I error at the nominal level and has significantly improved power over single-split McNemar, especially with small to moderate differences in classifier performance. This approach outperforms naïve $K$-fold McNemar variants, which ignore fold correlations and may produce misleading $p$-values (Yang et al., 2023).

4. Handling Correlation, Equivalence, and Acceptance Regions

The behavior and size/power properties of McNemar’s test are sensitive to the correlation between binary marginal responses. In the context of equivalence (symmetry) testing for correlated bivariate binary data, the “margin test” provides an alternative, deriving a confidence region for $(n_{10}, n_{01})$ based on their joint distribution under a specified correlation $\rho$ (Huang, 2022).

The margin test’s acceptance region is generally larger (more conservative) than McNemar’s, especially at higher sample sizes and lower $\rho$. The type I error of McNemar’s test increases with $n$ and decreases with $\rho$, and it remains more liberal than that of the margin test. The acceptance regions of the two tests are similarly shaped: the margin test’s region encloses the McNemar band, so any point accepted by McNemar’s test is also accepted by the margin test, but not vice versa. The practical implication is that the margin test yields slightly lower power near the symmetry boundary but better type I error control (Huang, 2022).

5. McNemar’s Test with Nonignorable Missingness

The validity of the standard McNemar test presumes that data are missing at random. When missingness is nonignorable (NMAR), as when inclusion in the analysis depends on unobserved outcomes, direct application of McNemar’s test distorts both type I error and power.

Latent-variable models address this by explicitly modeling the missingness process. Three models of selection-inducing missingness are posited: Model (a), conditionally independent; Model (b), saturated logistic; and Model (c), pattern-mixture. Each parameterizes the joint distribution and the missing-data mechanism. Hypotheses are nested; a likelihood-ratio deviance statistic

G2(s)=G2(s,)G2()G^2(s|\ast) = G^2(s,\ast) - G^2(\ast)

is computed under models with and without the symmetry constraint $\pi_{12} = \pi_{21}$, yielding an asymptotic $\chi^2_1$ test (Tahata et al., 2023). Closed-form EM updates permit implementation. Simulation studies confirm this approach controls type I error and achieves higher power than naïve McNemar on complete cases in the presence of NMAR.
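The final step, turning the two deviances into a $p$-value, can be sketched as below. Fitting the latent-variable models by EM is not shown; the deviance values are hypothetical placeholders, and the function name is an assumption made for illustration.

```python
import math

def symmetry_lr_test(dev_constrained, dev_unconstrained):
    """Deviance difference and its asymptotic chi-square(1) p-value.

    dev_constrained:   G^2 of the model with the symmetry constraint
    dev_unconstrained: G^2 of the same model without the constraint
    """
    g2 = dev_constrained - dev_unconstrained
    if g2 < 0:
        raise ValueError("nested models imply a non-negative deviance difference")
    p_value = math.erfc(math.sqrt(g2 / 2.0))  # chi-square(1) upper tail
    return g2, p_value

# Hypothetical deviances from two fitted models
g2, p_value = symmetry_lr_test(12.4, 7.1)
print(g2, p_value)
```

A negative difference would indicate that one of the fits did not converge, since the constrained model cannot fit better than the unconstrained one.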

6. Multiple Comparisons and Graphical Interpretation

In studies comparing multiple classifiers or systems, multiple pairwise McNemar tests inflate the family-wise error rate. Classical procedures (Bonferroni, Holm, Hochberg) provide $p$-value correction for $n\times1$ or $n\times n$ comparison settings (Mohammadi et al., 2017). In the all-pairs setting, stepwise corrections (Nemenyi, Shaffer, Bergmann) improve power. Pairwise significance can be effectively visualized as a directed graph, with edges representing significant outperformance after correction.
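As one possible illustration, the sketch below applies Holm's step-down adjustment to a set of pairwise $p$-values and then lists the directed edges that remain significant. The system names, discordant counts, and raw $p$-values are hypothetical, and Holm is chosen only as an example of the corrections named above.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical pairwise results: (system 1, system 2, n10, n01, raw p-value)
comparisons = [
    ("A", "B", 15, 5, 0.010),
    ("A", "C", 9, 7, 0.800),
    ("B", "C", 4, 14, 0.020),
]

adjusted = holm_adjust([c[4] for c in comparisons])

alpha = 0.05
edges = []
for (s1, s2, n10, n01, _), p_adj in zip(comparisons, adjusted):
    if p_adj < alpha:
        # Edge points from the better system to the worse one
        edges.append((s1, s2) if n10 > n01 else (s2, s1))

print(edges)  # [('A', 'B'), ('C', 'B')]
```

Each edge reads "the first system significantly outperforms the second", which is the directed-graph view described above.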

7. Practical Considerations and Limitations

  • The McNemar test’s accuracy diminishes when discordant counts are small; exact or mid-$p$ approaches are then required.
  • Single hold-out splits are sensitive to sample allocation; cross-validation, particularly the block-regularized $5\times2$ BCV, is recommended for algorithm comparison.
  • When missing responses may be NMAR, latent-variable models coupled with likelihood-ratio tests supersede McNemar for valid inference.
  • When correlation or equivalence testing is of interest, the margin test offers more conservative error control compared to McNemar.

| Feature | Classical McNemar | $5\times2$ BCV McNemar | Margin Test |
|---|---|---|---|
| Test data | Fixed single hold-out | Cross-validated, 10 splits | Fixed, considers $\rho$ |
| Statistic | $\chi^2$ or exact binomial | Aggregated $\chi^2$ on effective table | Joint acceptance region |
| Handles correlation? | No | Yes ($\rho_1, \rho_2$) | Yes ($\rho$ parameter) |
| Handles NMAR missingness? | No | No | No |
| Acceptance region size | Smallest | Moderate | Largest |
| Type I error control | Good when $n_{10}+n_{01}$ is large | Good, stable | Most conservative |

The choice of McNemar variant depends on sample size, data structure, the need for stable inference (as in classifier algorithm comparison), and the presence of nonignorable missingness. Proper adjustment for multiple testing is essential in large-scale comparison studies. Results from recent work confirm the continued relevance and adaptability of McNemar’s test in modern applied statistical settings (Yang et al., 2023, Huang, 2022, Tahata et al., 2023, Mohammadi et al., 2017).
