McNemar's Test: Overview & Extensions
- McNemar's test is a non-parametric matched-pair test used to compare paired binary data by focusing on discordant pairs.
- The test applies chi-square approximations or exact binomial methods, often incorporating continuity corrections for small discordant counts.
- Extensions using cross-validation, block regularization, and NMAR adjustments improve stability, power, and error control in practical applications.
McNemar's test is a non-parametric matched-pair test used to evaluate the difference between two correlated proportions, typically in the context of paired binary data or when comparing the error patterns of two classification algorithms on the same test set. The test assesses the null hypothesis that the two classifiers (or paired measurements) have identical marginal distributions, focusing on discordant outcomes where the two methods disagree. Over time, extensions and refinements—most notably addressing the instability due to single hold-out splits and complications from missing or dependent data—have broadened its applicability and rigor.
1. Classical Formulation and Hypotheses
Consider paired binary outcomes, observed as $(x_i, y_i)$ for $i = 1, \dots, n$. The data are summarized in a matched-pairs contingency table with entries $n_{11}$ (both correct), $n_{10}$ (method 1 correct, method 2 wrong), $n_{01}$ (method 1 wrong, method 2 correct), and $n_{00}$ (both wrong). The null hypothesis states that the pairwise marginal probabilities are equal:
$$H_0: \; p_{10} = p_{01},$$
interpreted (in classifier comparison) as two algorithms having equivalent error rates on the data distribution. The McNemar test exclusively targets the “discordant pairs” ($n_{10}$ and $n_{01}$).
The classical test statistic (chi-square approximation) is:
$$\chi^2 = \frac{(n_{10} - n_{01})^2}{n_{10} + n_{01}},$$
which is asymptotically $\chi^2_1$ under $H_0$. For small sample sizes, a continuity correction is standard:
$$\chi^2_{cc} = \frac{(|n_{10} - n_{01}| - 1)^2}{n_{10} + n_{01}}.$$
Alternatively, the exact binomial (sign) test under the null treats $n_{10} \sim \mathrm{Binomial}(n_d, 1/2)$ with $n_d = n_{10} + n_{01}$, yielding a two-sided $p$-value
$$p = 2 \sum_{k=0}^{\min(n_{10},\, n_{01})} \binom{n_d}{k} \left(\tfrac{1}{2}\right)^{n_d}$$
(Yang et al., 2023, Mohammadi et al., 2017).
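The three variants above can be sketched in a few lines of Python; the function names are illustrative, not from any of the cited works:

```python
import math

def mcnemar_stat(n10, n01, corrected=False):
    """Chi-square form of McNemar's statistic on the discordant counts."""
    num = (abs(n10 - n01) - 1) ** 2 if corrected else (n10 - n01) ** 2
    return num / (n10 + n01)

def mcnemar_exact_p(n10, n01):
    """Two-sided exact binomial (sign-test) p-value: under H0 the smaller
    discordant count follows Binomial(n10 + n01, 1/2)."""
    nd = n10 + n01
    k = min(n10, n01)
    tail = sum(math.comb(nd, i) for i in range(k + 1)) * 0.5 ** nd
    return min(1.0, 2.0 * tail)
```

For example, with $n_{10} = 15$ and $n_{01} = 5$, the uncorrected statistic is $10^2/20 = 5.0$ and the exact two-sided $p$-value is roughly $0.041$.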
2. Statistical Properties and Modes of Application
The classical McNemar test assumes:
- A fixed and sufficiently large test set, with independent observations given the test set.
- No missing data, or at minimum, missingness completely at random.
In application to system comparison (e.g., ontology alignment), the contingency table can be constructed by several conventions—Method A (recall-focused, ignoring false positives) and Method B (accounting for false positives, akin to the $F$-measure). The choice affects the interpretation of the discordant counts and the sensitivity of the test to different error types (Mohammadi et al., 2017). For small discordant counts, the exact binomial or mid-$p$ corrections are recommended because the chi-square approximation breaks down.
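Whatever convention is chosen, the table itself is a straightforward tally over paired predictions. A minimal sketch (the helper name is hypothetical):

```python
def paired_counts(y_true, pred_a, pred_b):
    """Tally the four matched-pair cells from two classifiers' predictions:
    (both correct, only A correct, only B correct, both wrong)."""
    n11 = n10 = n01 = n00 = 0
    for y, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = a == y, b == y
        if a_ok and b_ok:
            n11 += 1
        elif a_ok:
            n10 += 1
        elif b_ok:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00
```

Only the middle two counts ($n_{10}$, $n_{01}$) enter the test statistic.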
3. Extensions via Cross-Validation and Block Regularization
A critical limitation of the hold-out McNemar test is its instability from split-to-split variance in the discordant counts. This can yield unreliable $p$-values and poor type I error control. To address this, repeated cross-validation, particularly the $5 \times 2$ cross-validated (CV) scheme, is recommended (Yang et al., 2023).
Block-Regularized CV McNemar's Test
The $5 \times 2$ CV procedure consists of five independent two-fold splits, producing ten contingency tables. However, these tables are correlated due to shared samples in the training sets. Block regularization enforces that the training-set overlaps across folds are exactly balanced, leading to controlled covariances among the error estimates.
An “effective” contingency table is constructed by aggregating the fold-level counts and rescaling by a variance inflation factor $c$, which depends on the between-fold correlation coefficients $\rho_1$ and $\rho_2$. The effective discordant counts $\tilde{n}_{10}$ and $\tilde{n}_{01}$ then form the basis of the McNemar statistic:
$$\chi^2_{\mathrm{BCV}} = \frac{(\tilde{n}_{10} - \tilde{n}_{01})^2}{\tilde{n}_{10} + \tilde{n}_{01}},$$
or its continuity-corrected version.
Empirical studies demonstrate that the $5 \times 2$ BCV McNemar test maintains type I error at the nominal level and has significantly improved power over the single-split McNemar test, especially with small to moderate differences in classifier performance. This approach outperforms naïve $k$-fold McNemar variants, which ignore fold correlations and may produce misleading $p$-values (Yang et al., 2023).
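The aggregation step can be sketched as follows. The exact expression for the inflation factor $c$ (a function of the between-fold correlations) is derived in Yang et al. (2023); here it is simply passed in as a parameter:

```python
def bcv_mcnemar_stat(fold_tables, c):
    """Aggregate the ten fold-level discordant-count pairs (n10, n01),
    rescale by the variance-inflation factor c, and apply the McNemar form.
    fold_tables: list of (n10, n01) tuples, one per CV fold."""
    n10_eff = sum(t[0] for t in fold_tables) / c
    n01_eff = sum(t[1] for t in fold_tables) / c
    return (n10_eff - n01_eff) ** 2 / (n10_eff + n01_eff)
```

With $c = 1$ this reduces to naïvely pooling the folds; $c > 1$ deflates the effective counts to account for their positive correlation.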
4. Handling Correlation, Equivalence, and Acceptance Regions
The behavior and size/power properties of McNemar’s test are sensitive to the correlation between the binary marginal responses. In the context of equivalence (symmetry) testing for correlated bivariate binary data, the “margin test” provides an alternative, deriving a joint acceptance region for the marginal probabilities based on their joint distribution under a specified correlation $\rho$ (Huang, 2022).
The margin test’s acceptance region is generally larger (more conservative) than McNemar’s, especially at higher sample sizes and lower correlation $\rho$. McNemar’s type I error varies with both $\rho$ and the sample size $n$, but remains more liberal than that of the margin test. The acceptance regions for both tests are similarly shaped; the margin test’s region wraps the McNemar band, so any McNemar-accepted point is accepted by the margin test, but not vice versa. The practical implication is that the margin test yields slightly lower power near the symmetry boundary but better type I error control (Huang, 2022).
5. McNemar’s Test with Nonignorable Missingness
Standard McNemar test validity presumes that data are missing at random. When missingness is nonignorable (NMAR)—as when inclusion in the analysis depends on unobserved outcomes—direct application of McNemar’s test biases both type I error and power.
Latent-variable models address this by explicitly modeling the missingness process. Three models of selection-inducing missingness (Model (a)—conditionally independent, Model (b)—saturated logistic, Model (c)—pattern-mixture) are posited, each parameterizing the joint distribution and the missing-data mechanism. The hypotheses are nested; a likelihood-ratio deviance statistic
$$G^2 = 2\,(\hat{\ell}_1 - \hat{\ell}_0)$$
is computed from the maximized log-likelihoods under the models without ($\hat{\ell}_1$) and with ($\hat{\ell}_0$) the symmetry constraint $p_{10} = p_{01}$, yielding an asymptotic $\chi^2$ test (Tahata et al., 2023). Closed-form EM updates permit implementation. Simulation studies confirm that this approach controls type I error and achieves higher power than naïve McNemar on complete cases in the presence of NMAR.
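Once the two maximized log-likelihoods are available (e.g., from the EM fits), the deviance test itself is a one-liner; for a single symmetry constraint the reference distribution is $\chi^2_1$, whose survival function equals $\mathrm{erfc}(\sqrt{x/2})$:

```python
import math

def lrt_pvalue(loglik_constrained, loglik_full):
    """Likelihood-ratio p-value for G^2 = 2 (l_full - l_constrained),
    referred to a chi-square distribution with 1 degree of freedom."""
    g2 = 2.0 * (loglik_full - loglik_constrained)
    # chi2(1 df) survival function via the complementary error function
    return math.erfc(math.sqrt(max(g2, 0.0) / 2.0))
```

A deviance of about $3.84$ recovers the familiar $p \approx 0.05$ threshold.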
6. Multiple Comparisons and Graphical Interpretation
In studies comparing multiple classifiers or systems, multiple pairwise McNemar tests inflate the family-wise error rate. Classical procedures (Bonferroni, Holm, Hochberg) provide $p$-value correction for one-versus-all and all-pairs comparison settings (Mohammadi et al., 2017). In the all-pairs setting, stepwise corrections (Nemenyi, Shaffer, Bergmann) improve power. Pairwise significance can be effectively visualized as a directed graph, with edges representing significant outperformance after correction.
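As a concrete instance of the corrections above, Holm’s step-down adjustment can be applied directly to the vector of pairwise McNemar $p$-values:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values for a family of pairwise tests.
    The k-th smallest p-value is multiplied by (m - k + 1), with a running
    maximum enforcing monotonicity, then capped at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```

Results are returned in the original input order, so they line up with the corresponding classifier pairs.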
7. Practical Considerations and Limitations
- The McNemar test’s accuracy diminishes when discordant counts are small; exact binomial or mid-$p$ approaches are then required.
- Single hold-out splits are sensitive to sample allocation; cross-validation, particularly the block-regularized $5 \times 2$ CV (BCV), is recommended for algorithm comparison.
- When missing responses may be NMAR, latent-variable models coupled with likelihood-ratio tests supersede McNemar for valid inference.
- When correlation or equivalence testing is of interest, the margin test offers more conservative error control compared to McNemar.
| Feature | Classical McNemar | 5×2 BCV McNemar | Margin Test |
|---|---|---|---|
| Test data | Fixed single hold-out | Cross-validated, 10 splits | Fixed, models correlation $\rho$ |
| Statistic | $\chi^2$ or exact binomial | $\chi^2$ on effective aggregated counts | Joint acceptance region |
| Handles correlation? | No | Yes (between-fold) | Yes ($\rho$ parameter) |
| Handles NMAR missingness? | No | No | No |
| Acceptance region size | Smallest | Moderate | Largest |
| Type I error control | Good (large $n$) | Good, stable | Most conservative |
The selection of a McNemar variant depends on sample size, data structure, the need for stable inference (e.g., in classifier comparison), and the presence of nonignorable missingness. Proper adjustment for multiple testing is essential in large-scale comparison studies. Results from recent work confirm the continued relevance and adaptability of McNemar’s test in modern applied statistical settings (Yang et al., 2023, Huang, 2022, Tahata et al., 2023, Mohammadi et al., 2017).