Nemenyi Mean-Ranks Test Overview
- The Nemenyi mean-ranks test is a non-parametric post-hoc procedure that ranks algorithms following a Friedman test to assess pairwise differences.
- It suffers from pool-dependence, where the significance between two algorithms can change solely by altering the pool of comparators.
- Alternative tests such as the sign test and Wilcoxon signed-rank test are recommended to ensure robust pairwise comparisons with controlled error rates.
The Nemenyi mean-ranks test is a widely adopted non-parametric post-hoc procedure for identifying pairwise differences among multiple algorithms after a Friedman test establishes a global effect. It is applied frequently in experimental machine learning and related fields, where algorithm performance is compared across multiple data sets, but that use has been called into serious question because its conclusions depend on the set of comparators. The test has a paradoxical property: the statistical conclusion regarding a fixed algorithm pair can shift solely with changes to the pool of included algorithms. These issues undermine its utility in scientific practice and motivate the use of alternative pairwise tests (Benavoli et al., 2015).
1. Test Definition and Workflow
The Nemenyi mean-ranks procedure follows the Friedman test, which checks the global null hypothesis that all algorithms perform equally across data sets.
- Ranking per data set: For each data set $i = 1, \dots, n$, the $m$ algorithms are ranked according to their performance, assigning rank 1 to the best and $m$ to the worst. Let $r_{ij}$ denote the rank of algorithm $j$ on data set $i$.
- Mean ranks: The mean rank for algorithm $j$ is
$$\bar{R}_j = \frac{1}{n} \sum_{i=1}^{n} r_{ij}.$$
Under the null, $E[\bar{R}_j] = (m+1)/2$ for all $j$.
- Critical difference (CD): The difference in mean ranks between algorithms $j$ and $k$, $|\bar{R}_j - \bar{R}_k|$, is compared to a critical threshold,
$$\mathrm{CD} = q_\alpha \sqrt{\frac{m(m+1)}{6n}},$$
where $q_\alpha$ is a quantile of the Studentized range distribution (divided by $\sqrt{2}$) at level $\alpha$ for $m$ treatments. For large samples, a normal approximation is typical, with Bonferroni correction to control family-wise error.
- Decision rule: Algorithms $j$ and $k$ are significantly different at level $\alpha$ if $|\bar{R}_j - \bar{R}_k| > \mathrm{CD}$.
- Multiplicity correction: Since $m(m-1)/2$ pairwise comparisons are performed, significance thresholds are routinely adjusted using $\alpha / [m(m-1)/2]$ (Bonferroni) or sequential methods such as Holm’s.
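The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes higher scores are better, uses the large-sample normal approximation with a Bonferroni correction in place of Studentized range tables, and the function names (`mean_ranks`, `critical_difference`, `significant_pairs`) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_ranks(scores):
    """scores[i][j] is the performance of algorithm j on data set i
    (assumption: higher is better). Returns the mean rank of each
    algorithm, with rank 1 = best and tied scores sharing the average rank."""
    n, m = len(scores), len(scores[0])
    totals = [0.0] * m
    for row in scores:
        for j in range(m):
            better = sum(1 for v in row if v > row[j])
            equal = sum(1 for v in row if v == row[j])  # counts j itself
            totals[j] += better + (equal + 1) / 2
    return [t / n for t in totals]


def critical_difference(m, n, alpha=0.05):
    """Normal-approximation threshold, Bonferroni-corrected over the
    m(m-1)/2 pairwise comparisons (two-sided)."""
    n_pairs = m * (m - 1) // 2
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))
    return z * sqrt(m * (m + 1) / (6 * n))


def significant_pairs(scores, alpha=0.05):
    """Index pairs (j, k) whose mean-rank gap exceeds the critical difference."""
    n, m = len(scores), len(scores[0])
    ranks = mean_ranks(scores)
    cd = critical_difference(m, n, alpha)
    return [(j, k) for j in range(m) for k in range(j + 1, m)
            if abs(ranks[j] - ranks[k]) > cd]


# Made-up scores for three algorithms on three data sets.
example = [[0.9, 0.8, 0.7], [0.6, 0.9, 0.5], [0.8, 0.7, 0.6]]
print(mean_ranks(example), significant_pairs(example))
```

Note that both the ranks and the threshold depend on `m`, the size of the whole pool, which is the source of the behavior discussed in the next section.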
2. Pool-Dependence and Paradoxical Behavior
A principal flaw of the mean-ranks test is its pool-dependence: the statistical judgment about any algorithm pair is inherently affected by the set of other algorithms included.
Illustrative Example
Consider algorithms A, B, C, D, E evaluated on $n = 20$ data sets, with A and B exhibiting equal “head-to-head” outcomes: each algorithm wins on 10 data sets.
- A vs. B only ($m = 2$): All pairwise tests (mean-ranks, sign, Wilcoxon) find no significant difference.
- A, B, C, D, E included ($m = 5$): The mean ranks of A and B shift, because C, D, and E can fall between them on some data sets. At the Bonferroni-corrected level for the 10 pairwise comparisons, the gap $|\bar{R}_A - \bar{R}_B|$ can then exceed the critical difference, so the test deems A and B significantly different.
This outcome reverses without any change to the values for A and B themselves; merely adding or removing competitors causes a shift from non-significance to significance.
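The effect can be reproduced with a small constructed data set. The rankings below are hypothetical, chosen only to trigger the paradox, and are not the figures from Benavoli et al. (2015). A beats B on 10 of 20 data sets and loses on the other 10; the mean-ranks statistic is evaluated with the normal approximation, and `ranks_in_pool` and `mean_rank_p` are illustrative helper names.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical rank orders, best to worst, on 20 data sets.
# A beats B on the first 10 and loses to B on the last 10.
orders = ["ACDEB"] * 10 + ["BACDE"] * 10


def ranks_in_pool(order, pool):
    """Rank (1 = best) of each pooled algorithm after dropping the others."""
    kept = [a for a in order if a in pool]
    return {a: kept.index(a) + 1 for a in kept}


def mean_rank_p(orders, pool, a="A", b="B"):
    """Two-sided p-value of the mean-ranks z statistic for a vs. b
    (normal approximation, before any multiplicity correction)."""
    n, m = len(orders), len(pool)
    ranks = [ranks_in_pool(o, pool) for o in orders]
    diff = sum(r[a] - r[b] for r in ranks) / n
    z = diff / sqrt(m * (m + 1) / (6 * n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_pair = mean_rank_p(orders, "AB")     # only A and B in the pool
p_pool = mean_rank_p(orders, "ABCDE")  # C, D, E added to the pool
print(p_pair, p_pool, p_pool < 0.05 / 10)
```

With only A and B the mean ranks coincide and the p-value is 1. With C, D, and E added, the same head-to-head record gives a mean-rank gap of 1.5 and a p-value below the Bonferroni threshold of 0.05/10, even though nothing about A versus B has changed.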
Theoretical Implications
- The mean rank for an algorithm is a function of its performance relative to all algorithms present, not just the pair under direct comparison.
- The test does not satisfy the requirement that pairwise judgment be invariant to irrelevant alternatives.
- Strategic selection or omission of additional algorithms can inflate or deflate observed rank gaps, leading to paradoxical conclusions (Benavoli et al., 2015).
3. Consequences for Scientific Inference
The pool-dependence paradox gives rise to several critical concerns:
- Type I error inflation: When two equivalent algorithms are compared, adding others can artificially increase the mean-rank difference and produce spurious statistical significance.
- Loss of power: For genuinely different algorithms among strong or weak comparators, the increased variance with larger $m$ lowers the test’s sensitivity.
- Inconsistent conclusions: Disparate algorithm pools used by different researchers can yield contradictory statements about which algorithm is better.
These deficiencies undermine reproducibility, comparability, and logical integrity in algorithm evaluation. The conclusion of the critical analysis is that the Nemenyi test is unreliable for inferential purposes in machine learning and related empirical disciplines (Benavoli et al., 2015).
4. Recommended Pairwise Alternatives
Post-hoc multiple comparisons should use procedures whose outcome for each pair depends solely on their paired results, not the rest of the pool. Two recommended non-parametric alternatives are:
Sign Test
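- Mechanism: Counts the number of data sets where A outperforms B:
$$W = \sum_{i=1}^{n} \mathbb{1}\left[x_{iA} > x_{iB}\right],$$
where $x_{iA}$ and $x_{iB}$ are the performances of A and B on data set $i$, and ties are discarded or split.
- Null distribution: $W \sim \mathrm{Binomial}(n, 1/2)$ under $H_0$ (no difference).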
- P-value: Computed exactly or using the normal approximation.
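A minimal sketch of the exact two-sided sign test follows. It assumes ties have already been discarded, so the input is just the two win counts; the function name `sign_test` is illustrative.

```python
from math import comb


def sign_test(wins_a, wins_b):
    """Exact two-sided sign test p-value.
    Assumption: tied data sets were discarded before counting wins."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins under Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail, capped at 1.
    return min(1.0, 2 * tail)


print(sign_test(10, 10))  # balanced record
print(sign_test(15, 5))   # lopsided record
```

The result depends only on the two win counts, so adding or removing other algorithms from the study cannot change it.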
Wilcoxon Signed-Rank Test
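- Mechanism: Computes the difference $d_i = x_{iA} - x_{iB}$ between the performances of A and B for each data set $i$, ignoring ties ($d_i = 0$). Ranks the $|d_i|$ and calculates $W^+ = \sum_{i:\, d_i > 0} \operatorname{rank}(|d_i|)$ (sum of positive ranks).
- Null distribution: Assumes differences are symmetric about zero, with known distribution for $W^+$.
- P-value: Uses the exact distribution or a normal approximation for moderate $n$.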
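The following sketch computes the signed-rank statistic and a two-sided p-value from the normal approximation. It is illustrative only: zero differences are dropped, tied magnitudes share the average rank, no tie or continuity correction is applied to the variance, and the function name is hypothetical. For small samples the exact distribution would be preferable.

```python
from math import sqrt
from statistics import NormalDist


def wilcoxon_signed_rank(x_a, x_b):
    """Return (W+, two-sided p) using the normal approximation.
    Assumptions: zero differences are dropped; no tie correction."""
    d = [a - b for a, b in zip(x_a, x_b) if a != b]
    n = len(d)
    if n == 0:
        return 0.0, 1.0
    # Rank |d_i| from smallest to largest, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up paired scores: A is ahead on 8 data sets, behind on the 2 largest gaps.
a = [1, 2, 3, 4, 5, 6, 7, 8, 0, 0]
b = [0, 0, 0, 0, 0, 0, 0, 0, 9, 10]
print(wilcoxon_signed_rank(a, b))
```

Unlike the sign test, the statistic weighs each win by the size of the difference, which is why the symmetry assumption matters.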
Both tests deliver p-values that depend only on the paired outcomes. For multiple comparisons, controlling family-wise error is handled via Bonferroni, Holm, or sequential step-down procedures (Benavoli et al., 2015).
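As an illustration of the multiplicity step, here is a short sketch of Holm's step-down procedure applied to a list of pairwise p-values. The function name `holm` and the example p-values are made up for demonstration.

```python
def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: returns one reject flag per hypothesis,
    in the same order as the input p-values."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for step, idx in enumerate(order):
        # The smallest p-value is compared to alpha/k, the next to alpha/(k-1), ...
        if p_values[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


print(holm([0.012, 0.02, 0.013]))
print(holm([0.01, 0.04, 0.03, 0.005]))
```

In the first example plain Bonferroni (threshold 0.05/3) would reject only two of the three hypotheses, while Holm rejects all three at the same family-wise level.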
5. Workflow and Best Practices
Empirical comparisons involving multiple algorithms should adopt the following structure:
- Initial global test: Perform the Friedman test to assess whether a global difference exists among all $m$ algorithms.
- Post-hoc pairwise analysis: If the Friedman test is significant, use only the Wilcoxon signed-rank test (if symmetric differences are plausible) or the sign test (if robustness is preferred).
- Multiplicity control: Adjust significance thresholds using Bonferroni or Holm procedures.
- Reporting: Provide exact p-values and explicitly state the multiple comparison method applied.
- Bayesian alternatives: For more informative inference, Bayesian analogues to the signed-rank or sign test may be used to quantify evidence for differences or equivalence, avoiding pitfalls of frequentist p-values.
By eschewing the Nemenyi mean-ranks test in favor of pairwise-only inferential procedures, post-hoc analysis of algorithms ensures that each decision for A vs. B is supported only by direct paired evidence, immune to the inclusion or exclusion of other algorithms in the experimental pool (Benavoli et al., 2015).
6. Summary Table: Post-hoc Methods
| Method | Pool Dependence | Pairwise Only | Assumption on Differences |
|---|---|---|---|
| Nemenyi Mean-Ranks | Yes | No | None |
| Sign Test | No | Yes | None (robust) |
| Wilcoxon Signed-Rank Test | No | Yes | Symmetry of differences |
This table summarizes the defining properties that distinguish pool-agnostic tests from the Nemenyi mean-ranks procedure, highlighting criteria critical for logically sound post-hoc inference (Benavoli et al., 2015).