- The paper demonstrates that post-hoc tests based on mean-ranks can yield inconsistent significance decisions for a given pair of algorithms, because the decision is influenced by the other algorithms included in the comparison.
- Simulated examples show how this contextual dependency can distort the apparent difference between two algorithms, leading to paradoxical evaluation results.
- Alternative tests such as the Wilcoxon signed-rank and Bayesian methods are proposed for more direct and reliable pairwise algorithm comparisons.
An Evaluation of Post-Hoc Tests Based on Mean-Ranks in Machine Learning
The paper by Benavoli, Corani, and Mangili critiques the applicability of post-hoc tests based on mean-ranks, particularly in the context of machine learning and related fields. The authors argue that the mean-ranks approach, typically employed following a Friedman test, is flawed because its outcome for any pair of algorithms depends on the pool of algorithms included in the experiment. This critique is relevant beyond machine learning, for instance in medicine and psychology, where analogous multiple-comparison procedures are routinely applied.
Core Argument and Illustrative Examples
The paper's central argument is that mean-ranks tests can lead to paradoxical outcomes. Specifically, the significance of the difference between two algorithms can vary depending on the presence or absence of other algorithms in the comparison pool. For instance, the paper presents cases where a mean-ranks test declares a difference significant when one set of algorithms is included in the comparison, but not when another set is. This inconsistency is problematic because it means the test's verdict is tied to the ensemble of algorithms under comparison rather than to the actual pairwise differences.
The paper offers detailed examples to substantiate these claims. The authors simulate data under various conditions to show how the mean-ranks test can produce contradictory results depending on the algorithmic context. For example, the apparent difference between two algorithms A and B with largely overlapping performance distributions is shown to be inflated or suppressed by the ranks contributed by the additional control algorithms in the evaluation set.
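The mechanism behind this distortion can be illustrated with a small simulation (a minimal sketch, not the paper's original experiment; the accuracy values, pool sizes, and helper names below are hypothetical). Algorithms A and B have fixed results on every dataset, yet their mean-rank gap, the quantity the post-hoc test is built on, changes sign depending on where the additional algorithms happen to fall:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
n_datasets = 50

# Hypothetical accuracies: A beats B by a small margin on roughly 60% of datasets.
acc_a = rng.uniform(0.70, 0.90, n_datasets)
acc_b = acc_a + np.where(rng.random(n_datasets) < 0.6, -0.02, 0.02)

def mean_rank_gap(scores):
    """Mean rank of column 1 (B) minus mean rank of column 0 (A);
    rank 1 = best accuracy on each dataset, so a positive gap means A ranks better."""
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
    return ranks[:, 1].mean() - ranks[:, 0].mean()

def extra_pool(between_when_a_wins, n_extra=5):
    """Extra algorithms that land between A and B only on the selected datasets
    and are clearly worse than both everywhere else."""
    mask = (acc_a > acc_b) if between_when_a_wins else (acc_b > acc_a)
    cols = []
    for _ in range(n_extra):
        between = (acc_a + acc_b) / 2 + rng.normal(0, 0.002, n_datasets)
        worse = np.minimum(acc_a, acc_b) - 0.05
        cols.append(np.where(mask, between, worse))
    return np.column_stack(cols)

pair_only = np.column_stack([acc_a, acc_b])
pool_1 = np.column_stack([acc_a, acc_b, extra_pool(between_when_a_wins=True)])
pool_2 = np.column_stack([acc_a, acc_b, extra_pool(between_when_a_wins=False)])

print("A vs B alone:             gap =", round(mean_rank_gap(pair_only), 2))
print("A vs B with extra pool 1: gap =", round(mean_rank_gap(pool_1), 2))
print("A vs B with extra pool 2: gap =", round(mean_rank_gap(pool_2), 2))
```

On identical A and B results, the gap is mildly positive for the pair in isolation, strongly positive with one extra pool, and negative with the other. This is precisely the context dependence the paper objects to: a test that thresholds this gap can reach different conclusions about the same pair of algorithms.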
Recommendations and Alternative Approaches
To mitigate the identified issues, the authors recommend statistical tests that do not depend on the other algorithms in the comparison, specifically the Wilcoxon signed-rank test and the sign test. These tests compare two algorithms independently of the others, eliminating the contextual dependence introduced by the mean-ranks test. The authors also propose Bayesian methods for hypothesis testing, which sidestep well-known drawbacks of classical null-hypothesis significance testing.
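As a concrete illustration, the two recommended frequentist tests reduce to a few lines with scipy (the paired accuracy arrays are hypothetical; this is a sketch of the recommended comparisons, not code from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon, binomtest

rng = np.random.default_rng(1)
n_datasets = 30

# Hypothetical paired per-dataset accuracies for the two algorithms of interest.
acc_a = rng.uniform(0.70, 0.90, n_datasets)
acc_b = acc_a - rng.normal(0.01, 0.02, n_datasets)

# Wilcoxon signed-rank test: uses only the paired differences between A and B,
# so no third algorithm can change its outcome.
w_stat, w_p = wilcoxon(acc_a, acc_b)

# Sign test: a binomial test on how often A beats B, ignoring exact ties.
wins = int(np.sum(acc_a > acc_b))
ties = int(np.sum(acc_a == acc_b))
sign_p = binomtest(wins, n=n_datasets - ties, p=0.5).pvalue

print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {w_p:.4f}")
print(f"Sign test:            {wins} wins, p = {sign_p:.4f}")
```

When many pairs are compared this way, a standard multiple-comparison correction (e.g., Holm's procedure) would typically still be applied; the paper's objection concerns which pairwise statistic to use, not the correction step. Bayesian counterparts of these tests are available in packages such as baycomp, though that lies beyond this minimal sketch.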
Implications for Practice and Theory
This paper holds practical implications for researchers in machine learning and related disciplines, advocating a methodological shift in how algorithmic performance is assessed. The recommendation to eschew mean-ranks tests in favor of post-hoc analyses that are unaffected by the other algorithms in the pool is a concrete step toward more robust statistical conclusions. Theoretically, it refocuses algorithm comparison on direct pairwise performance rather than on ranks that are contingent on the composition of the ensemble.
Future Directions
While this critique provides a compelling case against mean-ranks post-hoc tests, it opens avenues for further research into the development and validation of new statistical methodologies for algorithm comparison. Future work could explore the practical application of Bayesian testing frameworks across a broader range of real-world datasets, or further quantify how the different tests behave with respect to Type I error rates in large-scale experimental settings.
In conclusion, the paper by Benavoli et al. provides a thorough examination of the limitations inherent in mean-ranks-based post-hoc tests and offers viable alternatives for researchers to consider. Transitioning to these more dependable methods improves the accuracy and reliability of algorithmic evaluations and reinforces the integrity of comparative research.