
Friedman & Nemenyi Tests Overview

Updated 12 February 2026
  • The Friedman test is a nonparametric omnibus procedure that ranks and compares multiple treatments across datasets to test for performance equivalence.
  • The Nemenyi mean-ranks test serves as a post-hoc method to compare all pairs following Friedman’s test but can yield paradoxical results due to pool-dependence.
  • Alternatives like pairwise tests and S-plots offer pool-independent comparisons and controlled Type I error, enhancing inference clarity.

The Friedman and Nemenyi tests are core nonparametric methodologies for analyzing and comparing multiple treatments or algorithms under a randomized complete block design. Their primary application is the statistical comparison of several methods across multiple data sets, with particular relevance in areas such as machine learning, psychology, and medicine. After an initial omnibus hypothesis test (Friedman), post-hoc analyses such as the Nemenyi mean-ranks test are commonly employed to determine sources of significant differences. However, the dependence of Nemenyi post-hoc inferences on the entire set of treatments and accompanying paradoxes have recently prompted scrutiny and recommendations for alternative pairwise procedures.

1. The Friedman Test: Omnibus Nonparametric Comparison

The Friedman test is used to detect differences among $m$ algorithms (treatments) evaluated on $N$ datasets (blocks). For each dataset $j$, the outcomes $X_{ij}$ ($i = 1, \dots, m$) are ranked, yielding $R_{ij}$; ranks replace the raw performance scores, with ties assigned average ranks. Each algorithm's rank sum is $R_i = \sum_{j=1}^N R_{ij}$, with mean rank $\overline{R}_i = R_i/N$. The null hypothesis $H_0$ asserts that all $m$ algorithms perform equivalently, i.e., $X_{1j} \overset{d}{=} X_{2j} \overset{d}{=} \cdots \overset{d}{=} X_{mj}$ for all $j$.

The Friedman statistic is

$$\chi^2_F = \frac{12N}{m(m+1)} \sum_{i=1}^m \overline{R}_i^2 - 3N(m+1).$$

Alternatively, in terms of the rank sums $R_i$,

$$S = \frac{12}{N m (m+1)} \sum_{i=1}^m \left( R_i - \frac{N(m+1)}{2} \right)^2.$$

For large $N$, $\chi^2_F$ is approximately $\chi^2$-distributed with $m-1$ degrees of freedom. In practice, the test serves as a robust, nonparametric alternative to repeated-measures ANOVA for arbitrary, not necessarily normal, data (Benavoli et al., 2015, Elamir, 2022).
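The statistic can be computed directly from a performance matrix and cross-checked against SciPy's implementation. A minimal sketch (the score matrix here is synthetic, purely for illustration; with tie-free data the two computations agree exactly):

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

def friedman_statistic(X):
    """Friedman chi-square for an (N datasets) x (m algorithms) score matrix.

    Ranks within each dataset (row), ties receiving average ranks, then applies
    chi2_F = 12N/(m(m+1)) * sum(Rbar_i^2) - 3N(m+1).
    """
    N, m = X.shape
    ranks = np.apply_along_axis(rankdata, 1, X)   # rank within each block
    mean_ranks = ranks.mean(axis=0)               # Rbar_i per algorithm
    chi2 = 12 * N / (m * (m + 1)) * np.sum(mean_ranks ** 2) - 3 * N * (m + 1)
    return chi2, mean_ranks

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                      # 30 datasets, 4 algorithms (synthetic)
chi2, mean_ranks = friedman_statistic(X)
chi2_scipy, _p = friedmanchisquare(*X.T)          # scipy expects one sample per treatment
```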

2. Nemenyi Mean-Ranks Post-hoc Test

Upon rejection of the Friedman test’s omnibus null, the Nemenyi test is traditionally used for all $C = m(m-1)/2$ algorithm pairs. For any algorithms $i$, $j$, the mean-rank difference $|\overline{R}_i - \overline{R}_j|$ forms the test statistic, evaluated against a critical difference (CD):

$$CD = q_{\alpha;m,\infty} \sqrt{\frac{m(m+1)}{6N}},$$

where $q_{\alpha;m,\infty}$ is the upper-$\alpha$ quantile of the Studentized range distribution for $m$ treatments with infinite degrees of freedom, divided by $\sqrt{2}$ (the convention used in tabulated Nemenyi critical values). Algorithms $i$ and $j$ are declared significantly different at family-wise level $\alpha$ if $|\overline{R}_i - \overline{R}_j| > CD$. This controls the family-wise error rate across all $m(m-1)/2$ comparisons and is operationally analogous to the Tukey-Kramer procedure for parametric ANOVA (Benavoli et al., 2015).
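The CD can be computed from SciPy's Studentized range distribution (`scipy.stats.studentized_range`, SciPy ≥ 1.7; a large finite `df` stands in for infinity). Note that tabulated Nemenyi critical values are the raw Studentized-range quantiles divided by $\sqrt{2}$, so the sketch below applies that division:

```python
import numpy as np
from scipy.stats import studentized_range

def nemenyi_cd(m, N, alpha=0.05):
    """Critical difference for the Nemenyi test with m algorithms and N datasets.

    q is the upper-alpha Studentized range quantile (df ~ infinity), divided by
    sqrt(2) to match the CD = q * sqrt(m(m+1)/(6N)) convention.
    """
    q = studentized_range.ppf(1 - alpha, m, 1e6) / np.sqrt(2)  # large df ~ infinity
    return q * np.sqrt(m * (m + 1) / (6 * N))

cd = nemenyi_cd(m=10, N=50)
```

For $m = 10$, $N = 50$, $\alpha = 0.05$ this gives a CD of roughly 1.9 mean-rank units: any pair whose mean ranks differ by more than that is flagged.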

3. Critique of the Mean-Ranks Test and Its Dependence on Algorithm Pool

A foundational critique, detailed by Benavoli, Corani, and Mangili (2016) (Benavoli et al., 2015), is that Nemenyi’s mean-ranks test produces a decision for any pair $(A, B)$ that is contingent on the presence, absence, and relative ordering of all other algorithms involved in the ranking. This can lead to paradoxical scenarios:

  • In an experiment where $A$ and $B$ each win on half the cases, two-algorithm tests (sign, Wilcoxon, t-test) yield nonsignificance. However, introducing additional poorly performing algorithms can inflate $|\overline{R}_A - \overline{R}_B|$ sufficiently to exceed $CD$ and declare a significant difference.
  • On real-world datasets, the decision on a given pair can flip between significant and non-significant solely due to the composition of the algorithm pool. For example, in UCI data, the classifier pair $C_2$ vs. $C_4$ was shown to change significance status depending on which other classifiers were included [(Benavoli et al., 2015), Table 4].
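The inflation paradox can be reproduced numerically. The construction below is a stylized illustration in the spirit of the critique, not the paper's exact example: $A$ and $B$ each win half the datasets, so a head-to-head sign test is exactly indifferent, yet filler algorithms that land between them on half the cases push the mean-rank gap past the CD:

```python
import numpy as np
from scipy.stats import studentized_range

m, N, alpha = 10, 50, 0.05          # 10 algorithms total, 50 datasets (illustrative)

# Stylized scenario: on half the datasets A is ranked 1 and B is ranked m
# (the fillers fall between them); on the other half B is ranked 1 and A is
# ranked 2 (the fillers fall below both). Head-to-head, each wins 25 times.
rank_A = np.mean([1] * (N // 2) + [2] * (N // 2))   # mean rank of A = 1.5
rank_B = np.mean([m] * (N // 2) + [1] * (N // 2))   # mean rank of B = (m+1)/2
diff = abs(rank_A - rank_B)                          # (m-2)/2 = 4.0 for m=10

# Nemenyi CD: Studentized-range quantile / sqrt(2); large df approximates infinity
q = studentized_range.ppf(1 - alpha, m, 1e6) / np.sqrt(2)
cd = q * np.sqrt(m * (m + 1) / (6 * N))

nemenyi_significant = diff > cd     # True: the pool manufactures "significance"
```

A sign test on the same 25–25 record cannot reject, so the Nemenyi verdict here is driven entirely by the filler algorithms.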

This pool-dependence means mean-ranks tests cannot guarantee control of maximum Type I error when equivalent algorithms are present, as also discussed by Fligner & Killeen (1984).

4. Alternative Two-Algorithm Post-hoc Procedures

To address the pool-dependence flaw, tests evaluating only the paired performances of $A$ and $B$ are recommended. These include:

a) Sign Test: For each dataset $j$, set $d_j = +1$ if $X_{Aj} > X_{Bj}$, $d_j = -1$ if $X_{Aj} < X_{Bj}$, and $d_j = 0$ for a tie. Let $S$ be the number of $+1$s among the $n$ non-ties. Under $H_0$, $S \sim \text{Binomial}(n, 1/2)$. For large $n$, the normal approximation

$$z = \frac{S - n/2}{\sqrt{n/4}}$$

can be compared to standard normal quantiles.
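For modest $n$ the exact binomial version is preferable to the normal approximation; `scipy.stats.binomtest` gives it directly. A sketch on a hypothetical head-to-head record (the counts are invented for illustration):

```python
from scipy.stats import binomtest, norm

wins_A, wins_B, ties = 33, 17, 2          # hypothetical win/loss/tie counts
n = wins_A + wins_B                        # ties are discarded

# Exact two-sided sign test: S ~ Binomial(n, 1/2) under H0
p_exact = binomtest(wins_A, n, 0.5).pvalue

# Large-n normal approximation: z = (S - n/2) / sqrt(n/4)
z = (wins_A - n / 2) / (n / 4) ** 0.5
p_approx = 2 * norm.sf(abs(z))
```

With 33 wins out of 50 non-ties, both versions reject at the 5% level, though the exact p-value is somewhat larger than the approximation.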

b) Wilcoxon Signed-Rank Test: For each dataset $j$, compute $\delta_j = X_{Aj} - X_{Bj}$ and discard zeros. Rank the $|\delta_j|$ among the $n$ nonzero values and let $T$ be the sum of the ranks attached to positive $\delta_j$. Under $H_0$ (symmetric differences),

$$z = \frac{T - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$$

is approximately standard normal for large $n$. Both tests require family-wise correction (e.g., Bonferroni, Holm) over all $m(m-1)/2$ pairs (Benavoli et al., 2015).
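The recommended workflow, all pairwise Wilcoxon signed-rank tests with Holm's step-down correction, can be sketched as follows (the accuracy matrix is synthetic, and Holm is implemented inline to keep the example self-contained):

```python
import itertools
import numpy as np
from scipy.stats import wilcoxon

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    adj = np.empty_like(p)
    running_max = 0.0
    for rank, idx in enumerate(order):
        val = min(1.0, (len(p) - rank) * p[idx])   # (C - rank) * p, capped at 1
        running_max = max(running_max, val)        # enforce monotonicity
        adj[idx] = running_max
    return adj

rng = np.random.default_rng(1)
scores = rng.normal(size=(30, 4))            # 30 datasets x 4 algorithms (synthetic)
scores[:, 0] += 0.8                          # make algorithm 0 clearly better

pairs = list(itertools.combinations(range(scores.shape[1]), 2))
raw = [wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs]
adj = holm(raw)                              # family-wise control over all 6 pairs
```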

5. Recent Developments: S-Statistics and Graphical Interpretation

Recent work (Elamir, 2022) proposes a graphical “S-plot” approach that simultaneously provides the global Friedman test and local post-hoc indications with drastically fewer comparisons. For $G$ treatments and $B$ blocks, each treatment $g$ has a score

$$S_g = \frac{(R_g - \bar{R})^2}{B G (G+1)/12},$$

where $\bar{R} = B(G+1)/2$ is the expected rank sum under $H_0$. The sum $F = \sum_{g=1}^G S_g$ recovers the classical Friedman statistic.
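The identity $F = \sum_g S_g$ is easy to verify numerically against SciPy's Friedman statistic (synthetic, tie-free data, so no tie correction intervenes):

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 5))                  # B = 25 blocks, G = 5 treatments

B, G = X.shape
ranks = np.apply_along_axis(rankdata, 1, X)   # rank within each block
R = ranks.sum(axis=0)                         # rank sum R_g per treatment
R_bar = B * (G + 1) / 2                       # expected rank sum under H0

S = (R - R_bar) ** 2 / (B * G * (G + 1) / 12) # per-treatment S_g scores
F = S.sum()                                   # recovers Friedman's statistic

F_scipy, _p = friedmanchisquare(*X.T)
```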

The distribution of $S_g$ is well approximated by matching gamma moments:

  • $E[S_g]$, $\text{Var}(S_g)$, and the third moment $M_3(S_g)$ are derived from those of $F$.
  • A Gamma($a$, $\beta$) is fitted with $a = 4/\gamma_1(S_g)^2$ and $\beta = a/E[S_g]$, matching the mean and skewness.
  • The threshold $DL = Q_{\text{Gamma}}(1-\alpha_{PT}; a, \beta)$ provides Bonferroni-adjusted familywise Type I error control.

The S-plot visualizes each $S_g$; treatments with $S_g > DL$ are significant contributors to rejection. This reduces testing from $G(G-1)/2$ comparisons to $G$ with controlled error rates and delivers immediate interpretive insight (Elamir, 2022).
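Elamir derives the gamma parameters from closed-form moments of $S_g$; absent those formulas, a permutation Monte Carlo gives a serviceable stand-in for the threshold $DL$. This is a hedged substitute, not the paper's method: simulate rank sums under $H_0$ by drawing independent rank permutations per block and take the $1 - \alpha/G$ quantile of the simulated $S_g$:

```python
import numpy as np

def sg_threshold_mc(B, G, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo stand-in for the per-treatment threshold DL.

    Assumption: under H0 each block is an independent uniform permutation of
    ranks 1..G; the alpha/G level mirrors the Bonferroni-style adjustment.
    (Not Elamir's gamma-moment formula -- a permutation-based substitute.)
    """
    rng = np.random.default_rng(seed)
    # Uniform rank permutations per block via argsort of random draws
    ranks = np.argsort(rng.random((n_sim, B, G)), axis=2) + 1
    R_g = ranks[:, :, 0].sum(axis=1)           # rank sum of one treatment (symmetry)
    S_g = (R_g - B * (G + 1) / 2) ** 2 / (B * G * (G + 1) / 12)
    return np.quantile(S_g, 1 - alpha / G)

dl = sg_threshold_mc(B=25, G=5)
```

Since $E[S_g] = (G-1)/G$ under $H_0$, the simulated null behaves roughly like a scaled $\chi^2_1$, so for $G = 5$, $\alpha = 0.05$ the threshold lands near $0.8 \times \chi^2_{1,\,0.99}$.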

6. Empirical Validation and Practical Recommendations

Simulation studies have demonstrated that both the classical Friedman and S-statistic procedures maintain empirical Type I error within Bradley's robustness bounds across a range of $G$ and $B$ for both normal and exponential data, with accuracy improving as $B$ increases. Real-data applications (e.g., class size effects on children’s questions, per Gibbons & Chakraborti) confirm that the S-plot precisely identifies the dominant treatments responsible for global rejection, reducing the reliance on multiple pairwise post-hoc tables (Elamir, 2022).

Practical guidelines:

  • Apply the Friedman test as the omnibus procedure for multiple-treatment, multiple-block designs.
  • Avoid the classical Nemenyi mean-ranks test; its results for a pair may depend irrationally on other treatments present.
  • Prefer pairwise comparisons based exclusively on two-algorithm tests (Wilcoxon signed-rank if symmetry plausible, else sign test), with appropriate correction for multiple comparisons (Benavoli et al., 2015).
  • Consider global-to-local visualization approaches such as S-plots for succinct interpretability and error control.

7. Summary Table: Properties and Critique

| Method | Pool Dependence of Pairwise Decisions | Familywise Error Control | Number of Comparisons |
|---|---|---|---|
| Friedman + Nemenyi | Yes (dependent on pool) | Yes (nominal) | $O(m^2)$ |
| Friedman + pairwise (Wilcoxon/sign) | No (pool-independent) | Yes (Bonferroni/Holm) | $O(m^2)$ |
| S-statistic / S-plot [Editor’s term] | No (per-treatment) | Yes (gamma approx., Bonferroni) | $O(m)$ |

The core limitation of the mean-ranks test is its statistical dependence on the composition of the entire set of algorithms, which undermines its relevance for pairwise inference. Alternative approaches leveraging either pairwise-only tests or S-statistical visualizations achieve more interpretable, pool-independent, and statistically valid post-hoc inference (Benavoli et al., 2015, Elamir, 2022).
