Critical Difference Diagrams
- Critical Difference Diagrams are visualization tools that rank algorithms based on average performance across datasets, highlighting non-significant differences with horizontal bars.
- They use Friedman and Nemenyi tests to statistically evaluate algorithm differences but can be unstable due to their reliance on global comparisons and ordinal rankings.
- The Multiple Comparison Matrix (MCM) provides a robust alternative by offering direct pairwise comparisons with mean differences, win/tie/loss counts, and p-values.
Critical difference (CD) diagrams provide a method for visualizing and statistically interpreting the comparative performance of multiple algorithms evaluated across multiple datasets. First popularized by Demšar (2006), they combine average ranks with post-hoc statistical tests to convey which algorithms differ significantly in aggregate performance. Notable technical and practical shortcomings have motivated alternative methodologies such as the Multiple Comparison Matrix (MCM) (Ismail-Fawaz et al., 2023).
1. Formal Definition and Construction of Critical Difference Diagrams
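Given $m$ algorithms $A_1, \dots, A_m$ evaluated over $N$ datasets $D_1, \dots, D_N$, let $x_{i,j}$ denote the performance of algorithm $A_i$ on dataset $D_j$. For each dataset $D_j$, the algorithms are ranked: the best performance receives rank 1, the next best rank 2, and so on. Ties are handled by averaging the ranks among tied entries. Assuming higher values of $x_{i,j}$ are better, the rank of $A_i$ on $D_j$ is:

$$r_{i,j} = 1 + \left|\{k : x_{k,j} > x_{i,j}\}\right| + \tfrac{1}{2}\left|\{k \neq i : x_{k,j} = x_{i,j}\}\right|$$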
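The average rank (AR) of algorithm $A_i$ is its rank $r_{i,j}$ on dataset $D_j$ averaged over all $N$ datasets:

$$\mathrm{AR}_i = \frac{1}{N} \sum_{j=1}^{N} r_{i,j}$$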
Algorithms are arranged on a horizontal axis in increasing order of AR within the CD diagram; lower AR values correspond to superior average performance.
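The ranking and averaging steps can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names `midranks` and `average_ranks` and the toy accuracy values are assumptions made for the example, and higher scores are assumed to be better.

```python
def midranks(scores):
    """Rank one dataset's scores: highest score gets rank 1, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(perf):
    """perf[i][j] is the score of algorithm i on dataset j; returns one AR per algorithm."""
    m, n = len(perf), len(perf[0])
    totals = [0.0] * m
    for j in range(n):
        for i, r in enumerate(midranks([perf[i][j] for i in range(m)])):
            totals[i] += r
    return [t / n for t in totals]


# Hypothetical accuracies: 3 algorithms (rows) x 4 datasets (columns).
perf = [
    [0.90, 0.80, 0.70, 0.60],
    [0.85, 0.80, 0.75, 0.50],
    [0.80, 0.70, 0.60, 0.40],
]
ar = average_ranks(perf)
print(ar)
```

On the second dataset the first two algorithms tie and each receives rank 1.5, so the ranks on every dataset still sum to 6 for three algorithms.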
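Before delineating significant pairwise differences, the Friedman test is applied to assess whether the differences among the $m$ algorithms over the $N$ datasets exceed what random variation would produce. With $\mathrm{AR}_i$ the average rank of algorithm $i$, the test statistic is:

$$\chi_F^2 = \frac{12N}{m(m+1)} \left[ \sum_{i=1}^{m} \mathrm{AR}_i^2 - \frac{m(m+1)^2}{4} \right]$$

Under the null hypothesis that all algorithms are equivalent, $\chi_F^2$ is approximately $\chi^2$-distributed with $m-1$ degrees of freedom.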
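As a rough sketch, the Friedman statistic can be evaluated directly from a list of average ranks. The function name `friedman_statistic` and the input values are hypothetical; the statistic is the plain form without a tie correction, and the chi-square approximation is known to be coarse when the number of datasets is small.

```python
def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic from average ranks (no tie correction)."""
    m = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (m * (m + 1)) * (sum_sq - m * (m + 1) ** 2 / 4)


# Hypothetical average ranks of 3 algorithms over 4 datasets.
stat = friedman_statistic([1.375, 1.625, 3.0], n_datasets=4)

# 5.991 is the 0.95 quantile of the chi-square distribution with 2 degrees of freedom.
reject_null = stat > 5.991
print(stat, reject_null)
```

Only when the null hypothesis is rejected does the procedure move on to post-hoc pairwise comparisons.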
If the Friedman test rejects the null hypothesis of algorithmic equivalence, post-hoc pairwise differences are assessed using the Nemenyi test. The critical difference (CD) quantifies the minimum absolute difference in AR above which two algorithms' performances are considered statistically distinguishable at significance level $\alpha$, for $m$ algorithms and $N$ datasets:

$$\mathrm{CD} = q_\alpha \sqrt{\frac{m(m+1)}{6N}}$$

where $q_\alpha$ is obtained from the Studentized range distribution for $m$ groups (the Studentized range critical value divided by $\sqrt{2}$).
Pairs of algorithms whose AR difference does not exceed CD are connected by horizontal bars, indicating "no significant difference."
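The following sketch computes the CD and lists the pairs that a diagram would connect with a bar. It assumes $\alpha = 0.05$ and uses the critical values tabulated by Demšar (2006) for 2 to 10 algorithms; the function names and the average ranks are hypothetical.

```python
import math

# q_alpha for alpha = 0.05 (Studentized range quantile divided by sqrt(2)),
# indexed by the number of algorithms, as tabulated in Demšar (2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def critical_difference(m, n_datasets, q_table=Q_ALPHA_005):
    """Nemenyi critical difference for m algorithms compared on n_datasets datasets."""
    return q_table[m] * math.sqrt(m * (m + 1) / (6 * n_datasets))


def non_significant_pairs(avg_ranks, n_datasets):
    """Pairs whose AR gap does not exceed CD, i.e. pairs joined by a bar."""
    cd = critical_difference(len(avg_ranks), n_datasets)
    names = sorted(avg_ranks, key=avg_ranks.get)  # best (lowest AR) first
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(avg_ranks[a] - avg_ranks[b]) <= cd]


# Hypothetical average ranks of 4 algorithms over 14 datasets.
avg_ranks = {"A": 1.5, "B": 2.0, "C": 3.0, "D": 3.5}
cd = critical_difference(4, 14)
bars = non_significant_pairs(avg_ranks, 14)
print(round(cd, 3), bars)
```

Here A and C differ by 1.5 in average rank, which exceeds the CD of roughly 1.25, so they are not connected, while adjacent algorithms are.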
2. Technical Weaknesses and Manipulation Vulnerabilities
Three primary weaknesses of CD diagrams have been identified:
1. Instability of Average Ranks:
Average ranks are inherently relative to the entire set of algorithms. Adding or removing even a weak or redundant comparate affects the positions of all algorithms, potentially altering the AR ordering even if the pairwise performances of interest are unchanged. In concrete cases, the rank ordering between two competing algorithms can be reversed simply by modifying the background comparates, which destabilizes both the diagram and the inferences drawn from it.
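A small constructed example makes the reversal tangible. The scores below are hypothetical and chosen so that a third algorithm C lands between A and B on exactly the datasets where B beats A; the helper `average_ranks` is a simplified sketch that assumes no ties and higher-is-better scores.

```python
def average_ranks(perf):
    """perf maps algorithm name -> list of scores; assumes no ties on any dataset."""
    names = list(perf)
    n = len(perf[names[0]])
    totals = dict.fromkeys(names, 0)
    for j in range(n):
        ordered = sorted(names, key=lambda a: -perf[a][j])  # best first
        for rank, a in enumerate(ordered, start=1):
            totals[a] += rank
    return {a: totals[a] / n for a in names}


# Hypothetical scores on 5 datasets. A beats B on the first three, B beats A on the last two.
scores = {
    "A": [0.90, 0.90, 0.90, 0.60, 0.60],
    "B": [0.85, 0.85, 0.85, 0.80, 0.80],
    "C": [0.50, 0.50, 0.50, 0.70, 0.70],  # background comparate
}

before = average_ranks({k: scores[k] for k in ("A", "B")})
after = average_ranks(scores)
print(before)  # A is ranked ahead of B
print(after)   # B is now ranked ahead of A, although A-vs-B scores are unchanged
```

C never beats B, yet its presence pushes A to rank 3 on two datasets, which is enough to move A behind B in the average-rank ordering.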
2. Disregard for Magnitude:
The AR statistic records only ordinal positions (where an algorithm places on each dataset) and ignores the size or practical significance of the underlying differences. An algorithm that marginally outscores a competitor on many datasets but suffers occasional large defeats may still rank higher, despite being the riskier choice in practice.
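To illustrate with hypothetical numbers, consider two algorithms where A wins four datasets by one accuracy point and loses the fifth by sixty. The sketch assumes no ties and higher-is-better scores.

```python
# Hypothetical accuracies on 5 datasets.
a = [0.81, 0.81, 0.81, 0.81, 0.30]
b = [0.80, 0.80, 0.80, 0.80, 0.90]
n = len(a)

# With two algorithms and no ties, the winner gets rank 1 and the loser rank 2.
ar_a = sum(1 if x > y else 2 for x, y in zip(a, b)) / n
ar_b = 3 - ar_a  # the two ranks sum to 3 on every dataset

# Mean signed difference (A minus B) keeps the magnitude that ranks discard.
mean_diff = sum(x - y for x, y in zip(a, b)) / n

print(ar_a, ar_b, round(mean_diff, 3))
```

A obtains the better average rank (1.2 against 1.8) while its mean accuracy is about 0.11 lower than B's, which is the kind of discrepancy a rank-only summary cannot show.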
3. Problems with Multiple Testing Corrections:
When Nemenyi's post-hoc test is replaced with a set of pairwise Wilcoxon signed-rank tests and a family-wise error rate correction (Holm, Bonferroni) is applied, the significance of any individual pairwise test depends on the full family of $m(m-1)/2$ pairwise comparisons among the $m$ algorithms: on their number and, for step-down procedures such as Holm, on their p-values. Consequently, adding or removing algorithms can change the significance status of a focal pair even though that pair's own raw p-value is unchanged. Concrete cases demonstrate that modifying the set of comparates can flip statistical conclusions about critical pairs (Ismail-Fawaz et al., 2023).
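The effect can be reproduced with a short sketch of Holm's step-down procedure. The function name `holm_reject` and all p-values are hypothetical; the focal pair (A, B) keeps the same raw p-value in both scenarios.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm's step-down procedure. pvalues maps a label to its raw p-value.

    Returns the set of labels whose null hypothesis is rejected.
    """
    ordered = sorted(pvalues, key=pvalues.get)  # smallest p-value first
    k = len(ordered)
    rejected = set()
    for i, label in enumerate(ordered):
        # The i-th smallest p-value (0-based) is compared with alpha / (k - i).
        if pvalues[label] <= alpha / (k - i):
            rejected.add(label)
        else:
            break  # step-down: stop at the first non-rejection
    return rejected


# Two algorithms: a single comparison, tested at alpha = 0.05.
two = {("A", "B"): 0.02}

# Adding algorithm C creates two more comparisons (hypothetical p-values).
three = {("A", "B"): 0.02, ("A", "C"): 0.30, ("B", "C"): 0.50}

print(("A", "B") in holm_reject(two))    # significant on its own
print(("A", "B") in holm_reject(three))  # no longer significant: 0.02 > 0.05 / 3
```

Nothing about the A-versus-B data changed between the two calls; only the size of the comparison family did.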
3. The Multiple Comparison Matrix (MCM) Approach
To address the diagram's instability and the confounding effect of background comparates, the Multiple Comparison Matrix (MCM) was proposed. For $m$ algorithms, the MCM is an $m \times m$ (or $|R| \times |C|$, when only a subset $R$ of algorithms is shown as rows and a subset $C$ as columns) matrix representing all pairwise, head-to-head algorithm comparisons in a transparent, self-contained manner.
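Each cell in the MCM, corresponding to algorithms $A_a$ (row) and $A_b$ (column), summarizes the $N$ paired measurements $\{(x_{a,j}, x_{b,j})\}_{j=1}^{N}$, one pair per dataset, via three statistics:
- Mean difference: $\bar{d}_{a,b} = \frac{1}{N} \sum_{j=1}^{N} (x_{a,j} - x_{b,j})$
- Win/Tie/Loss counts: $W_{a,b} = |\{j : x_{a,j} > x_{b,j}\}|$, $T_{a,b} = |\{j : x_{a,j} = x_{b,j}\}|$, $L_{a,b} = |\{j : x_{a,j} < x_{b,j}\}|$
- Pairwise p-value (e.g., two-sided Wilcoxon signed-rank test): $p_{a,b}$
The triplet $(\bar{d}_{a,b},\; W_{a,b}/T_{a,b}/L_{a,b},\; p_{a,b})$ is annotated within the matrix cell. The p-value, optionally compared against a significance threshold $\alpha$, determines cell highlighting but is specific to each algorithm pair and invariant to the composition of the rest of the matrix.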
In pseudocode, MCM construction is as follows:
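- Input: performance matrix $X$ of size $m \times N$ (m algorithms, N tasks)
- Select row and column sets, $R, C \subseteq \{1, \dots, m\}$
- For each pair $(a, b) \in R \times C$ with $a \neq b$:
  - Compute $d_j = x_{a,j} - x_{b,j}$ for $j = 1, \dots, N$
  - $\bar{d} = \frac{1}{N} \sum_{j} d_j$
  - $W = |\{j : d_j > 0\}|$, $T = |\{j : d_j = 0\}|$, $L = |\{j : d_j < 0\}|$
  - $p = \mathrm{Wilcoxon}(x_{a,\cdot},\, x_{b,\cdot})$
  - Store cell $(\bar{d},\; W/T/L,\; p)$
- Visualize as a heat map, bold cells where $p < \alpha$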
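The pseudocode can be turned into a small runnable sketch. This is not the authors' published implementation: the names `wilcoxon_p_normal` and `mcm` and the scores are hypothetical, the heat map step is omitted, and the p-value uses the large-sample normal approximation to the two-sided Wilcoxon signed-rank test (zero differences dropped, midranks for ties, no continuity correction) so that only the standard library is needed.

```python
import math


def wilcoxon_p_normal(diffs):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    d = [x for x in diffs if x != 0]  # zero differences are discarded
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks, tie_term, pos = [0.0] * n, 0.0, 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        t = end - pos + 1
        tie_term += t ** 3 - t  # tie correction for the variance
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1  # midrank of the tied block
        pos = end + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def mcm(perf, rows=None, cols=None):
    """perf maps algorithm name -> scores per dataset; returns one cell per (row, col) pair."""
    rows = rows or list(perf)
    cols = cols or list(perf)
    cells = {}
    for a in rows:
        for b in cols:
            if a == b:
                continue
            d = [x - y for x, y in zip(perf[a], perf[b])]
            cells[(a, b)] = {
                "mean_diff": sum(d) / len(d),
                "wtl": (sum(x > 0 for x in d), sum(x == 0 for x in d), sum(x < 0 for x in d)),
                "p": wilcoxon_p_normal(d),
            }
    return cells


# Hypothetical accuracies (in percent) on 6 datasets.
scores = {
    "A": [90, 85, 80, 75, 70, 65],
    "B": [80, 85, 70, 80, 60, 50],
    "C": [60, 95, 65, 85, 55, 45],
}
cell_with_c = mcm(scores)[("A", "B")]
cell_without_c = mcm({k: scores[k] for k in ("A", "B")})[("A", "B")]
print(cell_with_c)
```

The A-versus-B cell is computed from those two rows alone, so it is identical whether or not C is part of the comparison. For small numbers of datasets an exact signed-rank test would be preferable to the normal approximation used here.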
4. Comparative Summary: CD Diagram Versus MCM
| Feature | CD Diagram | Multiple Comparison Matrix (MCM) |
|---|---|---|
| Comparison Basis | Global, average rank | Pairwise, direct head-to-head |
| Dependency on Comparate Set | Yes | No |
| Magnitude Information | No (ordinal only) | Yes (mean difference, alongside win/tie/loss counts and p-value) |
| Multiple-Testing Correction | Required, affects all pairs | Optional, only per pair |
While CD diagrams offer an immediately interpretable layout of global rank status and highlight groups with indistinguishable ARs, their inferences can be undermined by inclusion/exclusion of unrelated algorithms. MCM's cell-wise pairwise independence eliminates the possibility of such manipulation, ensuring that results for a given pair of algorithms $(A_a, A_b)$ reflect only direct head-to-head data and associated statistical tests.
5. Interpretation in Benchmarking Contexts
CD diagrams remain prevalent, particularly in time-series classification, due to their concise format for summarizing group performance. However, practical benchmarking objectives frequently prioritize stability, interpretability of pairwise relationships, and transparency regarding the magnitude of differences. CD diagrams can mask substantial changes in comparative conclusions triggered by changes to the background comparates, casting doubt on the robustness of their inferences for critical decision making.
The MCM, by contrast, aligns with the practical goal of understanding exactly how each algorithm pair compares, presenting win/loss rates alongside effect size and inferential statistics in a stable, manipulation-resistant format. This paradigm shift is particularly salient for high-stakes model selection, reproducibility audits, and evaluation studies involving dynamic benchmarking sets (Ismail-Fawaz et al., 2023).
6. Implications and Adoption
The analysis of Ismail-Fawaz et al. (2023) demonstrates that CD diagrams, despite their popularity for summarizing algorithmic progress, possess vulnerabilities that render their conclusions sensitive to arbitrary choices in comparate inclusion. The MCM methodology was advanced to supplant global rank-based summaries with descriptive, pairwise statistics that are invariant under comparate set manipulations. A plausible implication is that benchmarking best practices may increasingly favor MCM-based reporting, especially in domains where algorithm sets are subject to change or where pairwise stability is prioritized.
While this transition enhances transparency and robustness, the continued use of CD diagrams in some domains suggests ongoing trade-offs between global summarization and pairwise granularity. The adoption of MCM is supported by the availability of public Python implementations and its alignment with reproducibility and interpretability demands in contemporary research (Ismail-Fawaz et al., 2023).