Compare-xAI: Benchmarking XAI Methods
- Compare-xAI is a comprehensive benchmarking framework that standardizes the evaluation of post-hoc XAI methods through functional tests that directly address user needs.
- It employs a three-level hierarchical scoring system to deliver detailed performance insights for researchers, practitioners, and lay users.
- The interactive web interface visualizes algorithm strengths and weaknesses, enabling transparent and responsible deployment of explainability techniques.
Compare-xAI is a comprehensive, standardized benchmarking framework for the empirical evaluation and ranking of post-hoc explainable artificial intelligence (XAI) methods. Its design integrates diverse functional tests, interpretable hierarchical scoring, and an interactive user interface to facilitate fair, end-to-end comparison of local feature-attribution algorithms. The framework addresses the proliferation of ad hoc evaluation practices in XAI by providing a consistent protocol in which every functional test maps onto a concrete end-user need, and every metric ties explicitly to widely debated desiderata in the explainability literature (Belaid et al., 2022).
1. Motivation and Principles of Compare-xAI
Compare-xAI was developed in response to two fundamental challenges: the overabundance of XAI algorithms and evaluation metrics, and the frequent mismatch between the empirical properties of explanations and actual end-user requirements. The framework is grounded in the recognition that many data scientists misuse or over-interpret attribution maps, partially because existing evaluation methods do not directly interrogate the desiderata—such as fidelity, robustness, or simplicity—that matter for practical adoption or regulatory compliance.
Unlike prior benchmarks that either depend on ground-truth attributions (which rarely exist outside synthetic domains) or focus narrowly on downstream human tasks, Compare-xAI unifies a carefully curated set of "functional tests" drawn from the literature. Each functional test uniquely targets a single user-centric requirement and avoids redundancy or controversial ground truth assertions (Belaid et al., 2022).
2. Functional Test Selection and Categorization
The functional test suite in Compare-xAI is constructed according to stringent selection criteria:
- Every test probes one explicit user requirement, ensuring interpretability of failures.
- Tests are admissible only if there is consensus in the literature about their purpose and interpretation.
- Only pattern-based ground truths are used, e.g., requiring that feature A's importance exceeds feature B's rather than asserting exact attribution values (a sketch of such a test appears at the end of this section).
- Tests are non-redundant, meaning that any two tests with perfectly correlated outcomes across all algorithms are merged.
- The suite is limited to tests with literature precedent, ensuring that all included tests have undergone community scrutiny.
Each retained test is assigned to one of five major categories:
- Fidelity: Does the explanation respect causal relationships between inputs and model outputs?
- Fragility: Is it robust to small perturbations or adversarial attacks?
- Stability: Does the explanation remain consistent across retraining, random seeds, or data subsampling?
- Simplicity: Is the explanation parsimonious and easily interpretable?
- Stress: Does the method falter or degrade gracefully under edge cases or high complexity?
The first release covers 22 tests drawn from over 40 publications, with each test mapping uniquely to one of these five categories (Belaid et al., 2022).
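To make the pattern-based criterion concrete, the following Python sketch shows the general shape such a functional test can take. It is illustrative only: the function names (`fidelity_rank_test`, `explain_fn`) and the synthetic task are assumptions, not part of the Compare-xAI code base.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fidelity_rank_test(explain_fn, n=2000, seed=0):
    """Illustrative pattern-based functional test (fidelity category).

    The synthetic target depends only on feature A, so a faithful explainer
    should rank A above the pure-noise feature B. Only the ranking is
    asserted, never exact attribution values.
    `explain_fn(model, X)` is assumed to return one importance per feature.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))      # columns: [A, B]
    y = 3.0 * X[:, 0]                            # B carries no signal
    model = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)

    importance = np.abs(explain_fn(model, X))    # e.g. mean |attribution| per feature
    return 1.0 if importance[0] > importance[1] else 0.0
```

A test in the actual suite would additionally report partial credit and carry exactly one category tag, so that a failure points to a single, interpretable user requirement.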
3. Hierarchical Scoring and Interpretation
Compare-xAI adopts a three-level hierarchical scoring framework that aligns with different user needs:
- Level 1 (Researcher): The raw score vector of an algorithm across the individual functional tests.
- Level 2 (Practitioner): Category-level averages of those raw scores, yielding a 5-dimensional interpretability profile (fidelity, fragility, stability, simplicity, stress).
- Level 3 (Layman/Executive): A single "comprehensibility" index, which captures how likely the method is to yield reliable, correctly interpretable outputs without the user needing repeated manual checks.
All metrics are standardized to the $[0,1]$ interval, and partial credit is assigned for approximate passes (e.g., if feature A ranks second out of four, credit is $0.75$). This approach replicates realistic, non-expert application scenarios, explicitly avoiding fine-tuning or repeated trials per test to match how unfamiliar practitioners typically interact with XAI tools (Belaid et al., 2022).
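A minimal sketch of how the three levels could be aggregated from per-test scores is shown below. The partial-credit rule reproduces the 0.75 example above, and the unweighted mean used for the comprehensibility index is an assumption for illustration, not the paper's exact definition.

```python
from statistics import mean

def rank_partial_credit(rank, n_features):
    """Partial credit for an approximate pass: a target feature ranked
    2nd out of 4 earns 1 - (2 - 1)/4 = 0.75, matching the example above.
    (The benchmark's exact rule may differ.)"""
    return 1.0 - (rank - 1) / n_features

def aggregate_scores(per_test_scores, test_category):
    """per_test_scores: {test_name: score in [0, 1]}   -> Level 1 (researcher)
    test_category:    {test_name: 'fidelity' | 'fragility' | 'stability'
                                   | 'simplicity' | 'stress'}"""
    # Level 2 (practitioner): average score within each category.
    by_category = {}
    for test, score in per_test_scores.items():
        by_category.setdefault(test_category[test], []).append(score)
    level2 = {cat: mean(scores) for cat, scores in by_category.items()}

    # Level 3 (layman/executive): a single index; an unweighted mean of the
    # category averages is assumed here purely for illustration.
    level3 = mean(level2.values())
    return per_test_scores, level2, level3
```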
4. Benchmarking Results and Case Application
A proof-of-concept benchmarking of 16 algorithms (including Exact Shapley, Kernel SHAP, LIME, permutation importance, TreeSHAP, SAGE, MAPLE) on 22 tests produced actionable quantitative profiles for all evaluated methods. Notably:
| Method | Time per test (s) | Fidelity (%) | Fragility (%) | Simplicity (%) | Stability (%) | Stress (%) | Comprehensibility (%) |
|---|---|---|---|---|---|---|---|
| Kernel SHAP | 328 | 100 | 11 | 100 | XX | XX | 79 |
| Baseline_random | -- | XX | XX | XX | XX | XX | 32 |
("XX" denotes data not provided in the summary; the qualitative trend is that high-fidelity methods can lag on fragility or stability.)
Kernel SHAP achieved perfect scores in Fidelity and Simplicity but lagged in Fragility, demonstrating that no single method dominates across all dimensions. The boxplot of test results revealed that half the evaluated algorithms fully failed at least 17.6% of the functional tests and partially failed another 38.7%, underscoring the prevalence of weaknesses even among widely adopted methods.
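For orientation, Kernel SHAP attributions of the kind scored by these tests can be produced with the open-source `shap` package. The snippet below is a generic usage sketch on a toy model, not the benchmark's own harness.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy regression task: two informative features, two noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + X[:, 1]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Kernel SHAP is model-agnostic but sampling-based, hence comparatively slow
# (cf. the per-test runtime reported in the table above).
background = X[:50]                                   # background data for the explainer
explainer = shap.KernelExplainer(model.predict, background)
local_attributions = explainer.shap_values(X[:5])     # one attribution vector per instance
global_importance = np.abs(local_attributions).mean(axis=0)
```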
The web-based interface plots each algorithm in the space of comprehensibility vs. execution time and highlights the Pareto frontier, with additional metadata on model class support and method requirements.
5. Interactive User Interface and Usage
The Compare-xAI web interface provides real-time exploration at multiple levels of detail:
- Scatterplots of method runtime vs. comprehensibility, with the dot size encoding method portability.
- Filtering by model class, explanation type (global/local), and metric category.
- Drill-down to see per-test results for researchers, five-category bar charts for practitioners, and at-a-glance metrics for lay users.
- Detailed reports include supported model families, required arguments, test-by-test performance breakdown, Pareto-front status, known vulnerabilities, and output interpretability guidance.
This interface is designed to mitigate the risk of misinterpretation by clearly surfacing which dimensions each method excels or fails on, thereby guarding against inappropriate "black-box" faith in explanation outputs (Belaid et al., 2022).
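The Pareto-front view (lower runtime and higher comprehensibility are both better) can be recomputed from any table of per-method results with a few lines. The sketch below uses placeholder figures, not the benchmark's published data.

```python
def pareto_front(methods):
    """methods: {name: (time_seconds, comprehensibility)}.
    A method is on the front if no other method is at least as fast and at
    least as comprehensible, with a strict improvement in one of the two."""
    front = []
    for name, (t, c) in methods.items():
        dominated = any(
            t2 <= t and c2 >= c and (t2 < t or c2 > c)
            for other, (t2, c2) in methods.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Placeholder values, for illustration only.
print(pareto_front({"kernel_shap": (328, 0.79), "baseline_random": (1, 0.32)}))
```

In this toy example both methods sit on the front, reflecting exactly the speed-versus-comprehensibility trade-off the interface is meant to expose.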
6. Limitations and Extensions
The current version of Compare-xAI is deliberately limited in scope:
- It uses only quantitative, pattern-based metrics that do not require contested ground-truth attributions.
- It covers feature-importance-based local explainers applied to supervised classification and regression problems.
- It omits human-grounded subjective evaluation and overload analysis, though extensions are anticipated.
- It does not yet support reinforcement learning, GANs, or unsupervised models; interaction-attribution tests are under development.
Future versions are planned to incorporate repeated-run statistics, parameter sensitivity, hybrid human-machine evaluation, and evaluation of other XAI paradigms (Belaid et al., 2022).
7. Position in the XAI Evaluation Landscape
Compare-xAI differs from prior frameworks by unifying and systematizing functional evaluation into a scalable, interpretable benchmark that is explicitly designed for end-user decision transparency. Its empirical results and visualization pipeline facilitate rapid identification of method-specific strengths and weaknesses, thus guiding both method development and responsible deployment. The benchmark is available at https://karim-53.github.io/cxai/ (Belaid et al., 2022).
By coupling rigorous selection of evaluation tests with a user-centered interface and a hierarchical metric design, Compare-xAI enables principled, reproducible, model-agnostic comparison of XAI methods and aims to serve as a common reference for empirical assessment in contemporary explainability research.