EXP: Explainability Score for AI Transparency
- EXP is a metric that quantifies AI explainability by assessing how transparent and interpretable a model's decision-making process is.
- It integrates machine-centric methods (e.g., impact scores and coverage metrics) with questionnaire and language-model-based techniques to ensure robust evaluation.
- EXP is used for benchmarking XAI methods, calibrating explainability standards, and correlating with user effectiveness across diverse AI domains.
Explainability Score (EXP) is a quantitative or structured qualitative metric designed to assess the degree to which AI or ML systems render their internal decision-making transparent and understandable to various stakeholders. EXP scores are motivated by the need for rigorous, repeatable evaluation of explainability across diverse AI models, domains, and explanation modalities.
1. Formal Definitions and Variants of EXP
Multiple formulations of EXP have been proposed, each grounded in distinct theoretical and methodological traditions:
A. Machine-Centric Impact Score and Impact Coverage
Lin et al. (Lin et al., 2019) introduce a composite machine-centric EXP metric for image classification systems, constructed from two fundamental measures:
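- Impact Score ($I$): For input $x$ and classifier $f$, an explainability method returns a binary mask $M$ identifying “critical factors.” The ablated input $x'$ (the input with the masked region removed) is classified as $f(x')$. With threshold $\tau$, $I$ is defined over $N$ test inputs as:
$$I = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}(x'_i) \neq \hat{y}(x_i) \;\lor\; c(x'_i) \le (1 - \tau)\, c(x_i)\right]$$
where $\hat{y}$ is the predicted label and $c$ the confidence. $I_{\text{strong}}$ counts only label flips.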
- Impact Coverage ($C$): For adversarial perturbations $P$, measures IoU overlap between $P$ and mask $M$:
$$C = \frac{|P \cap M|}{|P \cup M|}$$
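- Composite EXP: A weighted sum (after normalization) of the Impact Score $I$ and Impact Coverage $C$,
$$\mathrm{EXP} = \alpha\, I + (1 - \alpha)\, C$$
with $\alpha \in [0, 1]$ balancing general and adversarial scenarios.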
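As a rough illustration of the last two ingredients, the sketch below computes Impact Coverage as an IoU between two boolean arrays and combines it with an Impact Score through a single weight. The function names, the default `alpha=0.5`, the convex form of the weighting, and the assumption that both inputs already lie in [0, 1] are illustrative choices, not details taken from Lin et al.

```python
import numpy as np

def impact_coverage(patch, mask):
    """IoU between an adversarial patch and an explanation mask (boolean arrays)."""
    patch = np.asarray(patch, dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    union = np.logical_or(patch, mask).sum()
    if union == 0:
        # Both regions empty: treat the overlap as zero (an assumption).
        return 0.0
    return float(np.logical_and(patch, mask).sum() / union)

def composite_exp(impact, coverage, alpha=0.5):
    """Weighted sum of two scores assumed to be normalized to [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * impact + (1.0 - alpha) * coverage

# Toy 4x4 example: a 2x2 patch and a 2x2 mask that share one pixel.
patch = np.zeros((4, 4), dtype=bool)
patch[0:2, 0:2] = True
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

coverage = impact_coverage(patch, mask)  # 1 shared pixel out of 7 in the union
score = composite_exp(impact=0.5, coverage=coverage, alpha=0.5)
```

Setting `alpha` closer to 1 emphasizes the general ablation scenario, while values closer to 0 emphasize alignment with adversarial patches.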
B. Questionnaire and Factor-Analysis-Based EXP
Chen & Eickhoff (Chen et al., 2023) operationalize EXP for information retrieval via a weighted sum over latent explainability factors derived from a 19-item user questionnaire:
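$$\mathrm{EXP} = \sum_{f \in F} w_f \sum_{i \in f} w_{f,i}\, r_i$$
where $F$ is the set of factors, $w_f$ and $w_{f,i}$ the respective factor and item loadings, $r_i$ the user’s response to item $i$, and the resulting values are normalized to a common range.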
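A minimal sketch of such a factor-weighted score follows. The factor names, item identifiers, loadings, the 1-to-5 response scale, and the choice to rescale everything to [0, 1] are all hypothetical placeholders for illustration; the actual loadings come from the factor analysis in Chen & Eickhoff.

```python
def questionnaire_exp(responses, item_loadings, factor_weights, scale=(1, 5)):
    """Weighted sum over factors of weighted item responses, rescaled to [0, 1].

    responses:      {item_id: raw response on the rating scale}
    item_loadings:  {factor: {item_id: loading}}
    factor_weights: {factor: weight}
    """
    lo, hi = scale
    total_factor_weight = sum(factor_weights.values())
    score = 0.0
    for factor, loadings in item_loadings.items():
        # Rescale each response to [0, 1], then take the loading-weighted mean.
        weighted = sum(
            w * (responses[item] - lo) / (hi - lo) for item, w in loadings.items()
        )
        factor_score = weighted / sum(loadings.values())
        score += factor_weights[factor] / total_factor_weight * factor_score
    return score

# Hypothetical two-factor, three-item questionnaire.
item_loadings = {
    "transparency": {"q1": 0.8, "q2": 0.2},
    "justification": {"q3": 1.0},
}
factor_weights = {"transparency": 0.6, "justification": 0.4}
responses = {"q1": 5, "q2": 3, "q3": 1}

exp_score = questionnaire_exp(responses, item_loadings, factor_weights)
```

Dividing by the sums of the loadings keeps the score in [0, 1] regardless of how the loadings are scaled, which is one possible normalization among several.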
C. Degree of Explainability (DoX/EXP) via Information Pertinence
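Sovrano & Vitali (Sovrano et al., 2021) define EXP as the average coverage of archetypal question-aspect pairs by pertinent details extracted from explanation texts, estimated via LLM embeddings and cosine similarity. For details $D$, aspects $A$, question templates $Q$, and content $\Phi$:
$$\mathrm{EXP}(\Phi) = \frac{1}{|A|\,|Q|} \sum_{a \in A} \sum_{q \in Q} \sum_{d \in D} p(d, q_a)$$
where $p(d, q_a)$ is the pertinence of detail $d$ to question $q_a$, the template $q$ instantiated for aspect $a$.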
2. Factor Structure and Measurement Dimensions
The explainability captured by EXP is inherently multidimensional. Chen & Eickhoff (Chen et al., 2023) identify six core factors through exploratory factor analysis:
| Factor | Group | Description |
|---|---|---|
| Global Interpretability | Roadblocks | Understanding overall system logic |
| Local Interpretability | Roadblocks | Grasping why specific results were returned |
| Transparency | Utility | Visibility of internal decision signals |
| Justification | Utility | Perceived soundness of provided reasons |
| Granularity | Utility | Appropriateness of explanation detail level |
| Sufficiency | Utility | Adequacy of information to enable actionable insight |
Each factor aggregates specific survey items, with empirically determined item and factor weights.
3. Algorithmic Implementations
A. Machine-Centric Procedures (Vision)
- Identify the salient region $M$ per explanation method $g$ for each test input $x$.
- Ablate $M$ to produce $x'$.
- Record change in predicted label ($I_{\text{strong}}$) and/or drop in confidence by $\tau$ ($I$).
- For adversarial inputs, compute IoU of $M$ with patch $P$ ($C$).
- Normalize and aggregate as $\mathrm{EXP} = \alpha\, I + (1 - \alpha)\, C$.
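The ablate-and-record steps above can be sketched with a toy linear softmax classifier standing in for a real vision model. Everything here is an assumption made for illustration: the classifier, the zero-fill ablation, the relative confidence-drop test, the choice to track the confidence of the originally predicted class, and the name `evaluate_explainer`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ablate(x, mask, fill=0.0):
    """Replace masked pixels with a constant fill value (one possible ablation)."""
    return np.where(mask, fill, x)

def evaluate_explainer(weights, inputs, masks, tau=0.5):
    """Return (impact, strong_impact) for a toy linear classifier.

    A sample counts toward `impact` if the label flips or the confidence of the
    originally predicted class falls to at most (1 - tau) of its original value;
    `strong_impact` counts label flips only.
    """
    flips = hits = 0
    for x, mask in zip(inputs, masks):
        probs = softmax(weights @ x.ravel())
        label = int(np.argmax(probs))
        probs_abl = softmax(weights @ ablate(x, mask).ravel())
        flipped = int(np.argmax(probs_abl)) != label
        dropped = probs_abl[label] <= (1.0 - tau) * probs[label]
        flips += int(flipped)
        hits += int(flipped or dropped)
    n = len(inputs)
    return hits / n, flips / n

# Toy setup: 2 classes, 2x2 "images"; class 0 reads pixel 0, class 1 reads pixel 1.
weights = np.array([[2.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
x_a = np.array([[1.0, 1.0], [0.0, 0.0]])
x_b = np.array([[3.0, 0.0], [0.0, 0.0]])
mask_pixel0 = np.zeros((2, 2), dtype=bool)
mask_pixel0[0, 0] = True
mask_pixel3 = np.zeros((2, 2), dtype=bool)
mask_pixel3[1, 1] = True

impact, strong_impact = evaluate_explainer(
    weights,
    inputs=[x_a, x_a, x_b],
    masks=[mask_pixel0, mask_pixel3, mask_pixel0],
    tau=0.4,
)
```

In this toy run, masking the decisive pixel of `x_a` flips the label, masking an irrelevant pixel changes nothing, and masking the decisive pixel of `x_b` lowers the confidence without flipping the label, so the two scores differ.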
B. Language-Model-Based EXP (DoXpy Pipeline)
- Extract subject–predicate–object “details” $D$ using dependency parsing from explanation text $\Phi$.
- For each aspect $a$ and question archetype $q$, form the question $q_a$ and embed it with a sentence encoder.
- Compute pertinence $p(d, q_a)$; filter details by a pertinence threshold and by redundancy.
- Sum pertinence scores to yield per-question, per-aspect, and final EXP as described above.
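The sketch below walks through the pertinence, filtering, and aggregation steps on precomputed embedding vectors. It is not the DoXpy implementation: the greedy redundancy filter, the default thresholds, the averaging over question-aspect pairs, and the function names are assumptions, and small hand-written vectors stand in for the output of a sentence encoder.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b (nonzero rows assumed)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def dox_style_exp(detail_vecs, question_vecs, n_aspects, n_templates,
                  threshold=0.5, redundancy=0.95):
    """question_vecs rows are ordered aspect-major: row = a * n_templates + q."""
    detail_vecs = np.asarray(detail_vecs, dtype=float)
    question_vecs = np.asarray(question_vecs, dtype=float)

    # Greedy redundancy filter: drop a detail that nearly duplicates a kept one.
    sims = cosine_matrix(detail_vecs, detail_vecs)
    keep = []
    for i in range(len(detail_vecs)):
        if all(sims[i, j] < redundancy for j in keep):
            keep.append(i)

    # Pertinence of each kept detail to each question; zero out weak matches.
    pert = cosine_matrix(detail_vecs[keep], question_vecs)
    pert = np.where(pert >= threshold, pert, 0.0)

    # Sum over details, then average over question-aspect pairs.
    per_pair = pert.sum(axis=0).reshape(n_aspects, n_templates)
    return {
        "per_template": per_pair.mean(axis=0),
        "per_aspect": per_pair.mean(axis=1),
        "exp": float(per_pair.mean()),
    }

# Toy vectors: three details (two identical) and one aspect with two templates.
details = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
questions = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
result = dox_style_exp(details, questions, n_aspects=1, n_templates=2)
```

Here the duplicate detail is discarded, one question is fully covered by a kept detail, and the other is not covered at all, so the averaged score sits halfway between the two.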
C. Questionnaire Scoring Pipeline
- Administer structured questionnaire to users post-system interaction.
- Calculate the weighted sum of responses per factor; aggregate and normalize for the final EXP.
4. Empirical Validation and Interpretation
A. Model Impact Analysis (Vision)
On ResNet-50/ImageNet, the Impact Score $I$ quantifies the causal salience of regions picked out by four XAI methods:
- GSInquire (confidence/label strongly affected by ablating the mask)
- Expected Gradients
- SHAP
- LIME
GSInquire identified adversarial patches with comparatively high Impact Coverage $C$, while LIME was rarely aligned with them.
B. Human Alignment Studies
In retrieval (Chen et al., 2023) and domain tasks (Sovrano et al., 2021), EXP scores calculated post-explanation correlate with human utility and effectiveness:
- In IR, EXP distinguished systems with transparent signal visualizations from those without.
- In both finance and healthcare, an increase in EXP for more comprehensive XAI corresponded to a statistically significant increase in user task effectiveness.
C. Qualitative Checklist Approaches
Winikoff et al. (Winikoff et al., 14 Feb 2025) provide a structured scoresheet covering source code availability, explanation veracity, global/local explanation features, concepts, and automation, functioning as a multidimensional rubric rather than a scalar score.
5. Scope, Generalization, and Limitations
Strengths
- Machine-centric EXP: Does not require human annotations or visual inspection; measures model response to perturbation directly (Lin et al., 2019).
- Factor-based and DoX EXP: Captures breadth and relevance of explanations in terms of archetypal queries and user-elicited dimensions (Chen et al., 2023, Sovrano et al., 2021).
- Empirical alignment: Correlates with user task success and subjective perception of understanding.
Limitations
- Dependence on perturbation operator: Machine-centric EXP is sensitive to how regions are ablated; other masking types may alter results (Lin et al., 2019).
- Thresholds and normalization: Choices such as the confidence-drop threshold $\tau$ or item/factor weighting affect sensitivity and comparability.
- Domain specificity: Vision-centric metrics may not transfer directly to NLP or recommendation without redefining "deletion" or "mask."
- Coverage vs. faithfulness: EXP quantifies the presence and informativeness of explanations but does not guarantee their correctness or faithfulness to model internals (Sovrano et al., 2021).
- Scoresheet limitations: The qualitative approach (Winikoff et al., 14 Feb 2025) ensures broad coverage but lacks aggregation or comparability unless a custom scoring rubric is imposed.
6. Practical Applications and Extensions
- Benchmarking XAI algorithms: Machine-centric and DoX-based EXP can be used to empirically compare fidelity and informativeness across explanation algorithms (e.g., LIME, SHAP, GSInquire, TreeSHAP) (Lin et al., 2019, Sovrano et al., 2021).
- Calibration of explainability requirements: Scoresheets serve to map stakeholder needs onto system features and support traceable, standardized evaluation (Winikoff et al., 14 Feb 2025).
- Extension to new domains: EXP can be adapted to recommender systems, QA, and others by recalibrating questions, aspects, and item/factor structure (Chen et al., 2023).
- Integration with fidelity metrics: Combining user-centric and machine-centric scores may yield composite measures of both explainability and faithfulness (Chen et al., 2023).
7. Future Directions and Open Issues
Potential advancements include:
- Dynamic and human-calibrated weighting: Learning weighting parameters (e.g., the balance weight $\alpha$ in machine-centric EXP or factor weights in the questionnaire-based SSE score) from empirical user or task data (Lin et al., 2019, Chen et al., 2023).
- Perturbation robustness: Employing smoother or domain-appropriate perturbations to better assess actual criticality of identified features.
- Multi-modal, multi-level explainability: Extending algorithms and rubrics to support sequence models, structured data, and mixed modalities.
- Faithfulness–explainability tradeoffs: Investigating relationships between scores measuring coverage and informativeness (EXP) and those quantifying faithfulness to underlying decision logic.
Explainability Score (EXP), therefore, represents a family of metrics and frameworks—quantitative, factor-based, or rubric-driven—that enable standardized, reproducible, and context-sensitive evaluation of AI system explainability, each with distinct methodological foundations and domains of applicability (Lin et al., 2019, Chen et al., 2023, Sovrano et al., 2021, Winikoff et al., 14 Feb 2025).