Explanation Satisfaction Scale
- Explanation Satisfaction Scale is a user-centered measure that quantifies how well users feel they understand AI systems, explanations, and interfaces.
- Instruments range from single-item numeric ratings to multi-item Likert batteries that capture facets such as feasibility, trust, and completeness.
- Empirical findings highlight that tailored scale design and rigorous psychometric validation enhance the reliability and interpretability of user satisfaction assessments.
Explanation Satisfaction Scale refers to a family of user-centered evaluation instruments and criteria developed to quantify the degree to which users feel they understand AI systems, their explanations, and associated user interfaces. These scales serve as subjective outcome measures in explainable AI (XAI) research, particularly to assess users’ mental models, perceived clarity, and suitability of explanations for intended tasks. Although the scales are widely adopted, measurement approaches are highly heterogeneous, ranging from ad-hoc single-item scores to rigorously validated multi-item Likert batteries, which reflects disciplinary nuances and task demands. Explanation satisfaction is recognized as a core criterion within broader models of explanation quality but is distinguished from more objective, task-based constructs such as appropriate trust.
1. Formal Definitions and Theoretical Position
The term “Explanation Satisfaction” is defined in the quality evaluation literature as “the degree of how much the users feel they understand the system, the explanations, and the user interface” (Löfström et al., 2022). This definition collectively subsumes related constructs such as explanation goodness, comprehensibility, causability, and the System Causability Scale. In practice, explanation satisfaction operationalizes the subjective impact of an explanation on a user’s mental model, reflecting how well an explainable user interface, algorithmic output, or explanation is perceived as “suitable for the intended purpose” (Löfström et al., 2022).
Within comprehensive models of explanation quality, explanation satisfaction sits squarely in the “user aspect” as a subjective mental-model outcome. It is contrasted with “appropriate trust,” which quantifies a user’s ability to discriminate correct from incorrect system outputs and act accordingly (Löfström et al., 2022). Whereas appropriate trust can be evaluated as an objective task-performance metric, explanation satisfaction remains an intrinsically subjective, introspective measure and, by widely held consensus, lacks firm, generalizable thresholds for acceptability.
2. Measurement Instruments: Scales and Formats
Measurement approaches for explanation satisfaction range from single-item, face-valid ad-hoc questions to multi-item, psychometrically validated scales.
Single-Item Scales:
Recent studies, such as Kaufman et al. (2024), assess explanation satisfaction in situ using a single numeric item:
- Item: “How satisfied are you with the AV’s explanation?”
- Response format: 0–10 slider (0 = “not at all satisfied,” 10 = “completely satisfied”)
- Scoring: Raw scenario-level rating, optionally averaged across scenarios for aggregate analysis
Such single-item measures are direct, but a single item cannot support internal-consistency reliability or factor-analytic validity assessment; they are typically justified for scenario-specific, high-throughput user studies.
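The scoring rule for the single-item format (raw scenario-level ratings, optionally averaged across scenarios) can be sketched as follows. This is a minimal illustration, not the procedure used by Kaufman et al.: the record layout, the identifiers, and the rating values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (participant_id, scenario_id, rating on the 0-10 slider).
ratings = [
    ("p1", "s1", 8), ("p1", "s2", 6),
    ("p2", "s1", 4), ("p2", "s2", 5),
]

def participant_means(records, lo=0, hi=10):
    """Average each participant's scenario-level ratings after range-checking them."""
    by_participant = defaultdict(list)
    for participant, _scenario, rating in records:
        if not lo <= rating <= hi:
            raise ValueError(f"rating {rating} outside the {lo}-{hi} slider range")
        by_participant[participant].append(rating)
    return {p: mean(vals) for p, vals in by_participant.items()}

scores = participant_means(ratings)
```

Keeping the raw scenario-level records alongside the averages preserves the option of scenario-level modeling later.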
Multi-Item Likert Scales:
In-depth studies (e.g., Domnich et al., 7 Apr 2025) employ multi-dimensional batteries, such as the CounterEval “Explanation Satisfaction Scale,” which incorporates both overall satisfaction and specific explanatory virtues:
- Overall satisfaction: “Overall, I am satisfied with this explanation” (1–6 scale)
- Explanatory criteria: Measured individually on 6-point agreement or 5-point complexity scales, covering feasibility (“The suggested changes seem realistic and actionable in this context”), coherence, complexity, understandability, completeness, fairness, and trust.
Typical items, adopted or recommended in the literature, include:
| Construct | Example Item | Scale |
|---|---|---|
| Overall Satisfaction | “Overall, I am satisfied with this explanation.” | 1–6 Likert |
| Feasibility | “The suggested changes seem realistic and actionable in this context.” | 1–6 Likert |
| Trust | “I believe that if I followed these suggested changes, they would succeed.” | 1–6 Likert |
| Understandability | “I feel like I understood the phrasing of the explanation well.” | 1–6 Likert |
| Complexity | “The explanation was too simple/too complex/just right.” | –2 to +2 |
Multi-item scales enable calculation of internal-consistency reliability (e.g., Cronbach’s α), factor analytic exploration, and disentangling of explanatory dimensions driving user satisfaction (Domnich et al., 7 Apr 2025).
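As an illustration of the internal-consistency calculation mentioned above, here is a minimal sketch of Cronbach's α using only the Python standard library. The response matrix is invented for the example, and the sketch assumes that only items sharing the same 1–6 agreement format are passed in (the bipolar complexity item would need separate handling).

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-6 agreement ratings: 5 respondents x 4 items.
responses = [
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [6, 5, 6, 6],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
alpha = cronbach_alpha(responses)
```

The formula used is α = k/(k−1) · (1 − Σ item variances / variance of the total score), with sample variances throughout.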
3. Psychometric Properties and Factor Structure
When validated, explanation satisfaction scales generally achieve high sampling adequacy and clear factor structures:
- The Domnich et al. (7 Apr 2025) scale achieved a Kaiser–Meyer–Olkin value of 0.893 and a three-component factor structure (scree-plot “elbow” after three components).
- The first factor explained 40.5% of variance, with strong loadings for feasibility (0.7805), trust (0.7896), consistency (0.7697), completeness (0.6273), and fairness (0.6884). Understandability and complexity loaded on separate factors.
- A regression model for overall satisfaction, together with the share of variance it explains, is reported in Domnich et al. (7 Apr 2025).
- Feasibility and trust consistently emerged as the strongest drivers, while completeness provided a secondary boost. Understandability had a small negative coefficient in the presence of other predictors, and fairness’s contribution was marginal.
This structure suggests that users’ overall satisfaction is determined by an intertwined set of explanatory virtues, with actionability and trustworthiness as primary pillars. The presence of stable factor structures across samples and domains supports the psychometric robustness of these instruments.
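The sampling-adequacy and variance-explained figures discussed in this section can be computed along the following lines. This is a sketch under stated assumptions: it uses NumPy, treats the component analysis as a principal-component decomposition of the item correlation matrix (the exact extraction method is not specified here), and runs on synthetic data generated from a single hypothetical latent factor. The function names are illustrative.

```python
import numpy as np

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure for a (respondents x items) array."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations are obtained by rescaling the inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)

def explained_variance_ratio(data):
    """Share of total variance per principal component, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

# Synthetic ratings: 200 respondents, 5 items driven by one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
items = latent + 0.5 * rng.normal(size=(200, 5))

kmo = kmo_overall(items)
ratios = explained_variance_ratio(items)
```

Plotting `ratios` against component index gives the scree plot in which an “elbow” is sought.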
4. Application Contexts and Empirical Findings
Explanation satisfaction instruments have been deployed in a range of domains:
Autonomous Vehicles:
In simulated driving studies, explanation satisfaction has been shown to be highly sensitive to explanation errors. Kaufman et al. (2024) report that mean satisfaction scores decreased sharply as explanation errors increased, across three conditions listed here from highest to lowest mean satisfaction:
- Accurate explanations
- Low-error (“what” correct, “why” incorrect)
- High-error (“what” + “why” incorrect)
Linear mixed-effect modeling confirmed highly significant decrements for each increment in error severity. Contextual factors such as scenario harm and driving difficulty further amplified these effects.
Counterfactual Explanations:
In evaluations of counterfactual XAI methods, Domnich et al. (7 Apr 2025) demonstrated that, in addition to feasibility and trust, completeness and consistency provided meaningful contributions to satisfaction. Complexity appeared psychometrically separable and did not consistently penalize satisfaction, indicating that length or detail, when mapped appropriately to user expertise, need not be detrimental.
Demographic analyses showed ML and medical-expert participants applied more stringent standards, suggesting the necessity of tailoring explanation designs to user profiles.
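The driver analysis described for counterfactual explanations amounts to regressing overall satisfaction on the individual criterion ratings. A minimal ordinary-least-squares sketch with NumPy is shown below. The predictor names, the generating coefficients, and the data are all invented for illustration and are not the values reported by Domnich et al.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ 1 + X by least squares; return (coefficients incl. intercept, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Synthetic criterion ratings (columns: feasibility, trust, completeness).
rng = np.random.default_rng(1)
X = rng.uniform(1, 6, size=(300, 3))
# Hypothetical generating model, chosen only to exercise the code.
y = 0.5 + 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.5, size=300)

beta, r_squared = fit_ols(X, y)
```

Here `beta[0]` is the intercept and `beta[1:]` are the per-criterion coefficients; comparing their magnitudes is meaningful only when the predictors share a common scale, as the 1–6 agreement items do.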
5. Comparative Role and Limitations
Explanation satisfaction, while widely reported (cited in 10 of 14 major XAI evaluation surveys), is noted for its subjective, introspective nature (Löfström et al., 2022). Head-to-head comparative evaluations across explanation methods generally lack agreed-upon cut-points or thresholds for acceptability. As such, the literature recommends complementing satisfaction metrics with objective, task-based measures such as appropriate trust, especially in comparative studies, to mitigate ceiling/floor effects and possible demand characteristics in subjective ratings.
Although satisfaction remains indispensable for user-centered design iterations and post-hoc usability assessment, these limitations suggest that researchers should:
- Pilot multi-item scales and report standard psychometrics (e.g., Cronbach’s α, item-total correlations)
- Pair subjective satisfaction measures with objective behavioral tasks (e.g., error detection, rejection/acceptance of system outputs)
- Account for domain, scenario, and expertise effects through stratified or customized battery development
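One way to pair a subjective satisfaction rating with an objective behavioral measure is sketched below. The operationalization of appropriate trust used here (the share of trials in which a user accepts a correct output or rejects an incorrect one) is an assumption made for illustration, not a definition taken from Löfström et al.; the trial fields and values are hypothetical.

```python
from statistics import mean

# Hypothetical trials: was the system output correct, did the user accept it,
# and how satisfied was the user with the explanation (1-6)?
trials = [
    {"system_correct": True,  "user_accepted": True,  "satisfaction": 5},
    {"system_correct": False, "user_accepted": True,  "satisfaction": 5},
    {"system_correct": False, "user_accepted": False, "satisfaction": 2},
    {"system_correct": True,  "user_accepted": True,  "satisfaction": 6},
]

def appropriate_trust_rate(trials):
    """Share of trials where acceptance matched the correctness of the output."""
    hits = [t["system_correct"] == t["user_accepted"] for t in trials]
    return sum(hits) / len(hits)

def summarize(trials):
    """Report the subjective and the objective measure side by side."""
    return {
        "mean_satisfaction": mean(t["satisfaction"] for t in trials),
        "appropriate_trust": appropriate_trust_rate(trials),
    }

summary = summarize(trials)
```

Reporting both numbers together makes it visible when an explanation is rated highly yet leads users to accept incorrect outputs, as in the second trial above.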
6. Recommendations for Scale Development and Best Practices
Authors surveying the XAI literature recommend that future scale development around explanation satisfaction adhere to the following guidelines (Löfström et al., 2022; Domnich et al., 7 Apr 2025):
- Adopt clear construct definitions: satisfaction should refer to perceived user understanding of system, explanation, and interface.
- Use or adapt multi-item Likert scales (5–7 points), with items addressing interpretability, helpfulness, confidence, and overall satisfaction.
- Report psychometrics: especially internal consistency metrics and factor analytic support; pilot scale items to ensure sampling adequacy and structural validity.
- In multi-method comparisons, combine subjective satisfaction with objective measures of trust or performance.
- Tailor item content and response anchors to the expertise and needs of the target user population.
These practices ensure that explanation satisfaction scales retain content validity, reliability, and interpretative value across diverse XAI contexts and user groups.
References
- [A Meta Survey of Quality Evaluation Criteria in Explanation Methods, (Löfström et al., 2022)]
- [What Did My Car Say? ... On Comfort, Reliance, Satisfaction, and Driving Confidence, (Kaufman et al., 2024)]
- [Predicting Satisfaction of Counterfactual Explanations ..., (Domnich et al., 7 Apr 2025)]