System Causability Scale (SCS)
- System Causability Scale (SCS) is a psychometric tool that measures the clarity and quality of causal explanations in xAI through a structured 10-item Likert scale.
- It assesses dimensions such as precision, contextual relevance, and explanation effectiveness, yielding normalized scores that indicate overall explanation quality.
- The scale includes an Italian adaptation (I-SCS) and has been applied in domains like medical risk assessment, with potential for broader use in human–AI interactions.
The System Causability Scale (SCS) is a psychometric instrument designed to quantify the quality of explanations in Explainable Artificial Intelligence (xAI), with a particular emphasis on the user's ability to understand and act upon causal reasoning in human–AI interactions. Developed as a counterpart to the System Usability Scale (SUS), the SCS operationalizes the notion of "causability," measuring how well explanations facilitate causal understanding, effectiveness, efficiency, and satisfaction within a specified use context. The SCS and its validated Italian adaptation (I-SCS) are applicable across domains, enabling assessment of both human- and machine-generated explanations for opaque AI/ML systems (Holzinger et al., 2019, Attanasio et al., 22 Apr 2025).
1. Conceptual Foundations: Causability and its Measurement
Causability is defined as the extent to which an explanation achieves a specified level of causal understanding with effectiveness, efficiency, and satisfaction for a particular user context. This construct formalizes the ability of an explanation to enable the user to form or refine a causal mental model that is congruent with the underlying machine model or with the ground truth. While traditional explainability approaches emphasize model internals or feature attributions, causability shifts focus toward the user’s capacity for domain-relevant causal reasoning. The SCS operationalizes this through a structured scale, capturing the completeness of causal factors, granularity, contextual relevance, consistency, and level-of-detail control, among other dimensions (Holzinger et al., 2019).
2. Structure and Scoring of the System Causability Scale
The original SCS comprises ten items, each rated on a five-point Likert scale (1 = strongly disagree to 5 = strongly agree), tailored to probe causal clarity and explanatory usefulness rather than mere interface usability. The scoring procedure for a participant is:
- Let $s_i$ denote the score for item $i$, $i = 1, \dots, 10$.
- The unnormalized total score is $U = \sum_{i=1}^{10} s_i$, with range $[10, 50]$.
- The normalized score is $\mathrm{SCS} = U / 50$, producing a value between $0.2$ and $1.0$.
For example, a participant whose ratings sum to $U = 40$ (for instance, a rating of $4$ on every item) obtains $\mathrm{SCS} = 40/50 = 0.8$ (Holzinger et al., 2019); a minimal scoring sketch in code follows the item list below. Items capture aspects including:
- Inclusion of all relevant causal factors with sufficient precision.
- Contextual understanding of the explanations.
- Adjustable detail level on demand.
- Independence from support when interpreting explanations.
- Facilitated causal understanding.
- Compatibility with the user's prior knowledge.
- Consistency between explanations.
- Rapid learnability for most users.
- Sufficiency of references in explanations.
- Timeliness and efficiency in delivery.
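As a concrete illustration of the scoring procedure described above, the following minimal Python sketch computes the unnormalized total and the normalized SCS from ten Likert ratings; the function name `score_scs` and the input checks are illustrative assumptions, not part of the published instrument.

```python
from typing import Sequence

def score_scs(ratings: Sequence[int]) -> tuple[int, float]:
    """Compute the unnormalized total U and the normalized SCS score.

    `ratings` holds the ten item responses on the 1-5 Likert scale.
    Returns (U, SCS) with U in [10, 50] and SCS = U / 50 in [0.2, 1.0].
    """
    if len(ratings) != 10:
        raise ValueError("The SCS comprises exactly 10 items.")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("Each rating must lie on the 1-5 Likert scale.")
    total = sum(ratings)          # unnormalized score U
    return total, total / 50.0    # normalized score SCS = U / 50

# Example: a participant rating every item 4 obtains U = 40 and SCS = 0.8.
print(score_scs([4] * 10))        # -> (40, 0.8)
```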
3. Empirical Validation and Statistical Considerations
The initial publication of the SCS does not provide full psychometric validation: no factor analysis or reliability coefficients (e.g., Cronbach's $\alpha$) are reported, although methodological analogs to SUS reliability are posited. The recommended approach for future validation is computation of Cronbach's $\alpha$:
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$$
for $k$ items ($k = 10$ for the SCS), where $\sigma_i^2$ is the variance of item $i$ and $\sigma_T^2$ the variance of the total score. The authors suggest a minimum pilot sample of $10$–$15$ participants, increasing to $30+$ for stable variance and reliability estimates. Interpretation of raw SCS scores follows a relative framework: scores near $1.0$ indicate strong causability, while markedly lower scores suggest below-average causal clarity, analogous to SUS interpretive thresholds (Holzinger et al., 2019).
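For the reliability analysis recommended here, a minimal sketch of Cronbach's $\alpha$ over a participants-by-items response matrix might look as follows; the NumPy-based implementation and the simulated data are illustrative assumptions, not code from the cited studies.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) matrix of Likert ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of total scores)
    """
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                         # number of items (10 for the SCS)
    item_vars = responses.var(axis=0, ddof=1)      # per-item sample variances
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of participants' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Example with simulated ratings from 30 participants on 10 items
# (independent random data, so alpha will be near zero; real responses would correlate).
rng = np.random.default_rng(0)
simulated = rng.integers(1, 6, size=(30, 10))
print(round(cronbach_alpha(simulated), 3))
```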
4. International Adaptation: The Italian Version (I-SCS)
The Italian version (I-SCS) underwent rigorous forward–backward translation and content validation, resulting in a nine-item instrument (one item was removed because its CVR fell below the $0.49$ threshold). The validation protocol includes expert panel reconciliation, cognitive interviews for comprehensibility, and calculation of two content validity metrics, the Content Validity Ratio (CVR) and the Content Validity Index (CVI):
- CVR (Lawshe): $\mathrm{CVR} = (n_e - N/2)/(N/2)$, where $n_e$ is the number of experts rating an item as "essential" and $N$ is the total number of experts.
- CVI: proportion of experts rating an item as "relevant," computed both per item (I-CVI) and at the scale level (S-CVI/Ave).
Scoring for the I-SCS is normalized as:
$$\mathrm{I\text{-}SCS} = \frac{\sum_{i=1}^{9} s_i}{45},$$
where $s_i$ is the rating for item $i$, yielding a range of $0.2$ to $1.0$ (Attanasio et al., 22 Apr 2025). Internal consistency, test–retest reliability, and further psychometric properties await future empirical validation.
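The content validity metrics and the nine-item normalization can be sketched as follows; the function names and the assumption that expert judgments are available as simple counts of "essential"/"relevant" ratings are illustrative, not details reported for the I-SCS study.

```python
def lawshe_cvr(n_essential: int, n_experts: int) -> float:
    """Lawshe's Content Validity Ratio: CVR = (n_e - N/2) / (N/2)."""
    half = n_experts / 2
    return (n_essential - half) / half

def item_cvi(n_relevant: int, n_experts: int) -> float:
    """Item-level CVI (I-CVI): proportion of experts rating the item as relevant."""
    return n_relevant / n_experts

def scale_cvi_ave(i_cvis: list[float]) -> float:
    """Scale-level CVI, average method (S-CVI/Ave): mean of the item-level CVIs."""
    return sum(i_cvis) / len(i_cvis)

def score_iscs(ratings: list[int]) -> float:
    """Normalized I-SCS score for the nine retained items: sum / 45, in [0.2, 1.0]."""
    assert len(ratings) == 9 and all(1 <= r <= 5 for r in ratings)
    return sum(ratings) / 45.0

# Example: 8 of 10 experts rate an item essential -> CVR = 0.6, above the 0.49 cutoff.
print(lawshe_cvr(8, 10))          # -> 0.6
print(score_iscs([3] * 9))        # -> 0.6
```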
5. Use Cases and Implementation Guidelines
The SCS has been illustrated in a medical domain through assessment of the Framingham Risk Tool, demonstrating rapid identification of explanatory strengths and weaknesses. Recommended best practices for administering the SCS/I-SCS include:
- Immediate post-session deployment in a controlled explanation context.
- Encouragement of full Likert range utilization.
- Use of $10$–$15$ participants for pilots; expansion for stable reliability assessment.
- Contextual adaptation—particularly for cross-cultural or domain-specific studies.
- Reporting of normalized scores, means, standard deviations, reliability metrics, and any procedural changes in publications, as sketched below (Holzinger et al., 2019, Attanasio et al., 22 Apr 2025).
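In line with the final recommendation above, a brief sketch of how per-participant normalized scores might be aggregated for reporting is shown below; the use of Python's `statistics` module and the pilot values are illustrative assumptions.

```python
import statistics

def summarize_scs(normalized_scores: list[float]) -> dict:
    """Summary statistics for reporting: n, mean, and sample SD of normalized SCS scores."""
    return {
        "n": len(normalized_scores),
        "mean": round(statistics.mean(normalized_scores), 3),
        "sd": round(statistics.stdev(normalized_scores), 3),  # sample SD; requires n >= 2
    }

# Example: normalized SCS scores from a hypothetical 12-participant pilot.
pilot = [0.72, 0.80, 0.64, 0.90, 0.76, 0.68, 0.84, 0.70, 0.78, 0.66, 0.88, 0.74]
print(summarize_scs(pilot))
```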
A plausible implication is that pairing the SCS/I-SCS with usability scales or trust/acceptance measures may yield a multidimensional evaluation of explanation quality in xAI deployments.
6. Strengths, Limitations, and Prospects for Improvement
The SCS is characterized by operational simplicity (ten items, roughly five minutes to administer), a familiar Likert-scale template, and a direct focus on causal understanding. Chief limitations include its ordinal-scale assumptions, the current absence of full psychometric validation, and the lack of established decision thresholds for categorizing causability quality. The Italian adaptation demonstrates robust linguistic and content validity but likewise awaits establishment of internal consistency and test–retest reliability.
Potential improvements suggested by the authors include:
- Large-scale normative studies for benchmark and factor structure determination.
- Domain-specific item extension (e.g., regulatory or ethical compliance).
- Alternative scoring regimes, such as differential item weighting, to sharpen sensitivity for critical aspects of causal explanation; a weighted-scoring sketch follows this list (Holzinger et al., 2019, Attanasio et al., 22 Apr 2025).
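As one hypothetical illustration of differential item weighting, a weighted normalization could take the following form; the weights and the normalization by $5\sum_i w_i$ are assumptions for the sketch, not a scheme proposed in the cited papers.

```python
def weighted_scs(ratings: list[int], weights: list[float]) -> float:
    """Hypothetical weighted SCS: sum(w_i * s_i) / (5 * sum(w_i)), still in [0.2, 1.0]."""
    assert len(ratings) == len(weights)
    return sum(w * s for w, s in zip(weights, ratings)) / (5 * sum(weights))

# Example: double weight on the "all relevant causal factors" item (item 1).
weights = [2.0] + [1.0] * 9
print(round(weighted_scs([4] * 10, weights), 3))  # -> 0.8 (uniform ratings are unaffected)
```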
7. Comparative Assessment of Human vs. Machine Explanations and Future Directions
No systematic head-to-head evaluation of human- versus machine-generated explanations using the SCS is reported in initial publications; only suggestive proposals for such studies appear. The instrument is positioned as a pragmatic baseline for quantifying the user-centered causal clarity of explanations, with the expectation that future research will empirically calibrate its thresholds, establish normative interpretability, and extend its application to diverse settings—including routine comparison of algorithmic and human reasoning efficacy.
The System Causability Scale thus provides a structured, user-centered framework for assessing explanation quality in xAI, poised for further methodological development and domain adaptation across multilingual, multidisciplinary contexts.