FGTI: Interpretability in AI Scoring
- FGTI is a framework for interpretability in AI scoring that mandates explanations reflect true computational processes.
- It ensures features are grounded in human-understandable components and supports step-by-step traceability for transparent decision-making.
- AnalyticScore exemplifies FGTI by enabling modular interventions and achieving competitive accuracy while maintaining system transparency.
Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) constitute a principled framework for interpretability in AI-driven automated scoring systems, especially in high-stakes, large-scale educational assessment. Each principle addresses a distinct stakeholder need: explanations must reflect the model's real computations, features must be tied to human-understandable elements, every step must be inspectable, and humans must be able to intervene module by module. The FGTI framework, formalized by Kim et al. in the context of the AnalyticScore system, aims to reconcile transparency with accuracy, supporting accountability and trust without operational compromise (Kim et al., 21 Nov 2025).
1. The Four Principles of FGTI: Definitions and Motivations
FGTI articulates the following four foundational requirements:
- Faithfulness: Explanations of scoring decisions must accurately reflect the computational mechanism behind the model’s prediction. This addresses the risk that post-hoc or "fake" explanations, such as synthetic chain-of-thoughts never actually used by the model, might mislead users or mask bias.
- Groundedness: Initial features computed by the model should represent human-understandable, explicitly identifiable elements of student work and the item task. Opaque features (e.g., deep embedding similarities or unlabeled n-gram vectors) prevent scrutiny and validation of fairness and relevance.
- Traceability: The model should consist of subroutines that each represent a specific, well-defined evidentiary reasoning step on clearly specified inputs. This enables stepwise inspection and debugging.
- Interchangeability: A human should be able to stand in for any reasoning subroutine, replacing that subroutine's output or an entire phase without redesigning the system as a whole.
In the domain of large-scale educational assessment, where interpretability is a necessity rather than a luxury, FGTI is positioned to satisfy the requirements of all stakeholders, including students, educators, and policymakers (Kim et al., 21 Nov 2025).
2. Instantiation of FGTI in AnalyticScore
AnalyticScore exemplifies FGTI via a modular, fully interpretable workflow:
- Component Extraction: Analytic components (concise, stand-alone facts or claims) are automatically extracted from all student responses using an LLM prompt. Each is a clearly verbalizable trait, such as “both pandas and koalas eat plants, whereas pythons are strictly carnivorous” for a comparative zoology item.
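- Response Featurization: For each response $x$ and component $c_j$, a labeling function $\ell_j(x)$ is computed, taking one of three values that correspond to “no mention,” “partial paraphrase,” or “direct paraphrase.” Each label has explicit verbal meaning.
- Scoring via Ordinal Logistic Regression: The scoring mechanism computes an evidence sum $s(x) = \sum_i w_i \, \phi_i(x)$, where the $\phi_i(x)$ are the one-hot encoded labels and the $w_i$ their learned weights, and compares it against learned thresholds $\theta_1 < \dots < \theta_{K-1}$; the predicted score $\hat{y}$ satisfies $\theta_{\hat{y}} \le s(x) < \theta_{\hat{y}+1}$, with the conventions $\theta_0 = -\infty$ and $\theta_K = +\infty$.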
Each phase is associated with a single function call possessing fully specified, human-understandable input and output. The entire system is strictly modular, enabling immediate tracing and manipulation at every step (Kim et al., 21 Nov 2025).
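The featurization and scoring phases can be sketched as plain functions. This is a minimal illustration under stated assumptions, not the AnalyticScore implementation: the component list, the weights, the thresholds, and the keyword-matching `featurize` (a stand-in for the LLM labeling call) are all hypothetical.

```python
from bisect import bisect_right

# Label indices: 0 = no mention, 1 = partial paraphrase, 2 = direct paraphrase.
# Hypothetical components, each with keywords used by the stand-in featurizer.
COMPONENTS = {
    "pandas and koalas eat plants": ["panda", "koala", "plant"],
    "pythons are carnivorous": ["python", "carnivor"],
}
# Hypothetical weights: one per (component, label) pair, i.e. one per one-hot feature.
WEIGHTS = [[0.0, 0.8, 1.5], [0.0, 0.6, 1.2]]
# Hypothetical sorted thresholds separating scores 0, 1, and 2.
THRESHOLDS = [1.0, 2.5]


def featurize(response, components):
    """Stand-in for the LLM featurizer: one label index per component.

    Assumption: crude keyword matching replaces the LLM so the sketch runs offline.
    """
    text = response.lower()
    labels = []
    for keywords in components.values():
        hits = sum(k in text for k in keywords)
        if hits == len(keywords):
            labels.append(2)
        elif hits > 0:
            labels.append(1)
        else:
            labels.append(0)
    return labels


def evidence_sum(labels, weights):
    """Weighted sum over one-hot label features."""
    return sum(weights[j][lab] for j, lab in enumerate(labels))


def predict_score(s, thresholds):
    """Predicted score = number of thresholds that the evidence sum reaches."""
    return bisect_right(thresholds, s)


labels = featurize("Pandas and koalas eat plants, but a python eats meat.", COMPONENTS)
s = evidence_sum(labels, WEIGHTS)
print(labels, round(s, 2), predict_score(s, THRESHOLDS))
```

Because the score depends only on the labels, the weights, and the thresholds, printing those three quantities is already a complete account of how the score was produced.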
3. Empirical Validation
Empirical studies using AnalyticScore confirm that all four FGTI principles are achievable at scale:
- Faithfulness: AnalyticScore’s explanations replay the exact weighted sum and threshold comparison used for scoring. For both the GPT-4.1-mini and Llama-3.1 featurizers, the features and weights shown in an explanation are exactly those that produce the reported quadratic weighted kappa (QWK) of roughly 0.71–0.72 on ASAP-SAS items, giving 100% alignment between explanation and underlying computation.
- Groundedness: In featurization alignment studies, human raters achieved Krippendorff’s α ≈ 0.67–0.72 on the same pairs, indicating high interpretability of components and labels. LLM-based featurizers reached QWK=0.90–0.95 (Science) and QWK=0.72–0.81 (Reading) with respect to human labels.
- Traceability: Worked examples allow inspectors to view each feature value, its associated weight, and the threshold check, facilitating transparent tracing from response through to final score. During error analysis, low-weight components and rare component occurrences were rapidly identified, supporting targeted dataset refinement.
- Interchangeability: User studies confirmed that stakeholders could replace component lists, override individual labels, or swap regression thresholds in under 2 minutes per item—without retraining the model.
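The kind of intervention described in the last bullet can be illustrated with a short sketch: a human corrects one label, and the score is recomputed from the same weights and thresholds with no retraining. All values below are hypothetical, and the bin-based scoring rule is an assumption about how the threshold comparison works.

```python
from bisect import bisect_right


def score(labels, weights, thresholds):
    """Return (score, evidence sum, per-component terms) for a list of label indices."""
    terms = [weights[j][lab] for j, lab in enumerate(labels)]
    s = sum(terms)
    return bisect_right(thresholds, s), s, terms


def with_override(labels, index, new_label):
    """Return a copy of the labels with one human correction applied."""
    corrected = list(labels)
    corrected[index] = new_label
    return corrected


# Hypothetical weights (one per component and label) and sorted thresholds.
WEIGHTS = [[0.0, 0.8, 1.5], [0.0, 0.6, 1.2]]
THRESHOLDS = [1.0, 2.5]

model_labels = [2, 1]  # featurizer output: direct paraphrase, partial paraphrase
before, s_before, _ = score(model_labels, WEIGHTS, THRESHOLDS)

# A reviewer decides the second component was in fact a direct paraphrase.
human_labels = with_override(model_labels, 1, 2)
after, s_after, terms = score(human_labels, WEIGHTS, THRESHOLDS)

print(before, after, terms)
```

Swapping thresholds or the component list follows the same pattern: change one input and call `score` again.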
AnalyticScore achieved scoring accuracy within Δ=0.06 QWK of state-of-the-art uninterpretable models, while outperforming various black-box baselines such as AutoSAS, AsRRN, and NAM (Kim et al., 21 Nov 2025).
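For readers unfamiliar with the metric, the following sketch computes quadratic weighted kappa from its standard definition. It is an illustration only: the ratings are made up, the function name is hypothetical, and returning 1.0 when the expected disagreement is zero is a convention chosen here, not something taken from the source.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_scores):
    """QWK between two lists of integer scores in range(num_scores)."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_scores for _ in range(num_scores)]
    for i, j in zip(rater_a, rater_b):
        observed[i][j] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_scores)) for j in range(num_scores)]

    disagreement = 0.0  # weighted observed disagreement
    expected = 0.0      # weighted disagreement expected by chance
    for i in range(num_scores):
        for j in range(num_scores):
            w = (i - j) ** 2 / (num_scores - 1) ** 2
            disagreement += w * observed[i][j]
            expected += w * hist_a[i] * hist_b[j] / n
    if expected == 0.0:
        return 1.0  # both raters gave the same single score throughout
    return 1.0 - disagreement / expected


# Made-up ratings on a 0-2 scale.
print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], 3))
print(quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.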
4. FGTI Principles in Tabular Summary
| Principle | Definition (per Kim et al., 21 Nov 2025) | AnalyticScore Instantiation |
|---|---|---|
| Faithfulness | Explanations reflect actual computational mechanism | Replays weighted sum and threshold logic verbatim |
| Groundedness | Features are human-understandable, explicitly identifiable elements | Uses analytic components & one-hot paraphrase labels |
| Traceability | Model is decomposable into subroutines on clear, specified input/output | Three explicit function-call phases |
| Interchangeability | Humans can intervene at any reasoning subroutine | Modularity enables arbitrary input/output swaps |
These principles are mutually reinforcing: grounded features enable traceable subroutines, traceability enables faithful explanations, and interchangeability depends on transparent, modular system design.
5. Stakeholder Implications and Practical Relevance
FGTI directly addresses a core impediment to trust in automated educational assessment—namely, the inability to challenge, verify, or improve opaque predictions. By ensuring that every step is grounded, transparent, and modifiable, FGTI simultaneously delivers:
- Inspectable, challengeable, and correctable explanations, fulfilling the needs of test-takers, item developers, and policymakers.
- Rapid iteration: Interchangeability allows near-instant changes to components or thresholds, avoiding time-intensive retraining cycles associated with black-box models.
- Empirical robustness: Performance within Δ=0.06 QWK of state-of-the-art uninterpretable models, achieved without reliance on uninterpretable methods, demonstrates that rigorous interpretability need not come at a large cost in accuracy.
6. Cohesion and Blueprint Character of the FGTI Framework
While each principle individually targets a distinct aspect of interpretability, their interdependence ensures comprehensive transparency. Groundedness supplies the basic units of reasoning; traceability organizes and exposes each transformation; faithfulness ensures that explanations cannot diverge from real model logic; interchangeability empowers domain experts to intervene meaningfully. AnalyticScore’s implementation illustrates how FGTI serves as a blueprint for constructing automated scorers that are simultaneously accurate, transparent, and responsive to stakeholder correction, establishing a reference standard for future research in interpretable AI-driven assessment (Kim et al., 21 Nov 2025).