
FGTI: Interpretability in AI Scoring

Updated 27 November 2025
  • FGTI is a framework for interpretability in AI scoring that mandates explanations reflect true computational processes.
  • It ensures features are grounded in human-understandable components and supports step-by-step traceability for transparent decision-making.
  • AnalyticScore exemplifies FGTI by enabling modular interventions and achieving competitive accuracy while maintaining system transparency.

Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) constitute a principled framework for interpretability in AI-driven automated scoring systems, especially in high-stakes, large-scale educational assessment contexts. Each principle addresses a unique stakeholder need—ensuring explanations reflect real model computations, features are meaningfully tethered to human-understandable elements, every step is inspectable, and humans can intervene modularly. The FGTI framework, formalized by Kim et al. in the context of the AnalyticScore system, aims to reconcile transparency and accuracy, supporting accountability and trust without operational compromise (Kim et al., 21 Nov 2025).

1. The Four Principles of FGTI: Definitions and Motivations

FGTI articulates the following four foundational requirements:

  1. Faithfulness: Explanations of scoring decisions must accurately reflect the computational mechanism behind the model’s prediction. This addresses the risk that post-hoc or "fake" explanations, such as synthetic chains of thought never actually used by the model, might mislead users or mask bias.
  2. Groundedness: Initial features computed by the model should represent human-understandable, explicitly identifiable elements of student work and the item task. Opaque features (e.g., deep embedding similarities or unlabeled n-gram vectors) prevent scrutiny and validation of fairness and relevance.
  3. Traceability: The model should consist of subroutines that each represent a specific, well-defined evidentiary reasoning step on clearly specified inputs. This enables stepwise inspection and debugging.
  4. Interchangeability: A human should be able to act interchangeably on each reasoning subroutine—swapping out model outputs or phases without wholesale system redesign.

In the domain of large-scale educational assessment, where interpretability is a necessity rather than a luxury, FGTI is positioned to satisfy the requirements of all stakeholders, including students, educators, and policymakers (Kim et al., 21 Nov 2025).

2. Instantiation of FGTI in AnalyticScore

AnalyticScore exemplifies FGTI via a modular, fully interpretable workflow:

  • Component Extraction: Analytic components $c_1, \ldots, c_k$ (concise, stand-alone facts or claims) are automatically extracted from all student responses using an LLM prompt. Each $c_i$ is a clearly verbalizable trait, such as “both pandas and koalas eat plants, whereas pythons are strictly carnivorous” for a comparative zoology item.
  • Response Featurization: For each response $r$ and component $c$, a labeling function $f(r; c) \in \{0, 1, 2\}$ is computed, corresponding to “no mention,” “partial paraphrase,” or “direct paraphrase.” Each label has explicit verbal meaning.
  • Scoring via Ordinal Logistic Regression: The scoring mechanism computes an evidence sum:

$$\eta(r) = \sum_{i=1}^{k} w_i \, f(r; c_i)$$

and compares it against learned thresholds $\theta_j$; the predicted score $j$ satisfies $\theta_j \leq \eta(r) < \theta_{j+1}$.
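The scoring rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AnalyticScore implementation: scores are assumed to start at 0, the learned thresholds are assumed to be stored as a sorted list, and the function names, weights, and thresholds are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def evidence_sum(labels: Sequence[int], weights: Sequence[float]) -> float:
    """Compute eta(r) = sum_i w_i * f(r; c_i) from per-component labels."""
    if len(labels) != len(weights):
        raise ValueError("need exactly one weight per component label")
    return sum(w * f for w, f in zip(weights, labels))


def predict_score(labels: Sequence[int],
                  weights: Sequence[float],
                  thresholds: Sequence[float]) -> int:
    """Return the score j with theta_j <= eta(r) < theta_{j+1}.

    `thresholds` holds theta_1 <= theta_2 <= ... in sorted order; the
    outermost thresholds are treated as -inf and +inf (an assumption).
    bisect_right counts how many thresholds are <= eta, which is j.
    """
    eta = evidence_sum(labels, weights)
    return bisect_right(thresholds, eta)


# Hypothetical item with three components and scores 0, 1, 2.
weights = [1.0, 0.5, 2.0]
thresholds = [1.0, 3.0]

# Labels: direct paraphrase, no mention, partial paraphrase.
print(evidence_sum([2, 0, 1], weights))               # 4.0
print(predict_score([2, 0, 1], weights, thresholds))  # 2
```

Because the lower bound is inclusive, a response whose evidence sum lands exactly on a threshold receives the higher score.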

Each phase is associated with a single function call possessing fully specified, human-understandable input and output. The entire system is strictly modular, enabling immediate tracing and manipulation at every step (Kim et al., 21 Nov 2025).

3. Empirical Validation

Empirical studies using AnalyticScore confirm that all four FGTI principles are achievable at scale:

  • Faithfulness: AnalyticScore’s explanations replay the precise weighted sum and threshold comparison used for scoring, with zero deviation. For both GPT-4.1-mini and Llama-3.1 featurizers, features and weights presented in explanations correspond exactly to those producing a quadratic weighted kappa (QWK) of 0.72–0.71 on ASAP-SAS items, with 100% alignment between explanation and underlying computation.
  • Groundedness: In featurization alignment studies, human raters achieved Krippendorff’s $\alpha \approx$ 0.67–0.72 on the same $(r, c)$ pairs, indicating high interpretability of components and labels. LLM-based featurizers reached QWK=0.90–0.95 (Science) and QWK=0.72–0.81 (Reading) with respect to human labels.
  • Traceability: Worked examples allow inspectors to view each $f(r; c_i)$, associated weight $w_i$, and threshold check, facilitating transparent tracing from response through to final score. During error analysis, low-weight components and rare component occurrences were rapidly identified, supporting targeted dataset refinement.
  • Interchangeability: User studies confirmed that stakeholders could replace component lists, override individual $f(r; c_i)$ labels, or swap regression thresholds in under 2 minutes per item, without retraining the model.
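To make the traceability and interchangeability points concrete, the sketch below builds an explanation from the same terms used for scoring and then rescores after a human overrides one label. It assumes a linear evidence sum with sorted thresholds and scores starting at 0. The helper names, component texts, weights, and thresholds are hypothetical, not taken from AnalyticScore.

```python
from bisect import bisect_right


def explain(components, labels, weights, thresholds):
    """Return the score together with the exact terms that produced it."""
    rows = [
        {"component": c, "label": f, "weight": w, "contribution": w * f}
        for c, f, w in zip(components, labels, weights)
    ]
    # The explanation and the score share one computation, so they cannot diverge.
    eta = sum(row["contribution"] for row in rows)
    return {"rows": rows, "eta": eta, "score": bisect_right(thresholds, eta)}


def override_label(labels, index, new_label):
    """Return a copy of `labels` with one human-corrected entry."""
    if new_label not in (0, 1, 2):
        raise ValueError("label must be 0, 1, or 2")
    updated = list(labels)
    updated[index] = new_label
    return updated


# Hypothetical components and parameters for a single item.
components = ["pandas and koalas eat plants", "pythons are carnivorous"]
weights = [1.5, 1.0]
thresholds = [1.0, 3.0]

machine_labels = [1, 0]
before = explain(components, machine_labels, weights, thresholds)

# A human reviewer judges that component 1 was directly paraphrased.
human_labels = override_label(machine_labels, 1, 2)
after = explain(components, human_labels, weights, thresholds)

print(before["eta"], before["score"])  # 1.5 1
print(after["eta"], after["score"])    # 3.5 2
```

Only the label list changes between the two calls; the weights and thresholds are reused as they are, which is what makes the intervention possible without retraining.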

AnalyticScore achieved scoring accuracy within Δ=0.06 QWK of state-of-the-art uninterpretable models, while outperforming various black-box baselines such as AutoSAS, AsRRN, and NAM (Kim et al., 21 Nov 2025).
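For readers unfamiliar with the metric, quadratic weighted kappa can be computed as one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement, with weights $(i - j)^2 / (K - 1)^2$ over $K$ score categories. The following is a minimal pure-Python sketch of that standard definition, not the evaluation code used in the cited study. It assumes integer scores in the range 0 to K-1, and the function name is hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted kappa for two lists of integer scores in [0, K-1]."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters need the same non-zero number of scores")
    k = num_categories
    if k < 2:
        raise ValueError("need at least two score categories")

    # Observed confusion matrix: observed[i][j] counts (a=i, b=j) pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_a[i] * hist_b[j] / n

    if denominator == 0.0:
        # Both raters gave one identical constant score; returning 1.0
        # here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], 3))  # 1.0
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement.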

4. FGTI Principles in Tabular Summary

| Principle | Definition (per Kim et al., 21 Nov 2025) | AnalyticScore Instantiation |
|---|---|---|
| Faithfulness | Explanations reflect actual computational mechanism | Replays weighted sum and threshold logic verbatim |
| Groundedness | Features are human-understandable, explicitly identifiable elements | Uses analytic components & one-hot paraphrase labels |
| Traceability | Model is decomposable into subroutines on clear, specified input/output | Three explicit function-call phases |
| Interchangeability | Humans can intervene at any reasoning subroutine | Modularity enables arbitrary input/output swaps |

These principles are mutually reinforcing: grounded features enable traceable subroutines, traceability enables faithful explanations, and interchangeability depends on transparent, modular system design.

5. Stakeholder Implications and Practical Relevance

FGTI directly addresses a core impediment to trust in automated educational assessment—namely, the inability to challenge, verify, or improve opaque predictions. By ensuring that every step is grounded, transparent, and modifiable, FGTI simultaneously delivers:

  • Inspectable, challengeable, and correctable explanations, fulfilling the needs of test-takers, item developers, and policymakers.
  • Rapid iteration: Interchangeability allows near-instant changes to components or thresholds, avoiding time-intensive retraining cycles associated with black-box models.
  • Empirical robustness: Performance near the state of the art without reliance on uninterpretable methods demonstrates that rigorous interpretability does not necessitate accuracy sacrifice.

6. Cohesion and Blueprint Character of the FGTI Framework

While each principle individually targets a distinct aspect of interpretability, their interdependence ensures comprehensive transparency. Groundedness supplies the basic units of reasoning; traceability organizes and exposes each transformation; faithfulness ensures that explanations cannot diverge from real model logic; interchangeability empowers domain experts to intervene meaningfully. AnalyticScore’s implementation illustrates how FGTI serves as a blueprint for constructing automated scorers that are simultaneously accurate, transparent, and responsive to stakeholder correction, establishing a reference standard for future research in interpretable AI-driven assessment (Kim et al., 21 Nov 2025).
