Supported Faithfulness Score (SFS)
- Supported Faithfulness Score (SFS) is a metric that measures the mean support mass of atomic claims in generated outputs by verifying semantic entailment against reference evidence.
- It employs atomic claim decomposition via LLMs along with SBERT similarity and NLI classifiers to assess grounding in tasks like summarization and multi-step reasoning.
- Empirical evaluations show that SFS effectively distinguishes justification fidelity from answer correctness, highlighting its utility in faithfulness auditing and protocol assessment.
Supported Faithfulness Score (SFS) is a claim-level evaluation metric that quantifies the proportion of an output (such as a reasoning trace or summary) that is semantically supported and entailed by a reference evidence set. SFS measures the degree to which generated content can be grounded in provided sources, separating answer correctness from the faithfulness of the justification; condition rankings under SFS are reported to be invariant to the choice of decomposer. Recent research uses SFS to analyze LLM reasoning, debate protocols, and summarization, revealing sharp degradation of grounding during multi-step or multi-agent reasoning (Mittal et al., 2023; Shin, 3 May 2026).
1. Formal Definition and Computational Procedure
SFS is defined as the mean support mass over all atomic claims decomposed from a model output. Given a reasoning trace or summary $T$ and an evidence set $E$, the procedure is as follows:
- Atomic Claim Decomposition: An LLM-based "decomposer" $D$, such as GPT-4o or Claude-3.5, divides $T$ into atomic claims $c_1, \dots, c_n$, ensuring each claim is a minimal, independently checkable subject-predicate proposition.
- Evidence Verification: For each claim $c_i$, compute its support $s(c_i) \in [0, 1]$ against the passages $e \in E$ from two signals:
  - $\mathrm{sim}(c_i, e)$ is the cosine similarity between SBERT embeddings of $c_i$ and $e$.
  - $\mathrm{NLI}(c_i, e)$ is the entailment verdict from a DeBERTa-v3-large NLI model (1 if $e$ is judged to entail $c_i$, else 0).
- Score Aggregation: The Supported Faithfulness Score for $T$ is the mean over all claims:
$$\mathrm{SFS}(T) = \frac{1}{n} \sum_{i=1}^{n} s(c_i)$$
Values range from 0 (no claims supported) to 1 (all claims fully supported) (Shin, 3 May 2026).
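The claim-level procedure can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the way similarity and the entailment verdict are combined (similarity gated by entailment, maximized over passages) is an assumption, and `toy_sim` and `toy_entails` are hypothetical stand-ins for an SBERT encoder and an NLI classifier.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (not from the source): in a real pipeline `sim`
# would wrap an SBERT encoder and `entails` an NLI classifier.
SimFn = Callable[[str, str], float]    # similarity, ideally in [0, 1]
EntailFn = Callable[[str, str], bool]  # True if the passage entails the claim


def claim_support(claim: str, evidence: Sequence[str],
                  sim: SimFn, entails: EntailFn) -> float:
    """Support of one claim against all evidence passages.

    ASSUMPTION: similarity is gated by the entailment verdict and the best
    passage wins (max over passages of sim * verdict), clipped to [0, 1].
    """
    best = 0.0
    for passage in evidence:
        if entails(claim, passage):
            best = max(best, min(1.0, sim(claim, passage)))
    return best


def sfs(claims: Sequence[str], evidence: Sequence[str],
        sim: SimFn, entails: EntailFn) -> float:
    """Mean support over all atomic claims; 0.0 when there are no claims."""
    if not claims:
        return 0.0
    total = sum(claim_support(c, evidence, sim, entails) for c in claims)
    return total / len(claims)


# Toy stand-ins so the sketch runs without any model downloads.
def toy_sim(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def toy_entails(claim: str, passage: str) -> bool:
    """Crude proxy: every claim token appears in the passage."""
    return set(claim.lower().split()) <= set(passage.lower().split())


evidence = ["aspirin reduces fever in adults"]
claims = ["aspirin reduces fever", "aspirin cures cancer"]
print(round(sfs(claims, evidence, toy_sim, toy_entails), 3))  # 0.3
```

In this toy run the first claim is entailed with similarity 0.6 and the second is unsupported, so the mean is 0.3. Swapping in real models only requires replacing the two callables.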
An alternative, token-level SFS is defined as the proportion of tokens in a claim supported by the source, computed via the length of the Longest Supported Subsequence (LSS) divided by total claim length (Mittal et al., 2023).
2. Algorithmic Approaches and Model Implementations
There are two principal methodologies for computing SFS:
- Atomic Claim + NLI Approach:
  - Atomic decomposition is performed by prompting an LLM to extract minimal, checkable units from the output.
  - Each claim is matched against all passages in the evidence set $E$ using SBERT similarity, and entailment is verified via an NLI classifier.
- LSS-based Sequence Extraction:
  - For tasks such as summarization, the LSS method computes the longest non-contiguous subsequence of the claim that is supported by the evidence/context.
  - Dynamic programming identifies supported tokens using a precomputed entailment matrix.
  - The normalized SFS is then $\mathrm{SFS}(c, r) = |\mathrm{LSS}(c, r)| / |c|$, where $c$ is the claim and $r$ the reference (Mittal et al., 2023).
Model implementations include fine-tuned T5 (3B parameter) sequence-to-sequence models for LSS extraction and combinations of SBERT and large-scale NLI models for claim verification.
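The dynamic-programming step of the LSS approach can be illustrated with a small sketch. It assumes a boolean token-by-token support matrix and an order-preserving match (as in longest common subsequence); both are simplifying assumptions made here, and the default exact-match rule is a hypothetical stand-in for a learned entailment signal, not the fine-tuned T5 extractor.

```python
from typing import Callable, Sequence


def lss_length(supported: Sequence[Sequence[bool]]) -> int:
    """Length of the longest order-preserving supported subsequence.

    supported[i][j] is True when claim token i is judged supported by
    reference token j (the precomputed matrix).
    ASSUMPTION: matched tokens must keep their order in both sequences.
    """
    n = len(supported)
    m = len(supported[0]) if n else 0
    # dp[i][j]: best length using the first i claim and first j reference tokens
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if supported[i - 1][j - 1]:
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return dp[n][m]


def token_sfs(claim: Sequence[str], reference: Sequence[str],
              supports: Callable[[str, str], bool] = lambda a, b: a == b) -> float:
    """Token-level SFS: |LSS| / |claim|, or 0.0 for an empty claim."""
    if not claim:
        return 0.0
    matrix = [[supports(c, r) for r in reference] for c in claim]
    return lss_length(matrix) / len(claim)


claim = "the cat sat on the red mat".split()
reference = "the cat sat on the mat".split()
print(round(token_sfs(claim, reference), 3))  # 0.857 (6 of 7 tokens supported)
```

Here only the token "red" has no support in the reference, so six of the seven claim tokens count toward the score.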
3. Theoretical Properties and Guarantees
SFS is explicitly designed to satisfy several important axiomatic desiderata:
- Decomposer Invariance: SFS rankings across experimental conditions are invariant to the decomposer LLM choice, with a perfect Spearman correlation ($\rho = 1.0$) observed between GPT-4o and Claude-3.5 (Shin, 3 May 2026).
- Evidence Sensitivity: Modifying the evidence set $E$ alters SFS in the expected direction.
- Support-Mass Monotonicity: Adding entailed claims adds support mass; adding unsupported ('fabricated') claims adds none, so fabrication can never raise SFS.
- Fabrication Penalty: Outputs containing only unsupported claims score zero.
- Granularity: SFS is sensitive to surface-form drift in justification, often detecting >0.05 changes even when answer accuracy is unchanged (70% of pairs).
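- Finite-Sample Concentration: Given $n$ atomic claims, for any $\epsilon > 0$, a concentration bound limits the probability that the empirical SFS deviates from its expected value by more than $\epsilon$. With sufficiently large $n$, 0.10-point differences are highly reliable.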
SFS operationalizes the idea of evidence-grounded faithfulness, tracking the proportion of surface-level claims in the output $T$ justified by the evidence set $E$, and cleanly distinguishing between answer selection and the provenance of reasoning.
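To give a feel for the finite-sample concentration property, the sketch below uses Hoeffding's inequality. This is a standard bound chosen here for illustration and is not necessarily the one stated in the source; it assumes per-claim supports are independent and lie in $[0, 1]$. The function names are hypothetical.

```python
import math


def hoeffding_bound(n: int, eps: float) -> float:
    """Two-sided Hoeffding bound on P(|mean - expectation| >= eps).

    ASSUMPTION: n independent per-claim supports, each in [0, 1].
    The bound 2 * exp(-2 * n * eps^2) is capped at 1.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def claims_needed(eps: float, delta: float) -> int:
    """Smallest n for which the Hoeffding bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


print(claims_needed(0.1, 0.05))  # 185
print(hoeffding_bound(185, 0.1) <= 0.05)  # True
```

Under these assumptions, roughly 185 claims suffice for the empirical mean to lie within 0.10 of its expectation with probability at least 0.95. Real claims from the same output are unlikely to be fully independent, so the figure is best read as an optimistic guide.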
4. Empirical Evaluation and Benchmark Results
SFS has been empirically validated across multiple benchmarks and reasoning paradigms. Notable findings include:
| Condition/Protocol | Accuracy | SFS (Mean) | Relative SFS Change |
|---|---|---|---|
| SciFact Zero-shot (C1) | 0.588 | 0.349 | — |
| SocraSynth Debate (C4) | 0.481 | 0.213 | –39 % |
| DebateCV (C13/Reasoning Trap) | 0.517 | 0.200 | –43 % |
| Majority-vote MAD (C15) | 0.536 | 0.006 | –98 % |
| EGSR Recovery (C8) | 0.482 | 0.343 | –2 % (recovers ≈98 % of the C1 baseline) |
Debate-style and closed-system multi-step reasoning protocols induce systematic SFS collapse despite relatively preserved answer accuracy, empirically verifying the "Reasoning Trap." The evidence-grounded Socratic Reasoning protocol (EGSR) restores SFS to near-baseline levels. These results hold across models (e.g., GPT-4o, Claude-3.5) and datasets (SciFact, FEVER), and are supported by statistical tests (e.g., Wilcoxon) (Shin, 3 May 2026).
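The relative changes against the zero-shot baseline can be checked with simple arithmetic. The snippet below copies the SFS means from the table and measures every condition against C1; the helper name `relative_change` is introduced here purely for illustration.

```python
# SFS means copied from the table; C1 (zero-shot) is the baseline.
sfs_mean = {"C1": 0.349, "C4": 0.213, "C13": 0.200, "C15": 0.006, "C8": 0.343}


def relative_change(value: float, baseline: float) -> float:
    """Signed change of `value` relative to `baseline`, as a fraction."""
    return (value - baseline) / baseline


for condition, value in sfs_mean.items():
    change = relative_change(value, sfs_mean["C1"])
    print(f"{condition}: {change:+.1%}")
```

Measured this way, the debate conditions lose roughly 39 % to 98 % of the baseline SFS, while EGSR (C8) stays within about 2 % of it.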
In summarization tasks (e.g., XSum), LSS-based scoring correlates more strongly with human faithfulness ratings than QuestEval does (BLEU on LSS yields ρ ≈ 0.49 vs. QuestEval ρ ≈ 0.30). ChatGPT summaries achieve SFS ≈ 0.98–0.99, while weaker LLMs (LLaMA-7B, Vicuna-13B) score SFS ≈ 0.15–0.30 (Mittal et al., 2023).
5. Interpretation and Use of SFS Values
Interpretation of SFS values aligns with the intended grounding property:
- SFS ≈ 1.0: Nearly all atomic claims are both semantically similar to and entailed by some evidence passage. Indicates maximal faithfulness.
- SFS ≈ 0: Output contains no verifiable, evidence-grounded content; the reasoning trace is fully unanchored.
- Intermediate SFS (e.g., 0.2–0.4): Partial grounding; a sizable fraction of claims is unsubstantiated.
- Decreasing SFS: Indicates progressive justification drift in multi-step outputs, even if the final answer is correct.
SFS distinguishes justification chain fidelity from answer correctness, a distinction critical in risk-sensitive and high-stakes applications.
6. Limitations and Open Questions
Several limitations of SFS remain:
- Decomposer Dependence: The atomic claim extractor influences absolute SFS magnitudes, although experimental rankings are robust.
- Evidence Quality: SFS is contingent on evidence coverage and quality; with adversarial or irrelevant evidence, a high SFS does not imply answer correctness.
- Knowledge Gaps: SFS only measures faithfulness to the evidence set $E$, not correctness with respect to ground truth outside $E$.
- Generality: Wikipedia-based LSS models may not generalize to other domains without further adaptation (Mittal et al., 2023; Shin, 3 May 2026).
Open research questions include SFS robustness under richer or noisier evidence, more advanced decomposers, and extensions to non-fact-verification domains.
7. Practical Applications and Guidance
SFS can be directly integrated into model evaluation and training pipelines:
- Faithfulness Auditing: Quantifies justification reliability in LLM outputs in scientific fact verification, summarization, and generative QA.
- Reranking/Decoding: SFS can serve as a reranking metric or as a reward signal for reinforcement-learning fine-tuning that penalizes hallucinations.
- Protocol Evaluation: SFS enables discrimination between protocols (Socratic inquiry vs. debate) in preserving grounding over multiple reasoning steps.
For practical use, fine-tuned T5 models for LSS generation or SBERT+NLI pipelines for atomic claim scoring provide off-the-shelf implementations (Mittal et al., 2023; Shin, 3 May 2026). Statistical concentration guarantees enable reliable detection of faithfulness gaps in large-scale evaluations.
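As an illustration of the reranking use case, the sketch below orders candidate outputs by a faithfulness score. The function `rerank_by_faithfulness` and the word-overlap scorer `toy_score` are hypothetical; in practice the scorer would be an SFS implementation evaluated against the evidence set.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_by_faithfulness(candidates: Sequence[str],
                           score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from most to least faithful.

    Python's sort is stable, so tied candidates keep their original order.
    """
    scored = [(candidate, score_fn(candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy scorer (an assumption for this sketch): share of words found in the evidence.
evidence_words = set("aspirin reduces fever in adults".split())


def toy_score(text: str) -> float:
    words = text.split()
    return sum(w in evidence_words for w in words) / len(words) if words else 0.0


candidates = ["aspirin cures cancer", "aspirin reduces fever"]
ranked = rerank_by_faithfulness(candidates, toy_score)
print(ranked[0][0])  # aspirin reduces fever
```

Because the scorer is passed in as a callable, the same wrapper works with either the claim-level or the token-level variant of SFS.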