Supported Faithfulness Score (SFS)

Updated 11 May 2026
  • Supported Faithfulness Score (SFS) is a metric that measures the mean support mass of atomic claims in generated outputs by verifying semantic entailment against reference evidence.
  • It employs atomic claim decomposition via LLMs along with SBERT similarity and NLI classifiers to assess grounding in tasks like summarization and multi-step reasoning.
  • Empirical evaluations show that SFS effectively distinguishes justification fidelity from answer correctness, highlighting its utility in faithfulness auditing and protocol assessment.

Supported Faithfulness Score (SFS) is a claim-level evaluation metric that quantifies the proportion of an output (such as a reasoning trace or summary) that is semantically supported and entailed by a reference evidence set. SFS measures the degree to which generated content can be grounded in provided sources, and it separates answer correctness from the faithfulness of the justification. Its rankings of experimental conditions are reported to be robust to the choice of decomposer. Recent research uses SFS to analyze LLM reasoning, debate protocols, and summarization, revealing sharp degradation of grounding during multi-step or multi-agent reasoning (Mittal et al., 2023; Shin, 3 May 2026).

1. Formal Definition and Computational Procedure

SFS is defined as the mean support mass over all atomic claims decomposed from a model output. Given a reasoning trace or summary $O$ and an evidence set $E$, the procedure is as follows:

  1. Atomic Claim Decomposition: An LLM-based "decomposer" $\phi$, such as GPT-4o or Claude-3.5, divides $O$ into $N$ atomic claims $\{c_1,\ldots,c_N\}$, ensuring each claim is a minimal, independently checkable subject-predicate proposition.
  2. Evidence Verification: For each claim $c_i$, compute its support:

$$s_i = \max_{e\in E}\left[\mathrm{sim}(c_i, e)\cdot \mathrm{verified}(c_i, e)\right]$$

  • $\mathrm{sim}(c_i, e)$ is the cosine similarity between SBERT embeddings of $c_i$ and $e$.
  • $\mathrm{verified}(c_i, e)$ is the entailment verdict from a DeBERTa-v3-large NLI model (1 if the model judges that $e$ entails $c_i$, else 0).
  3. Score Aggregation: The Supported Faithfulness Score for $O$ is the mean over all claims:

$$\mathrm{SFS}(O) = \frac{1}{N}\sum_{i=1}^{N} s_i$$

Values range from 0 (no claims supported) to 1 (all claims fully supported) (Shin, 3 May 2026).
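The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the `sim` and `verified` arguments are pluggable callables, and `toy_sim` (token Jaccard overlap) and `toy_verified` (token containment) are hypothetical stand-ins for SBERT cosine similarity and the NLI verdict. Returning 0.0 for an output with no claims is also an assumed convention.

```python
from typing import Callable, Sequence

Sim = Callable[[str, str], float]      # similarity in [0, 1]
Verified = Callable[[str, str], int]   # 1 if passage entails claim, else 0


def support(claim: str, evidence: Sequence[str], sim: Sim, verified: Verified) -> float:
    """s_i = max over evidence passages of sim(c_i, e) * verified(c_i, e)."""
    return max((sim(claim, e) * verified(claim, e) for e in evidence), default=0.0)


def sfs(claims: Sequence[str], evidence: Sequence[str], sim: Sim, verified: Verified) -> float:
    """Mean support over all atomic claims (0.0 for an empty claim list, by assumption)."""
    if not claims:
        return 0.0
    return sum(support(c, evidence, sim, verified) for c in claims) / len(claims)


def toy_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for SBERT cosine similarity: token Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def toy_verified(claim: str, passage: str) -> int:
    """Hypothetical stand-in for an NLI verdict: every claim token occurs in the passage."""
    return 1 if set(claim.lower().split()) <= set(passage.lower().split()) else 0


evidence = ["aspirin reduces fever in adults", "the trial enrolled 200 patients"]
claims = ["aspirin reduces fever", "aspirin cures cancer"]
print(sfs(claims, evidence, toy_sim, toy_verified))
```

In a real pipeline the two toy functions would be replaced by an embedding model and an NLI classifier; the aggregation logic stays the same.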

An alternative, token-level SFS is defined as the proportion of tokens in a claim supported by the source, computed via the length of the Longest Supported Subsequence (LSS) divided by total claim length (Mittal et al., 2023).

2. Algorithmic Approaches and Model Implementations

There are two principal methodologies for computing SFS:

  • Atomic Claim + NLI Approach:
    • Atomic decomposition is performed by prompting an LLM to extract minimal, checkable units from the output.
    • Each claim is matched against all passages in $E$ using SBERT similarity, and entailment is verified via an NLI classifier.
  • LSS-based Sequence Extraction:
    • For tasks such as summarization, the LSS method computes the longest non-contiguous subsequence of the claim that is supported by the evidence/context.
    • Dynamic programming identifies supported tokens using a precomputed entailment matrix.
    • The normalized SFS is then $|\mathrm{LSS}(c, r)| / |c|$, where $c$ is the claim and $r$ the reference (Mittal et al., 2023).

Model implementations include fine-tuned T5 (3B parameter) sequence-to-sequence models for LSS extraction and combinations of SBERT and large-scale NLI models for claim verification.
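The LSS-based variant can be illustrated with a classic dynamic program. The sketch below assumes that "supported" means an order-preserving alignment between claim tokens and reference tokens, which makes the problem a longest-common-subsequence computation over a precomputed boolean matrix. The matrix here is built by exact token equality, a hypothetical stand-in for the entailment matrix; the source describes a fine-tuned T5 model for the actual extraction.

```python
from typing import List, Sequence


def lss_length(match: Sequence[Sequence[bool]]) -> int:
    """Length of the longest order-preserving supported subsequence.

    match[i][j] is True when claim token i is supported by reference token j.
    """
    n = len(match)
    m = len(match[0]) if n else 0
    dp: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if match[i - 1][j - 1]:
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return dp[n][m]


def token_sfs(claim_tokens: Sequence[str], ref_tokens: Sequence[str]) -> float:
    """Token-level SFS: |LSS| / |claim|, using exact token equality as the support test."""
    if not claim_tokens:
        return 0.0  # assumed convention for an empty claim
    match = [[c == r for r in ref_tokens] for c in claim_tokens]
    return lss_length(match) / len(claim_tokens)


claim = "the cat sat on the red mat".split()
reference = "the cat sat on the mat".split()
print(token_sfs(claim, reference))  # 6 of 7 claim tokens are supported
```

Swapping the equality test for a learned token-level entailment score would change only how `match` is filled in.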

3. Theoretical Properties and Guarantees

SFS is explicitly designed to satisfy several important axiomatic desiderata:

  • Decomposer Invariance: SFS rankings across experimental conditions are invariant to the decomposer LLM choice, with perfect Spearman $\rho = 1$ observed between GPT-4o and Claude-3.5 (Shin, 3 May 2026).
  • Evidence Sensitivity: Modifying the evidence set $E$ alters SFS in the expected direction.
  • Support-Mass Monotonicity: Adding entailed claims increases the total support mass $\sum_i s_i$; unsupported ('fabricated') claims contribute zero mass and therefore cannot raise the score.
  • Fabrication Penalty: Outputs containing only unsupported claims score zero.
  • Granularity: SFS is sensitive to surface-form drift in justification, often detecting >0.05 changes even when answer accuracy is unchanged (70% of pairs).
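  • Finite-Sample Concentration: Given $N$ atomic claims, for any $\epsilon > 0$, the probability that SFS deviates from its expected value by more than $\epsilon$ is bounded by a quantity that shrinks as $N$ grows. With a sufficiently large number of claims, 0.10-point differences are highly reliable.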
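To make the finite-sample property concrete, the sketch below evaluates a standard Hoeffding bound for the mean of $N$ values in $[0, 1]$, namely $P(|\mathrm{SFS} - \mathbb{E}[\mathrm{SFS}]| \ge \epsilon) \le 2\exp(-2N\epsilon^2)$. Two assumptions are made here and are not confirmed by the source: that the per-claim supports $s_i$ can be treated as independent, and that the source's guarantee has this Hoeffding form. The function names are illustrative.

```python
import math


def hoeffding_tail(n_claims: int, eps: float) -> float:
    """Upper bound on P(|mean - expectation| >= eps) for n independent values in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n_claims * eps ** 2))


def claims_needed(eps: float, delta: float) -> int:
    """Smallest n for which the Hoeffding bound drops to delta or below."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Number of claims needed so that a 0.10 deviation has probability at most 5%
# under the assumed bound.
n = claims_needed(eps=0.10, delta=0.05)
print(n, hoeffding_tail(n, 0.10))
```

Under these assumptions, a few hundred atomic claims per condition are enough to make a 0.10 deviation unlikely; real claims within one output are often correlated, so the bound should be read as optimistic.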

SFS operationalizes the idea of evidence-grounded faithfulness, tracking the proportion of surface-level claims in $O$ justified by $E$, and cleanly distinguishing between answer selection and the provenance of reasoning.

4. Empirical Evaluation and Benchmark Results

SFS has been empirically validated across multiple benchmarks and reasoning paradigms. Notable findings include:

| Condition/Protocol | Accuracy | SFS (Mean) | Relative SFS Change |
|---|---|---|---|
| SciFact Zero-shot (C1) | 0.588 | 0.349 | baseline |
| SocraSynth Debate (C4) | 0.481 | 0.213 | −39% |
| DebateCV (C13/Reasoning Trap) | 0.517 | 0.200 | −43% |
| Majority-vote MAD (C15) | 0.536 | 0.006 | −98% |
| EGSR Recovery (C8) | 0.482 | 0.343 | +98% (vs. C4/C13) |

Debate-style and closed-system multi-step reasoning protocols induce systematic SFS collapse despite relatively preserved answer accuracy, empirically verifying the "Reasoning Trap." The Evidence-Grounded Socratic Reasoning (EGSR) protocol restores SFS to near-baseline levels. These results hold across models (e.g., GPT-4o, Claude-3.5) and datasets (SciFact, FEVER), and are robust to statistical tests (e.g., Wilcoxon tests) (Shin, 3 May 2026).
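The relative changes in the table can be checked with simple arithmetic on the reported mean SFS values. The sketch below reproduces the three reductions as changes relative to the C1 baseline. For EGSR, the numbers are consistent with C8 reaching about 98% of the C1 mean SFS; that reading is an inference from the reported values, not something the source states.

```python
# Mean SFS per condition, copied from the table above.
sfs_by_condition = {"C1": 0.349, "C4": 0.213, "C13": 0.200, "C15": 0.006, "C8": 0.343}


def relative_change(value: float, baseline: float) -> float:
    """Signed change of `value` relative to `baseline`, as a fraction."""
    return (value - baseline) / baseline


baseline = sfs_by_condition["C1"]
for name in ("C4", "C13", "C15"):
    pct = 100 * relative_change(sfs_by_condition[name], baseline)
    print(f"{name}: {pct:+.0f}% vs. C1")

# Share of the baseline SFS retained by EGSR (C8).
print(f"C8 retains {100 * sfs_by_condition['C8'] / baseline:.0f}% of C1")
```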

In summarization tasks (e.g., XSum), LSS-based SFS correlates more strongly with human faithfulness ratings than QuestEval does (BLEU on LSS yields ρ ≈ 0.49 vs. QuestEval ρ ≈ 0.30). ChatGPT summaries achieve SFS ≈ 0.98–0.99, while weaker LLMs (LLaMA-7B, Vicuna-13B) score SFS ≈ 0.15–0.30 (Mittal et al., 2023).

5. Interpretation and Use of SFS Values

Interpretation of SFS values aligns with the intended grounding property:

  • SFS ≈ 1.0: Nearly all atomic claims are both semantically similar to and entailed by some evidence passage. Indicates maximal faithfulness.
  • SFS ≈ 0: Output contains no verifiable, evidence-grounded content; the reasoning trace is fully unanchored.
  • Intermediate SFS (e.g., 0.2–0.4): Partial grounding; a sizable fraction of claims is unsubstantiated.
  • Decreasing SFS: Indicates progressive justification drift in multi-step outputs, even if the final answer is correct.

SFS distinguishes justification chain fidelity from answer correctness, a distinction critical in risk-sensitive and high-stakes applications.

6. Limitations and Open Questions

Several limitations of SFS remain:

  • Decomposer Dependence: The atomic claim extractor influences absolute SFS magnitudes, although experimental rankings are robust.
  • Evidence Quality: SFS is contingent on evidence coverage and quality; with adversarial or irrelevant evidence, a high SFS does not imply a correct answer.
  • Knowledge Gaps: SFS only measures faithfulness to $E$, not correctness with respect to ground truth outside $E$.
  • Generality: Wikipedia-based LSS models may not generalize to other domains without further adaptation (Mittal et al., 2023, Shin, 3 May 2026).

Open research questions include SFS robustness under richer or noisier evidence, more advanced decomposers, and extensions to non-fact-verification domains.

7. Practical Applications and Guidance

SFS can be directly integrated into model evaluation and training pipelines:

  • Faithfulness Auditing: Quantifies justification reliability in LLM outputs in scientific fact verification, summarization, and generative QA.
  • Reranking/Decoding: SFS can serve as a reranking metric or as a reward signal for reinforcement learning fine-tuning to penalize hallucinations.
  • Protocol Evaluation: SFS enables discrimination between protocols (Socratic inquiry vs. debate) in preserving grounding over multiple reasoning steps.
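As a small illustration of the reranking use case, the sketch below orders candidate outputs by a supplied SFS scoring function and keeps the best one. The function name `rerank_by_sfs` and the precomputed scores are hypothetical; in practice the scorer would decompose each candidate into claims and score it against the evidence set.

```python
from typing import Callable, List, Sequence


def rerank_by_sfs(candidates: Sequence[str], score: Callable[[str], float]) -> List[str]:
    """Return candidates sorted from highest to lowest SFS.

    Python's sort is stable, so candidates with equal scores keep their input order.
    """
    return sorted(candidates, key=score, reverse=True)


# Hypothetical precomputed SFS values for three candidate summaries.
precomputed = {"summary A": 0.21, "summary B": 0.87, "summary C": 0.55}
ranked = rerank_by_sfs(list(precomputed), precomputed.__getitem__)
print(ranked[0])  # the best-grounded candidate under these scores
```

Because SFS ignores answer correctness, a reranker of this kind is usually combined with a separate accuracy or quality signal rather than used alone.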

For practical use, fine-tuned T5 models for LSS generation or SBERT+NLI pipelines for atomic claim scoring provide off-the-shelf implementations (Mittal et al., 2023, Shin, 3 May 2026). Statistical concentration guarantees enable reliable detection of faithfulness gaps in large-scale evaluations.
