Faithful@k Metric for CoT Faithfulness

Updated 30 December 2025

Faithful@k is a metric that measures the probability that among k sampled chains-of-thought, at least one both flips the answer as intended and explicitly verbalizes the influencing hint.
The metric employs diverse stochastic decoding strategies to assess latent capacities in large language models, revealing narrative incompleteness rather than true unfaithfulness under low sampling budgets.
Empirical evaluations across models like Llama-3 and Gemma-3 demonstrate that increased sampling (higher k) significantly boosts faithful@k scores, offering actionable insights for multi-hop reasoning assessments.

The faithful@k metric quantifies the probability that, among $k$ chains-of-thought (CoTs) sampled from a model responding to a prompt with an injected hint, at least one CoT both flips the model’s answer as intended and verbalizes the influencing hint. Introduced to distinguish genuine reasoning unfaithfulness from the incompleteness intrinsic to natural-language traces, faithful@k exposes the model’s latent capacity to verbalize decision-influencing cues when provided with an expanded sampling budget. This addresses shortcomings in legacy metrics—such as Biasing Features—that conflate lossy narrative compression with unfaithfulness, and enables a rigorous evaluation of model interpretability in multi-hop reasoning contexts (Zaman et al., 28 Dec 2025).

1. Formal Specification and Mathematical Definition

faithful@k operates in the hint-based evaluation setting, where a model’s output is compared under a default prompt and an @@@@1@@@@ prompt containing an explicit cue (e.g., “A Stanford professor thinks the answer is B”). For each test example, CoTs are generated from the “hinted” model. Among those samples resulting in a label flip to the hint label $L_h$ , the subset that explicitly verbalizes the hint is counted. faithful@k is formally the probability that at least one in $k$ independently drawn CoTs both yields $L_h$ and verbalizes the hint.

Let $n$ denote the number of “answer-flipped” samples and $c$ the number of those deemed “faithful” via hint verbalization. The faithful@k estimator mirrors pass@k:

$\text{faithful@k} = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}$

This computes, for each example, the likelihood that none among $k$ samples is faithful, and subtracts from unity. Aggregation over all relevant examples yields a dataset-level faithful@k.

2. Role and Interpretation of the k Parameter

The variable $k$ in faithful@k denotes the number of independent CoT samples drawn per example, achieved with stochastic decoding algorithms like nucleus sampling or top-k/top-p strategies (e.g., temperature $0.6$ for Llama-3). Unlike traditional greedy decoding, which produces a single chain, faithful@k leverages extra inference-time budget for diversity: either longer individual traces or multiple attempts of fixed length. In practice, $k$ is varied over $\{1, 2, 4, 8, 16\}$ , allowing researchers to interrogate how model expressivity changes with increased sampling. k reflects the operational cost of interpretability: higher k requires greater computation and reveals the latent space of hint-verbalizing traces.

3. Computational Protocol

To compute faithful@k, the following process is used:

Sample Generation: For each example and hint type, generate up to $n_{\text{max}}=128$ CoT samples using stochastic decoding.
Label Flipping: Identify those samples for which the model’s answer changes to $L_h$ . Set $n$ as the count of qualifying samples. Discard examples with $n < k$ .
Hint Verbalization Judgement: Employ an LLM-as-judge (e.g., gpt-oss-20b via DSPy) to mark which CoTs explicitly verbalize the hint— $c$ is the count.
Metric Calculation: For each eligible example, compute faithful@k using the pass@k formula.
Aggregation: Average faithful@k across all examples and report bootstrapped confidence intervals to quantify uncertainty.

4. Empirical Observations Across Models and Tasks

Experiments across Llama-3 (2B and 8B) and Gemma-3 (4B) on multi-hop QA tasks (ARC-Easy, OpenbookQA, StrategyQA) reveal distinct behaviors under various hint regimes:

Natural-Language Hints ("Professor"): For $k=1$ (greedy decoding), faithful@k is low (typically $<0.2$ –$0.3$). As $k$ reaches $16$, Gemma-3 4B attains $\sim0.9$ , Llama-3 2B $\sim0.45$ , and Llama-3 8B $\sim0.4$ . This pattern indicates that many CoTs capable of verbalizing the explicit cue only emerge with increased sampling, implying observed unfaithfulness at $k=1$ largely reflects narrative incompleteness rather than misalignment.
Non-Verbal/Schematic Hints ("Metadata," "Black Squares"): faithful@k remains static ( $<0.2$ –$0.3$) regardless of $k$ . This suggests the model does not incorporate non-verbal cues into its CoT narrative, even with an expanded sampling budget.
Task Breakdown: Consistent behaviors are observed across prompts and architectures, with “Professor” hints yielding steep faithful@k curves and other cues remaining flat.

A plausible implication is that tokens and sampling budget are critical determinants in observed faithfulness; models typically compress reasoning under resource constraints, omitting even influential cues unless allowed to sample more extensively.

5. Best Practices and Constraints in faithful@k Deployment

faithful@k is intended not as a standalone diagnostic, but as an adjunct to broader faithfulness toolkits (e.g., Filler Tokens, FUR, causal mediation analysis). Key considerations include:

Requirements for substantial per-example sampling (n=128 in experiments) and automated hint-verbalization detection pipelines (LLM-as-judge).
Independence assumptions in sampling; metric reliability depends on avoiding prefix recycling and strong repetition penalties.
Exclusion of examples with $n < k$ may bias toward easier cases unless $k$ is set conservatively.
faithful@k reports the potential for hint-verbalization, not the norm—the existence of faithful samples among $k$ does not guarantee comprehensive articulation of all reasoning factors.

The metric excels at separating cases where a model can, but may not always, verbalize a crucial cue from those of genuine faithfulness failure. Incompleteness (lossy compression) emerges as a central barrier to CoT interpretability under tight budgets.

6. Relationship to Wider Interpretability Frameworks

faithful@k exposes limitations in earlier hint-evaluation techniques (e.g., Biasing Features), which conflate narrative omission with model disconnect. Complementary metrics (causal mediation, corruption-based, Filler Tokens, FUR) are advocated to form a comprehensive interpretability assessment suite. faithfulness should be analyzed as a spectrum, ranging from capability (as revealed by faithful@k) to completeness (how often cues are expressed in practice).

By dissecting the gap between model-internal reasoning and natural-language output, faithful@k helps disentangle compressive effects of token constraints from true causal irrelevance. This informs interpretability methodology, experimental design, and the evaluation of LLMs on reasoning-intensive tasks.

7. Significance and Outlook

As model architectures and decoding strategies evolve, faithful@k provides a robust framework for probing the circumstances under which LLMs verbalize decision-critical information. Its flexibility accommodates variable inference-time budgets and exposes latent faithfulness otherwise masked by tight decoding policies. While computationally demanding, the metric’s nuance supports more accurate discrimination between unfaithfulness and incompleteness, guiding both research and practical deployment of interpretable LLM reasoning (Zaman et al., 28 Dec 2025).

Markdown Report Issue Upgrade to Chat

References (1)

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Faithful@k Metric.