The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

Published 11 May 2026 in cs.LG, cs.AI, and cs.CL | (2605.10799v1)

Abstract: Corruption studies, the primary tool for evaluating chain-of-thought (CoT) faithfulness, identify which chain positions are "computationally important" by measuring accuracy when steps are replaced with errors. We identify a systematic confound: for chains with explicit terminal answer statements, the dominant format in standard benchmarks, corruption studies detect where the answer text appears, not where computation occurs. A within-dataset format ablation provides the key evidence: on standard GSM8K chains ending with "the answer is X," removing only the answer statement, preserving all reasoning, collapses suffix sensitivity ~19x at 3B (N=300, p=0.022). Conflicting-answer experiments quantify the causal mechanism: at 7B, CC accuracy drops to near-zero (<=0.02) across five architecture families; the followed-wrong rate spans 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 at Phi-4-14B, ~0.01 at 32B). A within-stable 7B replication (9.3x attenuation, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) provides converging evidence, and the pattern replicates on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). On chains without answer suffixes the same protocol identifies the prefix as load-bearing (Delta=-0.77, p<10^-12). Generation-time probes confirm a dissociation: the answer is not early-determined during generation (early commitment <5%), yet at consumption time model outputs systematically follow the explicit answer text. The format-determination effect persists through 14B (8.5x ratio, p=0.001) and converges toward zero at 32B. We propose a three-prerequisite protocol (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.


Authors (1)

Gabriel Garcia

Summary

The paper identifies that answer placement confounds corruption-based chain-of-thought evaluations by prioritizing explicit answer text over intermediate reasoning.
Methodologically, format ablations reveal up to 19x reduction in suffix sensitivity when the explicit answer is removed, highlighting scale-dependent effects.
Implications include necessary protocol adjustments for CoT evaluations, affecting PRM design and benchmark standards to better capture genuine reasoning.

Methodological Confounds in Chain-of-Thought Corruption Studies: Answer Placement vs. Computation

Background and Problem Formulation

Chain-of-thought (CoT) prompting has become a de facto standard for elucidating step-by-step reasoning in LLMs, with corruption studies serving as a principal empirical methodology for evaluating the faithfulness of these reasoning sequences. Such studies replace specific steps in a CoT chain with errors and measure the resultant changes in answer accuracy, attempting to pinpoint which positions are "computationally load-bearing." The paper "The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies" (2605.10799) identifies a critical methodological confound in this approach: when answer statements are explicit and appear in fixed positions within a chain, corruption sensitivity reflects answer placement rather than genuine computation.

The central empirical hypothesis is answer-text readout dominance: at chain consumption time, models predominantly rely on explicit answer text for their final answer, with intermediate reasoning playing a diminished causal role. This issue is endemic in benchmarks like GSM8K and MATH, where chains almost universally conclude with a terminal answer statement.

Experimental Design and Empirical Findings

The authors use a protocol that combines format ablations, conflicting-answer tests, answer-placement controls, and generation-time probes. Task slices span arithmetic word problems (GSM8K, MATH), synthetic reasoning domains, and commonsense scenarios. Evaluations are conducted across nine open-weight instruction-tuned models from five architecture families and multiple parameter scales (3B–32B).

Core Results

Format Ablation: Removing the explicit answer statement from the chain's suffix in GSM8K, while preserving all intermediate reasoning, collapses suffix sensitivity by approximately 19x at 3B scale (N=300, p=0.022). The effect is replicated at 7B scale (9.3x attenuation; N=76, p=7.8×10^-3) and is corroborated in Qwen3-8B (N=299, p=0.004).
Conflicting-Answer Protocol: When correct reasoning is followed by a wrong explicit answer, models overwhelmingly follow the wrong answer text. At 7B, CC accuracy drops to ≤0.02 across five architecture families, and the followed-wrong rate (FW) spans 0.63–1.00 at 3B–7B, attenuating to 0.300 at Phi-4-14B and ~0.01 at 32B.
Bidirectional Controls & Relocation: Inserting an explicit answer suffix into chains without one establishes suffix sensitivity; relocating the answer text to a prefix header sharply reduces the FW rate (ΔFW=0.570, p<10^-10).
Cross-domain Evidence: The effect generalizes beyond arithmetic: on 150 commonsense reasoning items, the FW rate is 0.76, with accuracy drops mirroring answer placement rather than reasoning content.
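The format ablation and the conflicting-answer manipulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the regular expression for the answer statement, the helper names, and the exact-match scoring are all assumptions made for the example.

```python
import re

# Assumed pattern for a terminal statement such as "The answer is 42."
ANSWER_RE = re.compile(r"\s*the answer is\s*\$?(-?\d+(?:\.\d+)?)\.?\s*$", re.IGNORECASE)


def strip_answer_statement(chain: str) -> str:
    """Format ablation: drop only the terminal answer statement, keep all reasoning."""
    return ANSWER_RE.sub("", chain)


def conflicting_chain(chain: str, wrong_answer: str) -> str:
    """Keep the reasoning intact but end the chain with a wrong explicit answer."""
    return f"{strip_answer_statement(chain)} The answer is {wrong_answer}."


def followed_wrong_rate(predictions, wrong_answers) -> float:
    """Fraction of items whose prediction equals the injected wrong answer (FW)."""
    pairs = list(zip(predictions, wrong_answers))
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum(pred == wrong for pred, wrong in pairs) / len(pairs)
```

A real study would need a more tolerant answer extractor than string equality; the point here is only that the two manipulations differ in a single trailing sentence while the reasoning text stays identical.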

Scale Gradient

The confound persists up to 14B scale (8.5x sensitivity ratio, p=0.001). Both direct override and format-determination effects converge toward zero at 32B, with models increasingly extracting the final answer from intermediate reasoning rather than explicit answer text.
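One plausible reading of a "sensitivity ratio" is the accuracy drop caused by suffix corruption when the answer statement is present, divided by the same drop when it is absent. The sketch below assumes that definition, which this summary does not spell out, and uses invented accuracies rather than figures from the paper.

```python
def sensitivity(clean_accuracy: float, corrupted_accuracy: float) -> float:
    """Accuracy lost when the suffix position is corrupted."""
    return clean_accuracy - corrupted_accuracy


def attenuation_ratio(with_answer: float, without_answer: float) -> float:
    """Ratio of suffix sensitivity with vs. without the explicit answer statement."""
    if without_answer <= 0:
        raise ValueError("ratio is undefined when the ablated sensitivity is not positive")
    return with_answer / without_answer


# Illustrative numbers only (not taken from the paper).
with_answer = sensitivity(clean_accuracy=0.80, corrupted_accuracy=0.20)
without_answer = sensitivity(clean_accuracy=0.80, corrupted_accuracy=0.75)
ratio = attenuation_ratio(with_answer, without_answer)  # approximately 12
```

Under this reading, a ratio near 1 would mean the answer statement contributes nothing to suffix sensitivity, which is the direction the 32B results are reported to move in.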

Theoretical and Practical Implications

Protocol Recommendations

The paper proposes a minimal protocol for corruption-based CoT faithfulness evaluations:

Question-only control: Ensure chain-enabled accuracy exceeds question-only accuracy.
Format characterization: Explicitly report where in the chain the answer is encoded.
All-position sweep: Perform independent corruption at prefix, middle, and suffix positions.
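The three prerequisites above could be wired into a small harness along the following lines. This is a sketch under stated assumptions: the `Model` callable interface, the step-level chain representation, and the toy `last_number_model` (which simply echoes the last number it sees) are hypothetical and exist only to show what a format-driven result looks like.

```python
import re
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, List[str], str]   # (question, chain steps, gold answer)
Model = Callable[[str, str], str]   # assumed interface: (question, chain text) -> answer


def answer_positions(steps: List[str], gold: str) -> List[int]:
    """Format characterization: indices of steps that contain the gold answer."""
    return [i for i, step in enumerate(steps) if gold in re.findall(r"-?\d+", step)]


def corrupt(steps: List[str], position: str, wrong_step: str) -> List[str]:
    """Replace one step at the prefix, middle, or suffix position."""
    index = {"prefix": 0, "middle": len(steps) // 2, "suffix": len(steps) - 1}[position]
    corrupted = list(steps)
    corrupted[index] = wrong_step
    return corrupted


def sweep(model: Model, items: List[Item], wrong_step: str) -> Dict[str, float]:
    """Accuracy for the question-only control, the clean chain, and each corruption."""
    builders = {
        "question_only": lambda steps: [],
        "clean": lambda steps: steps,
        "prefix": lambda steps: corrupt(steps, "prefix", wrong_step),
        "middle": lambda steps: corrupt(steps, "middle", wrong_step),
        "suffix": lambda steps: corrupt(steps, "suffix", wrong_step),
    }
    accuracy = {}
    for name, build in builders.items():
        hits = sum(model(q, " ".join(build(steps))) == gold for q, steps, gold in items)
        accuracy[name] = hits / len(items)
    return accuracy


def last_number_model(question: str, chain: str) -> str:
    """Toy stand-in for a model that reads out the last number in the chain."""
    numbers = re.findall(r"-?\d+", chain)
    return numbers[-1] if numbers else ""


items = [
    ("q1", ["Tom has 3 apples.", "He buys 4 more, so 3 + 4 = 7.", "The answer is 7."], "7"),
    ("q2", ["A box holds 5 pens.", "Two boxes hold 2 * 5 = 10.", "The answer is 10."], "10"),
]
results = sweep(last_number_model, items, wrong_step="The answer is 0.")
```

With the toy model, only the suffix corruption hurts accuracy, even though no reasoning is performed at all. That is the signature the format characterization step is meant to expose before positional sensitivity is read as evidence of computation.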

Implications for Process Supervision and Reward Modeling

Process reward models (PRMs) that assign step-level credit based on corruption sensitivity risk rewarding answer expression rather than actual reasoning. Format-diverse chains are required to robustly evaluate whether positional credit tracks computation or simply answer placement.

Limitations on Faithfulness Attribution

Corruption studies that fail to control for answer placement may produce positional sensitivity that is purely an artifact of chain format, particularly in datasets where gold chains consistently end with "the answer is X." This contaminates causal claims about computational depth and reasoning faithfulness.

Benchmark and Evaluation Suite Design

Benchmark creators and evaluators should ensure variation in answer placement and provide format metadata. Reliance on suffix-bearing chains guarantees suffix sensitivity and obfuscates whether models are actually "reasoning" or merely reading explicit answers.

Mechanistic Interpretations and Open Questions

The empirically established behavioral pattern, in which positional sensitivity tracks answer text rather than computation, admits several mechanistic interpretations:

Instruction-following heuristic: Models may treat terminal answer statements as authoritative due to RLHF/SFT protocols.
Format-completion bias: Outputting the value in the answer slot is a learned template-driven behavior.
Recency-weighted readout: Most recent answer-bearing text is preferentially weighted at readout.

Generation-time probes reveal genuine stepwise computation during chain generation (early commitment ratio <5%), yet at consumption time, explicit answer text systematically overrides that computation. At 32B, models shift toward extracting the final answer from intermediate reasoning rather than from explicit answer text.

Strong Numerical Claims

Corruption sensitivity at the suffix position drops by nearly 19x when only the answer statement is removed (3B scale).
Conflicting explicit answer statements largely override correct computation at 7B scale (FW up to 1.00, CC accuracy ≤0.02).
Format-determination persists at 14B scale (8.5x sensitivity ratio), but both override and format sensitivity are negligible at 32B.
Bidirectional evidence shows inserting/removing answer statements shifts corruption sensitivity to/from the affected position, confirming necessity and sufficiency of answer text for format-determination.

Conclusion

This paper (2605.10799) demonstrates that widely adopted corruption-based evaluations of CoT faithfulness are systematically confounded by chain format: for chains with explicit terminal answer statements, positional sensitivity largely reflects answer-text placement, not actual computation. The pattern holds across multiple benchmarks, domains, and model architectures at scales up to 14B, constraining the interpretation of prior corruption studies and necessitating protocol amendments for future faithfulness evaluations. The findings call for explicit controls for answer placement before attributing corruption-based positional sensitivity to reasoning. At model scales up to 14B, format-based artifacts severely compromise faithfulness attribution; only at the largest scale tested (32B) do models reliably extract answers from reasoning rather than from explicit answer text. Practical implications extend to PRM design and benchmark creation, with broader relevance for evaluating robust reasoning in LLMs.
