- The paper identifies that answer placement confounds corruption-based chain-of-thought evaluations, because models prioritize explicit answer text over intermediate reasoning.
- Methodologically, format ablations reveal up to 19x reduction in suffix sensitivity when the explicit answer is removed, highlighting scale-dependent effects.
- Implications include necessary protocol adjustments for CoT evaluations, affecting PRM design and benchmark standards to better capture genuine reasoning.
Methodological Confounds in Chain-of-Thought Corruption Studies: Answer Placement vs. Computation
Chain-of-thought (CoT) prompting has become a de facto standard for eliciting step-by-step reasoning in LLMs, with corruption studies serving as a principal empirical methodology for evaluating the faithfulness of these reasoning sequences. Such studies replace specific steps in a CoT chain with errors and measure the resulting change in answer accuracy, attempting to pinpoint which positions are "computationally load-bearing." The paper "The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies" (2605.10799) identifies a critical methodological confound in this approach: when answer statements are explicit and appear in fixed positions within a chain, corruption sensitivity reflects answer placement rather than genuine computation.
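To make the measurement concrete, here is a minimal sketch of a corruption study, not the paper's implementation. It assumes a hypothetical model interface `answer_fn(question, steps)`, a hypothetical corruption rule (every number in the chosen step is incremented by 1), and a toy stand-in model, `last_number_reader`, that simply returns the last number in the chain. The toy model is an extreme case of answer-text readout, chosen only to show how suffix sensitivity can arise without any computation.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical item format: (question, chain steps, gold answer).
Item = Tuple[str, List[str], str]
# Hypothetical model interface: reads a question plus a chain, returns an answer string.
AnswerFn = Callable[[str, Sequence[str]], str]


def corrupt_step(steps: Sequence[str], index: int) -> List[str]:
    """Return a copy of the chain with every number in one step incremented by 1."""
    corrupted = list(steps)
    corrupted[index] = re.sub(
        r"\d+", lambda m: str(int(m.group()) + 1), corrupted[index]
    )
    return corrupted


def accuracy(items: Sequence[Item], answer_fn: AnswerFn,
             index: Optional[int] = None) -> float:
    """Accuracy over items; if index is given, that step is corrupted first."""
    correct = 0
    for question, steps, gold in items:
        chain = steps if index is None else corrupt_step(steps, index)
        correct += answer_fn(question, chain) == gold
    return correct / len(items)


def sensitivity(items: Sequence[Item], answer_fn: AnswerFn, index: int) -> float:
    """Accuracy drop caused by corrupting the step at `index`."""
    return accuracy(items, answer_fn) - accuracy(items, answer_fn, index)


def last_number_reader(question: str, steps: Sequence[str]) -> str:
    """Toy stand-in for a model: it just returns the last number in the chain."""
    numbers = re.findall(r"\d+", " ".join(steps))
    return numbers[-1] if numbers else ""


# Illustrative item, made up for this sketch (not taken from GSM8K or MATH).
items = [
    ("Tom has 3 bags of 4 apples and eats 2. How many are left?",
     ["3 bags of 4 apples is 12 apples.", "12 minus 2 is 10.", "The answer is 10."],
     "10"),
]
print(sensitivity(items, last_number_reader, index=0))   # prefix step
print(sensitivity(items, last_number_reader, index=-1))  # suffix (answer statement)
```

For this toy reader, corrupting the first reasoning step leaves accuracy unchanged while corrupting the terminal answer statement removes it entirely, even though no reasoning is performed at all.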
The central empirical hypothesis is answer-text readout dominance: at chain consumption time, models predominantly rely on explicit answer text for their final answer, with intermediate reasoning playing a diminished causal role. This issue is endemic in benchmarks like GSM8K and MATH, where chains almost universally conclude with a terminal answer statement.
Experimental Design and Empirical Findings
The authors utilize a comprehensive protocol involving format ablations, conflicting-answer tests, answer-placement controls, and generation-time probes. Task slices span arithmetic word problems (GSM8K, MATH), synthetic reasoning domains, and commonsense scenarios. Evaluations are conducted across nine open-weight instruction-tuned models from four architecture families and multiple parameter scales (3B–32B).
Core Results
- Format Ablation: Removing the explicit answer statement from the chain's suffix in GSM8K collapses suffix sensitivity by approximately 19x at 3B scale (N=300, p=0.022), while preserving all intermediate reasoning. The effect is replicated at 7B scale (9.3x attenuation; N=76, p=7.8×10^-3) and is corroborated in Qwen3-8B (N=299, p=0.004).
- Conflicting-Answer Protocol: When correct reasoning is followed by a wrong explicit answer, models overwhelmingly follow the wrong answer text. At 7B, CC accuracy drops to ≤0.02 across five architecture families, and the followed-wrong rate (FW) spans 0.63–1.00 at 3B–7B, attenuating to 0.300 at Phi-4-14B and ~0.01 at 32B.
- Bidirectional Controls & Relocation: Inserting an explicit answer suffix into chains without one induces suffix sensitivity; relocating the answer text to a prefix header drastically reduces the FW rate (ΔFW=0.570, p<10^-10).
- Cross-domain Evidence: The effect generalizes beyond arithmetic: on 150 commonsense reasoning items, the FW rate is 0.76, with accuracy drops mirroring answer placement rather than reasoning content.
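The chain manipulations behind these results can be sketched as simple list transformations. The code below is an illustration under stated assumptions, not the authors' code: the answer template "The answer is X." is assumed, the function names are hypothetical, and the model is replaced by a toy reader that returns the last number in the chain.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Assumed answer-statement template; real chains may phrase this differently.
ANSWER_RE = re.compile(r"^The answer is (\S+?)\.?$")

# Hypothetical case format: (question, chain steps, planted wrong answer).
Case = Tuple[str, List[str], str]
AnswerFn = Callable[[str, Sequence[str]], str]


def strip_answer(steps: Sequence[str]) -> List[str]:
    """Format ablation: drop explicit answer statements, keep all reasoning steps."""
    return [s for s in steps if not ANSWER_RE.match(s)]


def with_conflicting_answer(steps: Sequence[str], wrong: str) -> List[str]:
    """Conflicting-answer chain: correct reasoning, then a wrong answer statement."""
    return strip_answer(steps) + [f"The answer is {wrong}."]


def relocate_answer_to_prefix(steps: Sequence[str]) -> List[str]:
    """Relocation control: move the answer statement(s) to the front of the chain."""
    answers = [s for s in steps if ANSWER_RE.match(s)]
    return answers + strip_answer(steps)


def followed_wrong_rate(cases: Sequence[Case], answer_fn: AnswerFn) -> float:
    """Fraction of conflicting chains where the output equals the planted wrong answer."""
    followed = sum(
        answer_fn(question, with_conflicting_answer(steps, wrong)) == wrong
        for question, steps, wrong in cases
    )
    return followed / len(cases)


def last_number_reader(question: str, steps: Sequence[str]) -> str:
    """Toy stand-in for a model: returns the last number that appears in the chain."""
    numbers = re.findall(r"\d+", " ".join(steps))
    return numbers[-1] if numbers else ""


# Illustrative chain, made up for this sketch.
chain = ["3 bags of 4 apples is 12 apples.", "12 minus 2 is 10.", "The answer is 10."]
cases = [("Tom has 3 bags of 4 apples and eats 2. How many are left?", chain, "7")]

print(with_conflicting_answer(chain, "7"))
print(followed_wrong_rate(cases, last_number_reader))
```

A real evaluation would replace `last_number_reader` with calls to the model under test and would report the FW rate over many items, alongside accuracy on the answer-stripped and answer-relocated variants.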
Scale Gradient
The confound persists up to 14B scale (8.5x sensitivity ratio, p=0.001). Both direct override and format-determination effects converge toward zero at 32B, with models increasingly extracting the final answer from intermediate reasoning rather than explicit answer text.
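The summary does not spell out how the sensitivity ratio is defined. One plausible reading, offered here as an assumption rather than the paper's definition, treats sensitivity at position p as the accuracy drop under corruption at p, and the ratio as suffix sensitivity with the explicit answer present divided by suffix sensitivity with it removed:

```latex
\[
S_p = \mathrm{Acc}_{\text{clean}} - \mathrm{Acc}_{\text{corrupt}(p)},
\qquad
R = \frac{S_{\text{suffix}}^{\text{answer present}}}{S_{\text{suffix}}^{\text{answer removed}}}
\]
```

Under this reading, the reported 19x collapse at 3B corresponds to R of about 19 and the 14B figure to R of about 8.5, with R approaching 1 as format stops determining where sensitivity appears.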
Theoretical and Practical Implications
Protocol Recommendations
The paper proposes a minimal protocol for corruption-based CoT faithfulness evaluations:
- Question-only control: Ensure chain-enabled accuracy exceeds question-only accuracy.
- Format characterization: Explicitly report where in the chain the answer is encoded.
- All-position sweep: Perform independent corruption at prefix, middle, and suffix positions.
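The three checks above can be combined into one small harness. This is a sketch under assumptions, not the paper's tooling: the model interface `answer_fn(question, steps)`, the number-incrementing corruption, the numeric-answer matching in `answer_positions`, and the toy `last_number_reader` model are all hypothetical choices made for illustration.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[str, List[str], str]            # (question, chain steps, gold answer)
AnswerFn = Callable[[str, Sequence[str]], str]


def accuracy(items: Sequence[Item], answer_fn: AnswerFn,
             transform=lambda steps: steps) -> float:
    """Accuracy after applying `transform` to every chain."""
    hits = sum(answer_fn(q, transform(steps)) == gold for q, steps, gold in items)
    return hits / len(items)


def bump_numbers(step: str) -> str:
    """Hypothetical corruption: increment every number in the step by 1."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), step)


def answer_positions(steps: Sequence[str], gold: str) -> List[int]:
    """Format characterization: step indices where the (numeric) gold answer appears."""
    return [i for i, s in enumerate(steps) if gold in re.findall(r"\d+", s)]


def run_protocol(items: Sequence[Item], answer_fn: AnswerFn) -> Dict[str, object]:
    clean = accuracy(items, answer_fn)
    # Question-only control: the model sees no chain at all.
    question_only = accuracy(items, answer_fn, lambda steps: [])
    # All-position sweep: corrupt prefix, middle, and suffix independently.
    n_steps = min(len(steps) for _, steps, _ in items)
    slots = {"prefix": 0, "middle": n_steps // 2, "suffix": -1}
    sweep = {}
    for name, idx in slots.items():
        def corrupt(steps, idx=idx):
            out = list(steps)
            out[idx] = bump_numbers(out[idx])
            return out
        sweep[name] = clean - accuracy(items, answer_fn, corrupt)
    return {
        "chain_helps": clean > question_only,
        "answer_positions": [answer_positions(s, g) for _, s, g in items],
        "sensitivity": sweep,
    }


def last_number_reader(question: str, steps: Sequence[str]) -> str:
    """Toy stand-in for a model: returns the last number that appears in the chain."""
    numbers = re.findall(r"\d+", " ".join(steps))
    return numbers[-1] if numbers else ""


# Illustrative item, made up for this sketch.
items = [
    ("Tom has 3 bags of 4 apples and eats 2. How many are left?",
     ["3 bags of 4 apples is 12 apples.", "12 minus 2 is 10.", "The answer is 10."],
     "10"),
]
result = run_protocol(items, last_number_reader)
print(result)
```

Reporting `answer_positions` next to the sweep lets a reader see whether the positions that show sensitivity are simply the positions that contain the answer text.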
Implications for Process Supervision and Reward Modeling
Process reward models (PRMs) that assign step-level credit based on corruption sensitivity risk rewarding answer expression rather than actual reasoning. Format-diverse chains are required to robustly evaluate whether positional credit tracks computation or simply answer placement.
Limitations on Faithfulness Attribution
Corruption studies that fail to control for answer placement may produce positional sensitivity that is purely an artifact of chain format, particularly in datasets where gold chains consistently end with "the answer is X." This contaminates causal claims about computational depth and reasoning faithfulness.
Benchmark and Evaluation Suite Design
Benchmark creators and evaluators should ensure variation in answer placement and provide format metadata. Reliance on suffix-bearing chains guarantees suffix sensitivity and obfuscates whether models are actually "reasoning" or merely reading explicit answers.
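One lightweight way to provide such format metadata is to record, per chain, where the explicit answer statement sits and to report the distribution for the whole benchmark. The sketch below is hypothetical: the `ChainFormat` record, the placement labels, and the "The answer is" marker are assumptions for illustration, not a schema from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ChainFormat:
    """Hypothetical per-chain format metadata."""
    n_steps: int
    answer_placement: str   # "prefix", "middle", "suffix", or "none"


def describe_chain(steps: Sequence[str], marker: str = "The answer is") -> ChainFormat:
    """Classify where explicit answer statements sit (suffix wins if several exist)."""
    hits = [i for i, s in enumerate(steps) if marker in s]
    if not hits:
        placement = "none"
    elif hits[-1] == len(steps) - 1:
        placement = "suffix"
    elif hits[0] == 0:
        placement = "prefix"
    else:
        placement = "middle"
    return ChainFormat(len(steps), placement)


def placement_counts(chains: Sequence[Sequence[str]]) -> Counter:
    """Distribution of answer placements across a set of chains."""
    return Counter(describe_chain(c).answer_placement for c in chains)


# Illustrative chains, made up for this sketch.
reasoning = "12 minus 2 is 10."
chains: List[List[str]] = [
    [reasoning, "The answer is 10."],
    ["The answer is 10.", reasoning],
    [reasoning, reasoning],
]
print(placement_counts(chains))
```

A benchmark whose counts fall almost entirely under "suffix" is one where, per the argument above, suffix sensitivity should be expected regardless of how the model reasons.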
Mechanistic Interpretations and Open Questions
The empirically established behavioral pattern (positional sensitivity tracking answer text, not computation) admits several mechanistic interpretations:
- Instruction-following heuristic: Models may treat terminal answer statements as authoritative due to RLHF/SFT protocols.
- Format-completion bias: Outputting the value in the answer slot is a learned template-driven behavior.
- Recency-weighted readout: Most recent answer-bearing text is preferentially weighted at readout.
Generation-time probes reveal genuine stepwise computation during chain generation (early commitment ratio <5%), yet at consumption time, explicit answer text systematically overrides that computation. At large scales, models shift to relying on the intermediate reasoning rather than the explicit answer text.
Contradictory and Strong Numerical Claims
- Corruption sensitivity at the suffix position drops by nearly 19x when only the answer statement is removed (3B scale).
- Conflicting explicit answer statements override correct computation entirely at 7B scale (FW up to 1.00, CC accuracy ≤0.02).
- Format-determination persists at 14B scale (8.5x sensitivity ratio), but both override and format sensitivity are negligible at 32B.
- Bidirectional evidence shows inserting/removing answer statements shifts corruption sensitivity to/from the affected position, confirming necessity and sufficiency of answer text for format-determination.
Conclusion
This paper (2605.10799) rigorously demonstrates that widely adopted corruption-based evaluations of CoT faithfulness are systematically confounded by chain format: positional sensitivity almost always reflects answer-text placement, not actual computation. This holds across multiple benchmarks, domains, model architectures, and scales, constraining the interpretation of prior corruption studies and necessitating protocol amendments for future faithfulness evaluations. The findings mandate explicit controls for answer placement before attributing corruption-based positional sensitivity to reasoning. At model scales up to 14B, format-based artifacts severely compromise faithfulness attribution; only at the largest scales do models reliably extract answers from reasoning, not explicit answer text. Practical implications extend to PRM design and benchmark creation, with broader relevance for evaluating robust reasoning in LLMs.