Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision (2505.14999v2)

Published 21 May 2025 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Mathematical reasoning presents a significant challenge for LLMs, often requiring robust multi-step logical consistency. While Chain-of-Thought (CoT) prompting elicits reasoning steps, it doesn't guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post-hoc verifier. EORM leverages Energy-Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates: lower energy is assigned to solutions leading to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute-force sampling, thereby enhancing LLM reasoning outcome reliability through its streamlined post-hoc verification process.

Summary

  • The paper presents an energy-based reward model (EORM) that reranks chain-of-thought solutions by assigning lower energies to correct outcomes.
  • The methodology employs a Transformer encoder with an MLP head and a Bradley-Terry loss to efficiently distinguish correct from incorrect reasoning paths.
  • Experiments on GSM8k, MATH, and OOD benchmarks demonstrate that EORM significantly enhances final answer accuracy and generalizes to unseen problems.

This paper introduces the Energy Outcome Reward Model (EORM), a lightweight, post-hoc verifier designed to improve the mathematical reasoning capabilities of LLMs. The core problem addressed is that while Chain-of-Thought (CoT) prompting elicits reasoning steps from LLMs, it doesn't guarantee correctness, and existing methods to improve reliability, like extensive sampling, are computationally expensive.

EORM leverages Energy-Based Models (EBMs) to rank CoT solutions. It is trained to assign a scalar energy score to each CoT, where lower energy indicates a higher preference for solutions leading to correct final outcomes. A key aspect is its use of only outcome labels (i.e., whether the final answer of a CoT is correct or incorrect) for training, avoiding the need for detailed step-by-step annotations or preference pair labels, thus reducing annotation costs. The paper demonstrates that standard classifier output logits can be interpreted as negative energies, simplifying the training of the EBM-based reward model.
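To make the "logits as negative energies" point concrete, here is a minimal derivation in our own notation (the paper's exact parameterization may differ). For a binary classifier with scalar output logit $f_\theta(y)$ for a candidate CoT $y$:

$$p_\theta(\text{correct} \mid y) = \sigma\bigl(f_\theta(y)\bigr) = \frac{e^{f_\theta(y)}}{e^{f_\theta(y)} + 1} = \frac{e^{-E_\theta(y)}}{e^{-E_\theta(y)} + 1}, \qquad E_\theta(y) := -f_\theta(y).$$

Under this identification, lower energy corresponds directly to a higher predicted probability that the CoT ends in a correct answer, so candidates can be ranked by energy without any extra normalization.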

Methodology:

The EORM architecture consists of a Transformer encoder followed by a small Multi-Layer Perceptron (MLP) head. An input CoT solution is tokenized (with a [CLS] token prepended), passed through the encoder, and the hidden state of the [CLS] token is fed to the MLP head to produce a single scalar energy score $E_\theta(y)$.
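A minimal PyTorch sketch of such a scorer, assuming a BERT-style encoder from Hugging Face transformers (the paper's exact encoder, head width, and tokenization details may differ):

```python
# Sketch of an EORM-style scorer: Transformer encoder + MLP head that maps
# a tokenized CoT solution to a single scalar energy. Encoder choice and
# head size here are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EnergyScorer(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased", hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        # Small MLP head producing one scalar energy per sequence.
        self.head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # hidden state of the [CLS] token
        return self.head(cls).squeeze(-1)      # energy E_theta(y), shape [batch]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
scorer = EnergyScorer()
batch = tokenizer(["Step 1: ... Final answer: 42"], return_tensors="pt",
                  truncation=True, padding=True)
energy = scorer(batch["input_ids"], batch["attention_mask"])
```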

Training is performed using a pairwise Bradley-Terry loss. For a given problem, multiple CoT candidates are generated. These are grouped into correct ($y_+$) and incorrect ($y_-$) solutions based on their final answers. The loss function encourages the model to assign lower energy to correct solutions compared to incorrect ones within the same problem context:

$$\mathcal{L}(\theta;\mathcal{Y}_n) \;=\; \frac{1}{|\mathcal{Y}_+|\,|\mathcal{Y}_-|} \sum_{y_+\in\mathcal{Y}_+}\sum_{y_-\in\mathcal{Y}_-} \log\Bigl(1 + \exp\bigl(E_\theta(y_+) - E_\theta(y_-)\bigr)\Bigr)$$
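A compact PyTorch sketch of this loss for a single problem's candidate pool (our implementation, using the numerically stable softplus for $\log(1+\exp(\cdot))$):

```python
# Pairwise Bradley-Terry loss over one problem's candidates, as in the
# equation above; softplus(x) = log(1 + exp(x)).
import torch
import torch.nn.functional as F

def eorm_pairwise_loss(energy_pos: torch.Tensor, energy_neg: torch.Tensor) -> torch.Tensor:
    """energy_pos: energies of correct candidates, shape [P];
    energy_neg: energies of incorrect candidates, shape [N]."""
    # All P*N pairwise differences E(y+) - E(y-).
    diff = energy_pos.unsqueeze(1) - energy_neg.unsqueeze(0)   # [P, N]
    # Mean of log(1 + exp(diff)) over pairs; minimized when correct
    # candidates receive lower energy than incorrect ones.
    return F.softplus(diff).mean()

# Example: three correct and two incorrect candidates for one problem.
loss = eorm_pairwise_loss(torch.tensor([0.1, -0.3, 0.0]), torch.tensor([0.5, 1.2]))
```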

During deployment, EORM processes a pool of candidate CoT solutions generated by an LLM for a given problem and selects the one with the lowest assigned energy score as the final answer.
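As an illustration, the selection step might look like the following, reusing the hypothetical scorer and tokenizer sketched above (not the authors' released code):

```python
# Score a pool of sampled CoT candidates and return the lowest-energy one.
import torch

def select_best_cot(candidates, scorer, tokenizer, device="cpu"):
    batch = tokenizer(candidates, return_tensors="pt",
                      truncation=True, padding=True).to(device)
    with torch.no_grad():
        energies = scorer(batch["input_ids"], batch["attention_mask"])
    return candidates[int(torch.argmin(energies))]   # lowest energy wins
```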

Experiments and Results:

EORM was evaluated on mathematical reasoning benchmarks: GSM8k and MATH for in-distribution tasks, and AIME 2024, AMC, AGIEval SAT Math, and Gaokao Math for out-of-distribution (OOD) generalization. The base LLMs used for generating CoT candidates included Mistral-7B, DeepSeekMath-7B, Llama 3 8B, Qwen 2.5 7B, and Llama 2 7B.

For training EORM, CoT solutions were generated from these LLMs for problems in the GSM8k and MATH training splits ($n = 256$ candidates per problem). For evaluation, $n = 256$ candidates were generated for in-distribution problems and $n = 64$ for OOD problems, with EORM selecting the best candidate.
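Because only outcome labels are needed, the training pairs can be assembled with a simple grouping step; a sketch, where extract_final_answer is a hypothetical answer-parsing helper:

```python
# Group each problem's sampled CoT candidates by whether their final answer
# matches the reference; no step-level or preference annotations required.
def build_groups(candidates, reference_answer, extract_final_answer):
    pos, neg = [], []
    for cot in candidates:
        (pos if extract_final_answer(cot) == reference_answer else neg).append(cot)
    return pos, neg   # score each group, then apply eorm_pairwise_loss
```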

  • In-Distribution Performance: EORM significantly improved final answer accuracy. For instance, with Llama 3 8B, it achieved 90.7% on GSM8k and 63.7% on MATH by reranking 256 candidates. The paper shows that performance generally scales with the number of samples. Kernel Density Estimation plots of energy scores demonstrated a clear separation between correct and incorrect solutions.
  • Out-of-Distribution Performance: EORM demonstrated strong generalization to unseen problems and different reasoning styles. Trained only on GSM8k and MATH data, when applied to Llama 3 8B outputs for OOD tasks, EORM generally outperformed other reranking baselines like TTRL and MathWizard, achieving, for example, 10.0% on AIME 2024 and 70.3% on AGIEval Gaokao Math with 64 candidates. Similar improvements were observed with Qwen 2.5 7B.

Main Contributions:

  1. Efficient EBM Reranker for CoT Reasoning: EORM provides an energy-based verifier to accurately assess and rerank CoT solutions from a limited set of candidates.
  2. Logits as Energy for Reward Modeling: The paper shows the utility of interpreting classifier logits as negative energies for straightforward training of an energy-based reward model.
  3. Significant Empirical Improvements: EORM substantially improves final answer accuracy on diverse mathematical reasoning benchmarks (GSM8k, MATH) and demonstrates robust generalization to OOD datasets.

The paper concludes that EORM is an effective and efficient post-hoc verifier that enhances LLM reasoning reliability, particularly in settings with constrained computational resources and limited annotation capabilities, by learning a discriminative energy function from simple outcome labels. The code is available at https://github.com/ericjiang18/EnergyORM/tree/main.
