This paper introduces LaTRO (Latent Reasoning Optimization), a novel framework designed to enhance the reasoning capabilities of LLMs without relying on external feedback or pre-existing reward models. The core idea is to treat the reasoning process (the "thoughts" or "rationales" an LLM generates) as sampling from a latent distribution. LaTRO then optimizes this process using a variational approach, enabling the LLM to simultaneously improve its ability to generate high-quality reasoning steps and to evaluate the quality of these steps. This self-improvement loop is termed "self-rewarding."
The authors argue that while prompt-based methods like Chain-of-Thought (CoT) improve reasoning at inference time, optimizing these capabilities during training remains difficult due to the scarcity of high-quality reasoning data and the challenges of developing accurate external reward models for reinforcement learning. LaTRO addresses this by leveraging the LLM's own probability estimates as a reward signal.
Methodology:
- Problem Formulation: Standard LLM fine-tuning maximizes the likelihood of generating a correct answer $y$ given a query $x$ over a training set $\mathcal{D}$, i.e., $\max_\theta \mathbb{E}_{(x, y) \sim \mathcal{D}}[\log \pi_\theta(y \mid x)]$. LaTRO introduces a latent reasoning rationale $z$ and aims to optimize the LLM $\pi_\theta$ by maximizing the expected log-likelihood of the answer given the query and the self-generated rationale, while regularizing the rationale generation process.
- Variational Approach: The objective is lower-bounded by an expression involving an auxiliary distribution $q(z \mid x)$ for the rationales: $\log \mathbb{E}_{z \sim \pi_0(\cdot \mid x)}[\pi_\theta(y \mid x, z)] \ge \mathbb{E}_{z \sim q(\cdot \mid x)}[\log \pi_\theta(y \mid x, z)] - D_{\mathrm{KL}}(q(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x))$, where $\pi_0$ is a prior reference LLM (typically the initial state of $\pi_\theta$); a compact derivation is given after this list.
- Self-Rewarding Objective: LaTRO simplifies this by setting the "reasoner" $q$ to be the LLM $\pi_\theta$ itself. The optimization objective becomes: $J(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}[\log \pi_\theta(y \mid x, z)] - D_{\mathrm{KL}}(\pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x))\big]$. Here, $\log \pi_\theta(y \mid x, z)$ acts as a self-generated reward: rationales that lead to a higher probability of the correct answer are considered better.
- Gradient Estimation: The gradient of $J(\theta)$ involves two main parts:
- A policy gradient term for optimizing the rationale generator $\pi_\theta(z \mid x)$, using the REINFORCE Leave-One-Out (RLOO) estimator to reduce variance. The advantage for a sampled rationale $z_k$ is its reward minus the average reward of the other sampled rationales. The reward combines the self-reward term with the KL-divergence penalty: $r(z_k) = \log \pi_\theta(y \mid x, z_k) - \big(\log \pi_\theta(z_k \mid x) - \log \pi_0(z_k \mid x)\big)$.
- A maximum likelihood estimation (MLE) term, $\nabla_\theta \log \pi_\theta(y \mid x, z_k)$, for optimizing the LLM's ability to produce the correct answer given the query $x$ and the sampled rationale $z_k$. The full empirical gradient estimator over a minibatch $\mathcal{B}$, with $K$ rationales per example, is: $\hat{\nabla}_\theta J = \frac{1}{|\mathcal{B}|} \sum_{(x, y) \in \mathcal{B}} \frac{1}{K} \sum_{k=1}^{K} \Big[ \big( r(z_k) - \tfrac{1}{K-1} \sum_{j \neq k} r(z_j) \big) \nabla_\theta \log \pi_\theta(z_k \mid x) + \nabla_\theta \log \pi_\theta(y \mid x, z_k) \Big]$ (a code sketch of this update appears after this list).
- Practical Implementation:
- During training, for each data point $(x, y)$, $K$ rationales $z_1, \dots, z_K$ are sampled from the current LLM $\pi_\theta$.
- Rationales are truncated to a maximum length or until an end-of-answer token.
- The model parameters are updated using the estimated gradient. The overall algorithm is summarized in Algorithm 1.
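For completeness, here is the intermediate step behind the lower bound quoted above. It is the standard ELBO argument (importance weighting plus Jensen's inequality) written in the same notation, not an excerpt from the paper:

$$
\begin{aligned}
\log \mathbb{E}_{z \sim \pi_0(\cdot \mid x)}\big[\pi_\theta(y \mid x, z)\big]
&= \log \mathbb{E}_{z \sim q(\cdot \mid x)}\!\left[\pi_\theta(y \mid x, z)\,\frac{\pi_0(z \mid x)}{q(z \mid x)}\right] \\
&\ge \mathbb{E}_{z \sim q(\cdot \mid x)}\!\left[\log \pi_\theta(y \mid x, z) + \log \frac{\pi_0(z \mid x)}{q(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q(\cdot \mid x)}\big[\log \pi_\theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(q(\cdot \mid x)\,\|\,\pi_0(\cdot \mid x)\big).
\end{aligned}
$$

Choosing $q = \pi_\theta$ and averaging over training pairs $(x, y) \sim \mathcal{D}$ yields the self-rewarding objective $J(\theta)$ above.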
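To make the update concrete, below is a minimal Python sketch of the sampling, self-rewarding, and RLOO steps described in this list. It is an illustration under the procedure as summarized here, not the authors' Algorithm 1; `sample_rationale`, `logprob_theta`, `logprob_ref`, `score_rationales`, and `rloo_advantages` are hypothetical names for whatever model interface is available, and context concatenation is simplified to string concatenation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for the trainable LLM pi_theta and the frozen
# reference LLM pi_0 (not part of any real library):
#   sample_rationale(x)       -> one rationale string z sampled from pi_theta(. | x)
#   logprob_theta(text, ctx)  -> log pi_theta(text | ctx)
#   logprob_ref(text, ctx)    -> log pi_0(text | ctx)


@dataclass
class ScoredRationale:
    rationale: str
    reward: float          # self-reward minus the per-sample KL penalty
    answer_logprob: float  # log pi_theta(y | x, z), reused for the MLE term


def score_rationales(
    x: str,
    y: str,
    K: int,
    sample_rationale: Callable[[str], str],
    logprob_theta: Callable[[str, str], float],
    logprob_ref: Callable[[str, str], float],
) -> List[ScoredRationale]:
    """Sample K rationales for (x, y) and compute each one's self-reward."""
    scored = []
    for _ in range(K):
        z = sample_rationale(x)
        answer_lp = logprob_theta(y, x + z)                    # log pi_theta(y | x, z)
        kl_penalty = logprob_theta(z, x) - logprob_ref(z, x)   # single-sample KL estimate
        scored.append(ScoredRationale(z, answer_lp - kl_penalty, answer_lp))
    return scored


def rloo_advantages(rewards: List[float]) -> List[float]:
    """REINFORCE Leave-One-Out: each sample's baseline is the mean of the others."""
    K, total = len(rewards), sum(rewards)
    return [r - (total - r) / (K - 1) for r in rewards]


# The parameter update then follows the empirical estimator above: for each (x, y)
# in the minibatch, an autodiff framework minimizes the surrogate loss
#   -(1/K) * sum_k [ advantage_k * log pi_theta(z_k | x) + log pi_theta(y | x, z_k) ]
# with respect to theta (advantages treated as constants).
```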
Experiments and Results:
LaTRO was evaluated on GSM8K (mathematical reasoning) and ARC-Challenge (logical reasoning) datasets using Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B models.
- Baselines:
1. Base model (zero-shot CoT).
2. Supervised Fine-Tuning (SFT): on GSM8K, SFT used the gold rationales; on ARC-Challenge, SFT was trained to generate answers directly (no gold rationales are available).
- Key Findings on GSM8K:
- LaTRO significantly improved zero-shot accuracy over the base models: on average +12.5% with greedy decoding (GD) and +13.1% with self-consistency (maj@8).
- LaTRO also outperformed the SFT models: on average +9.6% (GD) and +13.2% (maj@8).
- For example, with Phi-3.5, LaTRO achieved 87.6% (GD) and 90.5% (maj@8), compared to the base model's 72.9% (GD) and 74.0% (maj@8) and SFT's 75.8% (GD) and 77.1% (maj@8).
- Key Findings on ARC-Challenge:
- LaTRO showed improvements, though smaller than on GSM8K.
- Over base models: average +1.0% (GD), +2.4% (maj@8).
- Over SFT models (which performed worse than base models): average +5.2% (GD), +8.1% (maj@8).
- For example, with Phi-3.5, LaTRO achieved 86.4% (GD) and 87.5% (maj@8), compared to the base model's 85.1% (GD) and 86.0% (maj@8) and SFT's 81.0% (GD) and 80.5% (maj@8).
- Ablation Studies (on GSM8K with Phi-3.5):
- Maximum Generation Length ($L$): Accuracy gains plateaued once the maximum rationale length was sufficiently large. Training with a shorter length budget still improved performance under that constraint, suggesting LaTRO can train models to produce more concise rationales.
- Inference-time Scaling (Self-Consistency): LaTRO-trained models still benefited from sampling multiple rationales at inference time (self-consistency), with accuracy continuing to improve as the number of sampled rationales grew (a minimal majority-voting sketch follows this list).
- Case Study: Qualitative analysis showed LaTRO-trained models generated more concise and correct reasoning steps compared to base and SFT models, which often made logical or arithmetic errors.
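As a reference point for the inference-time scaling result, self-consistency amounts to sampling several rationales independently and majority-voting their final answers. The sketch below illustrates this; `generate_answer` is a hypothetical routine that samples one chain of thought and returns its extracted final answer, and is not taken from the paper's code.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(
    question: str,
    generate_answer: Callable[[str], str],  # hypothetical: samples one CoT, returns its final answer
    num_samples: int = 8,                   # maj@8 as reported in the experiments
) -> str:
    """Sample num_samples answers independently and return the majority-voted one."""
    answers: List[str] = [generate_answer(question) for _ in range(num_samples)]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer
```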
Contributions:
- A theoretical formulation linking LLM reasoning optimization to latent variable models.
- A self-rewarding mechanism using the model's own probability estimates, eliminating the need for external reward models or human feedback for rationale quality.
- Demonstration of significant performance gains on reasoning tasks across multiple LLM architectures, unlocking latent reasoning capabilities.
Conclusion:
LaTRO presents a principled and effective method for enhancing LLM reasoning by enabling models to self-improve their rationale generation and evaluation abilities. The results suggest that pre-trained LLMs possess untapped reasoning potential that can be unlocked through this self-rewarding optimization. While computationally intensive due to the sampling of multiple rationales during training, LaTRO offers a promising direction for developing more capable and self-evolving LLMs. Future work could explore more efficient sampling, adaptive rationale generation, and the application of LaTRO to a broader range of tasks.