This paper introduces LaTRO (Latent Reasoning Optimization), a novel framework designed to enhance the reasoning capabilities of LLMs without relying on external feedback or pre-existing reward models. The core idea is to treat the reasoning process (the "thoughts" or "rationales" an LLM generates) as sampling from a latent distribution. LaTRO then optimizes this process using a variational approach, enabling the LLM to simultaneously improve its ability to generate high-quality reasoning steps and to evaluate the quality of these steps. This self-improvement loop is termed "self-rewarding."
The authors argue that while prompt-based methods like Chain-of-Thought (CoT) improve reasoning at inference time, optimizing these capabilities during training remains difficult due to the scarcity of high-quality reasoning data and the challenges of developing accurate external reward models for reinforcement learning. LaTRO addresses this by leveraging the LLM's own probability estimates as a reward signal.
Methodology:
- Problem Formulation: Standard LLM fine-tuning maximizes the likelihood of generating a correct answer $y$ given a query $x$ over a training set $\mathcal{D}$, i.e., $\max_\theta \mathbb{E}_{(x, y) \sim \mathcal{D}}[\log \pi_\theta(y \mid x)]$. LaTRO introduces a latent reasoning rationale $z$ and aims to optimize the LLM $\pi_\theta$ by maximizing the expected log-likelihood of the answer given the query and the self-generated rationale, while regularizing the rationale generation process.
- Variational Approach: The objective is lower-bounded by an expression involving an auxiliary distribution $q(z \mid x)$ for the rationales: $\log \mathbb{E}_{z \sim \pi_0(\cdot \mid x)}[\pi_\theta(y \mid x, z)] \ge \mathbb{E}_{z \sim q(\cdot \mid x)}[\log \pi_\theta(y \mid x, z)] - D_{\mathrm{KL}}(q(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x))$, where $\pi_0$ is a prior reference LLM (typically the initial state of $\pi_\theta$); a compact derivation is given after this list.
- Self-Rewarding Objective: LaTRO simplifies this by setting the "reasoner" $q$ to be the LLM $\pi_\theta$ itself. The optimization objective becomes: $J(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}[\log \pi_\theta(y \mid x, z)] - D_{\mathrm{KL}}(\pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x))\big]$. Here, $\log \pi_\theta(y \mid x, z)$ acts as a self-generated reward: rationales that lead to a higher probability of the correct answer are considered better.
- Gradient Estimation: The gradient of $J(\theta)$ involves two main parts:
- A policy gradient term for optimizing the rationale generator $\pi_\theta(z \mid x)$, using the REINFORCE Leave-One-Out (RLOO) estimator to reduce variance. The advantage for a sampled rationale $z_k$ is its reward minus the average reward of the other sampled rationales. The reward combines the self-reward term with the KL-divergence penalty: $r(z_k) = \log \pi_\theta(y \mid x, z_k) - \big(\log \pi_\theta(z_k \mid x) - \log \pi_0(z_k \mid x)\big)$.
- A maximum likelihood estimation (MLE) term, $\nabla_\theta \log \pi_\theta(y \mid x, z_k)$, for optimizing the LLM's ability to produce the correct answer given the query $x$ and the sampled rationale $z_k$. The full empirical gradient estimator over a minibatch $\mathcal{B}$, with $K$ rationales per example, is: $\hat{\nabla}_\theta J = \frac{1}{|\mathcal{B}|} \sum_{(x, y) \in \mathcal{B}} \frac{1}{K} \sum_{k=1}^{K} \Big[ \big( r(z_k) - \tfrac{1}{K-1} \sum_{j \neq k} r(z_j) \big) \nabla_\theta \log \pi_\theta(z_k \mid x) + \nabla_\theta \log \pi_\theta(y \mid x, z_k) \Big]$ (a code sketch of this update appears after this list).
- Practical Implementation:
- During training, for each data point $(x, y)$, $K$ rationales $z_1, \dots, z_K$ are sampled from the current LLM $\pi_\theta$.
- Rationales are truncated to a maximum length or until an end-of-answer token.
- The model parameters are updated using the estimated gradient. The overall algorithm is summarized in Algorithm 1.
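For completeness, here is the intermediate step behind the lower bound quoted above. It is the standard ELBO argument (importance weighting plus Jensen's inequality) written in the same notation, not an excerpt from the paper:

$$
\begin{aligned}
\log \mathbb{E}_{z \sim \pi_0(\cdot \mid x)}\big[\pi_\theta(y \mid x, z)\big]
&= \log \mathbb{E}_{z \sim q(\cdot \mid x)}\!\left[\pi_\theta(y \mid x, z)\,\frac{\pi_0(z \mid x)}{q(z \mid x)}\right] \\
&\ge \mathbb{E}_{z \sim q(\cdot \mid x)}\!\left[\log \pi_\theta(y \mid x, z) + \log \frac{\pi_0(z \mid x)}{q(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q(\cdot \mid x)}\big[\log \pi_\theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(q(\cdot \mid x)\,\|\,\pi_0(\cdot \mid x)\big).
\end{aligned}
$$

Choosing $q = \pi_\theta$ and averaging over training pairs $(x, y) \sim \mathcal{D}$ yields the self-rewarding objective $J(\theta)$ above.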
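To make the update concrete, below is a minimal Python sketch of the sampling, self-rewarding, and RLOO steps described in this list. It is an illustration under the procedure as summarized here, not the authors' Algorithm 1; `sample_rationale`, `logprob_theta`, `logprob_ref`, `score_rationales`, and `rloo_advantages` are hypothetical names for whatever model interface is available, and context concatenation is simplified to string concatenation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for the trainable LLM pi_theta and the frozen
# reference LLM pi_0 (not part of any real library):
#   sample_rationale(x)       -> one rationale string z sampled from pi_theta(. | x)
#   logprob_theta(text, ctx)  -> log pi_theta(text | ctx)
#   logprob_ref(text, ctx)    -> log pi_0(text | ctx)


@dataclass
class ScoredRationale:
    rationale: str
    reward: float          # self-reward minus the per-sample KL penalty
    answer_logprob: float  # log pi_theta(y | x, z), reused for the MLE term


def score_rationales(
    x: str,
    y: str,
    K: int,
    sample_rationale: Callable[[str], str],
    logprob_theta: Callable[[str, str], float],
    logprob_ref: Callable[[str, str], float],
) -> List[ScoredRationale]:
    """Sample K rationales for (x, y) and compute each one's self-reward."""
    scored = []
    for _ in range(K):
        z = sample_rationale(x)
        answer_lp = logprob_theta(y, x + z)                    # log pi_theta(y | x, z)
        kl_penalty = logprob_theta(z, x) - logprob_ref(z, x)   # single-sample KL estimate
        scored.append(ScoredRationale(z, answer_lp - kl_penalty, answer_lp))
    return scored


def rloo_advantages(rewards: List[float]) -> List[float]:
    """REINFORCE Leave-One-Out: each sample's baseline is the mean of the others."""
    K, total = len(rewards), sum(rewards)
    return [r - (total - r) / (K - 1) for r in rewards]


# The parameter update then follows the empirical estimator above: for each (x, y)
# in the minibatch, an autodiff framework minimizes the surrogate loss
#   -(1/K) * sum_k [ advantage_k * log pi_theta(z_k | x) + log pi_theta(y | x, z_k) ]
# with respect to theta (advantages treated as constants).
```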
Experiments and Results:
LaTRO was evaluated on GSM8K (mathematical reasoning) and ARC-Challenge (logical reasoning) datasets using Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B models.
- Baselines:
1. Base model (zero-shot CoT).
2. Supervised Fine-Tuning (SFT): on GSM8K, SFT used the gold rationales; on ARC-Challenge, SFT was trained to generate answers directly (no gold rationales are available).
- Key Findings on GSM8K:
- LaTRO significantly improved zero-shot accuracy over the base models: on average +12.5% with greedy decoding (GD) and +13.1% with self-consistency (maj@8).
- LaTRO also outperformed the SFT models: on average +9.6% (GD) and +13.2% (maj@8).
- For example, with Phi-3.5, LaTRO achieved 87.6% (GD) and 90.5% (maj@8), compared to the base model's 72.9% (GD) and 74.0% (maj@8) and SFT's 75.8% (GD) and 77.1% (maj@8).
- Key Findings on ARC-Challenge:
- LaTRO showed improvements, though smaller than on GSM8K.
- Over base models: average +1.0% (GD), +2.4% (maj@8).
- Over SFT models (which performed worse than base models): average +5.2% (GD), +8.1% (maj@8).
- For example, with Phi-3.5, LaTRO achieved 86.4% (GD) and 87.5% (maj@8), compared to the base model's 85.1% (GD) and 86.0% (maj@8) and SFT's 81.0% (GD) and 80.5% (maj@8).
- Ablation Studies (on GSM8K with Phi-3.5):
- Maximum Generation Length ($L$): Accuracy gains plateaued once the maximum rationale length was sufficiently large. Training with a shorter length budget still improved performance under that constraint, suggesting LaTRO can train models to produce more concise rationales.
- Inference-time Scaling (Self-Consistency): LaTRO-trained models still benefited from sampling multiple rationales at inference time (self-consistency), with accuracy continuing to improve as the number of sampled rationales grew (a minimal majority-voting sketch follows this list).
- Case Study: Qualitative analysis showed LaTRO-trained models generated more concise and correct reasoning steps compared to base and SFT models, which often made logical or arithmetic errors.
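As a reference point for the inference-time scaling result, self-consistency amounts to sampling several rationales independently and majority-voting their final answers. The sketch below illustrates this; `generate_answer` is a hypothetical routine that samples one chain of thought and returns its extracted final answer, and is not taken from the paper's code.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(
    question: str,
    generate_answer: Callable[[str], str],  # hypothetical: samples one CoT, returns its final answer
    num_samples: int = 8,                   # maj@8 as reported in the experiments
) -> str:
    """Sample num_samples answers independently and return the majority-voted one."""
    answers: List[str] = [generate_answer(question) for _ in range(num_samples)]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer
```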
Contributions:
- A theoretical formulation linking LLM reasoning optimization to latent variable models.
- A self-rewarding mechanism using the model's own probability estimates, eliminating the need for external reward models or human feedback for rationale quality.
- Demonstration of significant performance gains on reasoning tasks across multiple LLM architectures, unlocking latent reasoning capabilities.
Conclusion:
LaTRO presents a principled and effective method for enhancing LLM reasoning by enabling models to self-improve their rationale generation and evaluation abilities. The results suggest that pre-trained LLMs possess untapped reasoning potential that can be unlocked through this self-rewarding optimization. While computationally intensive due to the sampling of multiple rationales during training, LaTRO offers a promising direction for developing more capable and self-evolving LLMs. Future work could explore more efficient sampling, adaptive rationale generation, and the application of LaTRO to a broader range of tasks.