Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (2411.04282v2)

Published 6 Nov 2024 in cs.AI, cs.CL, cs.LG, and stat.ML

Abstract: LLMs have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at \url{https://github.com/SalesforceAIResearch/LaTRO}.

This paper introduces LaTRO (Latent Reasoning Optimization), a novel framework designed to enhance the reasoning capabilities of LLMs without relying on external feedback or pre-existing reward models. The core idea is to treat the reasoning process (the "thoughts" or "rationales" an LLM generates) as sampling from a latent distribution. LaTRO then optimizes this process using a variational approach, enabling the LLM to simultaneously improve its ability to generate high-quality reasoning steps and to evaluate the quality of these steps. This self-improvement loop is termed "self-rewarding."

The authors argue that while prompt-based methods like Chain-of-Thought (CoT) improve reasoning at inference time, optimizing these capabilities during training remains difficult due to the scarcity of high-quality reasoning data and the challenges of developing accurate external reward models for reinforcement learning. LaTRO addresses this by leveraging the LLM's own probability estimates as a reward signal.

Methodology:

  1. Problem Formulation: Standard LLM fine-tuning maximizes the likelihood of generating a correct answer $y$ given a query $x$, i.e., $\log \pi_\theta(y \mid x)$. LaTRO introduces a latent reasoning rationale $z$ and aims to optimize the LLM $\pi_\theta$ by maximizing the expected log-likelihood of the answer given the query and the self-generated rationale, while regularizing the rationale generation process.
  2. Variational Approach: The objective $\log \pi_\theta(y \mid x)$ is lower-bounded by an expression involving an auxiliary distribution $q(z \mid x)$ over rationales: $E_{q(z \mid x)}\big[\log \pi_{\theta}(y \mid x \oplus z)\big] - D_{KL}\big[q(z \mid x) \,\|\, \pi_0(z \mid x)\big]$, where $\pi_0$ is a prior reference LLM (typically the initial state of $\pi_\theta$).
  3. Self-Rewarding Objective: LaTRO simplifies this by setting the "reasoner" $q(z \mid x)$ to be the LLM $\pi_\theta(z \mid x)$ itself. The optimization objective becomes $J(\theta) = E_{(x, y)\sim D_{\text{Gold}}} \Big[ E_{z \sim \pi_{\theta}(\cdot \mid x)}\big[\underbrace{\log \pi_{\theta}(y \mid x \oplus z)}_{R_\theta(z, y, x)}\big] - D_{KL}\big[\pi_\theta(z \mid x) \,\|\, \pi_0(z \mid x)\big] \Big]$. Here, $R_\theta(z, y, x) = \log \pi_{\theta}(y \mid x \oplus z)$ acts as a self-generated reward: rationales $z$ that lead to a higher probability of the correct answer $y$ are considered better.
  4. Gradient Estimation: The gradient of $J(\theta)$ involves two main parts (a minimal sketch of the resulting update follows this list):
    • A policy gradient term for optimizing the rationale generator $\pi_\theta(z \mid x)$, using the REINFORCE Leave-One-Out (RLOO) estimator to reduce variance. The advantage $A_k^{(i)}$ for a sampled rationale $z_k^{(i)}$ is computed from its reward $r(z_k^{(i)})$ relative to the average reward of the other $K-1$ sampled rationales. The reward includes the self-reward term and a KL-divergence penalty: $r(z_k^{(i)}) := \log \pi_\theta(y_i \mid x_i \oplus z_{k}^{(i)}) - \beta \log \frac{\pi_{\theta}(z_k^{(i)} \mid x_i)}{\pi_{0}(z_k^{(i)} \mid x_i)}$.
    • A maximum likelihood estimation (MLE) term for optimizing the LLM's ability to produce the correct answer $y_i$ given the query $x_i$ and the sampled rationale $z_k^{(i)}$. The full empirical gradient estimator is $\nabla_{\theta} \widehat{J}(\theta) := \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\Big( \nabla_\theta \log \pi_{\theta}(z_k^{(i)} \mid x_i)\cdot A_k^{(i)} + \nabla_\theta \log \pi_\theta (y_i \mid x_i \oplus z_k^{(i)}) \Big)$.
  5. Practical Implementation:
    • During training, for each data point $(x_i, y_i)$, $K$ rationales $z_k^{(i)}$ are sampled from the current LLM $\pi_\theta$.
    • Rationales are truncated at a maximum length $L$ or at an end-of-answer token.
    • The model parameters $\theta$ are updated using the estimated gradient. The overall algorithm is summarized in Algorithm 1 of the paper.
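
To make the update concrete, below is a minimal PyTorch-style sketch of the per-query loss implied by the gradient estimator above, for a single example $x$ with $K$ sampled rationales. The function name `latro_loss`, the tensor layout, and the value of $\beta$ are illustrative assumptions rather than the authors' implementation; only the reward, RLOO-advantage, and loss arithmetic follows the formulas in this section.

```python
# Sketch of one LaTRO update for a single query x with K sampled rationales.
# The three log-probability tensors are assumed to come from the policy LLM
# pi_theta and the frozen reference LLM pi_0; how they are computed is omitted.
import torch

def latro_loss(logp_answer: torch.Tensor,        # [K] log pi_theta(y | x ⊕ z_k), requires grad
               logp_rationale: torch.Tensor,     # [K] log pi_theta(z_k | x), requires grad
               logp_rationale_ref: torch.Tensor, # [K] log pi_0(z_k | x), no grad
               beta: float = 0.05) -> torch.Tensor:  # beta is a placeholder KL coefficient
    K = logp_answer.shape[0]

    # Self-generated reward with KL penalty (held constant for the policy gradient):
    # r(z_k) = log pi_theta(y | x ⊕ z_k) - beta * log[ pi_theta(z_k | x) / pi_0(z_k | x) ]
    reward = logp_answer.detach() - beta * (logp_rationale.detach() - logp_rationale_ref)

    # RLOO advantage: each rationale's reward minus the mean reward of the other K-1 samples.
    baseline = (reward.sum() - reward) / (K - 1)
    advantage = reward - baseline

    # Policy-gradient term for the rationale generator plus the MLE term for the answer;
    # minimizing this loss reproduces the stated gradient estimator (up to sign).
    pg_loss = -(logp_rationale * advantage).mean()
    mle_loss = -logp_answer.mean()
    return pg_loss + mle_loss

# Toy usage with random log-probabilities standing in for real model outputs.
K = 4
logp_answer = torch.randn(K, requires_grad=True)
logp_rationale = torch.randn(K, requires_grad=True)
logp_rationale_ref = torch.randn(K)
latro_loss(logp_answer, logp_rationale, logp_rationale_ref).backward()
```

In the full algorithm this loss would also be averaged over the $N$ queries in a batch and $\beta$ would be tuned; both are omitted here for brevity.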

Experiments and Results:

LaTRO was evaluated on GSM8K (mathematical reasoning) and ARC-Challenge (logical reasoning) datasets using Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B models.

  • Baselines:
    • Base model (zero-shot CoT).
    • Supervised Fine-Tuning (SFT): on GSM8K, SFT used the golden rationales; on ARC-Challenge, SFT was trained to directly generate answers (no golden rationales are available).
  • Key Findings on GSM8K:
    • LaTRO significantly improved zero-shot accuracy over base models (average +12.5% for greedy decoding, +13.1% for self-consistency maj@8).
    • LaTRO outperformed SFT models (average +9.6% for greedy decoding, +13.2% for self-consistency maj@8).
    • For example, with Phi-3.5, LaTRO achieved 87.6% (GD) and 90.5% (maj@8) compared to base model's 72.9% (GD) and 74.0% (maj@8), and SFT's 75.8% (GD) and 77.1% (maj@8).
  • Key Findings on ARC-Challenge:
    • LaTRO showed improvements, though smaller than on GSM8K.
    • Over base models: average +1.0% (GD), +2.4% (maj@8).
    • Over SFT models (which performed worse than base models): average +5.2% (GD), +8.1% (maj@8).
    • For example, with Phi-3.5, LaTRO achieved 86.4% (GD) and 87.5% (maj@8) compared to base model's 85.1% (GD) and 86.0% (maj@8), and SFT's 81.0% (GD) and 80.5% (maj@8).
  • Ablation Studies (on GSM8K with Phi-3.5):
    • Maximum Generation Length ($L$): Accuracy gains plateaued for $L \geq 500$ tokens. Training with a shorter length (e.g., $L = 200$, denoted $\text{LaTRO}_{200}$) still improved performance under this constraint, suggesting LaTRO can train models to produce more concise rationales.
    • Inference-time Scaling (Self-Consistency): LaTRO-trained models still benefited from sampling multiple rationales at inference time (self-consistency), with performance improving up to $k = 8$ samples (a minimal majority-voting sketch follows this list).
  • Case Study: Qualitative analysis showed LaTRO-trained models generated more concise and correct reasoning steps compared to base and SFT models, which often made logical or arithmetic errors.
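
To complement the inference-time scaling ablation, here is a minimal sketch of self-consistency (maj@k) voting: sample $k$ rationale-plus-answer completions independently and return the most frequent parsed answer. The helpers `sample_completion` and `extract_answer` are hypothetical stand-ins for the trained model's sampler and the benchmark's answer parser.

```python
# Minimal self-consistency (maj@k) sketch: sample k completions, parse each
# final answer, and return the majority answer.
from collections import Counter
from typing import Callable

def majority_vote(question: str,
                  sample_completion: Callable[[str], str],  # hypothetical model sampler
                  extract_answer: Callable[[str], str],     # hypothetical answer parser
                  k: int = 8) -> str:
    answers = [extract_answer(sample_completion(question)) for _ in range(k)]
    # Ties are broken in favor of the answer encountered first.
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a dummy sampler that always "reasons" to the same answer.
if __name__ == "__main__":
    dummy_sampler = lambda q: "Step 1: ... Step 2: ... #### 42"
    dummy_parser = lambda text: text.split("####")[-1].strip()
    print(majority_vote("What is 6 * 7?", dummy_sampler, dummy_parser, k=8))  # prints "42"
```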

Contributions:

  1. A theoretical formulation linking LLM reasoning optimization to latent variable models.
  2. A self-rewarding mechanism using the model's own probability estimates, eliminating the need for external reward models or human feedback for rationale quality.
  3. Demonstration of significant performance gains on reasoning tasks across multiple LLM architectures, unlocking latent reasoning capabilities.

Conclusion:

LaTRO presents a principled and effective method for enhancing LLM reasoning by enabling models to self-improve their rationale generation and evaluation abilities. The results suggest that pre-trained LLMs possess untapped reasoning potential that can be unlocked through this self-rewarding optimization. While computationally intensive due to multiple rationale sampling during training, LaTRO offers a promising direction for developing more capable and self-evolving LLMs. Future work could explore more efficient sampling or adaptive rationale generation, and applying LaTRO to a broader range of tasks.

Authors (11)
  1. Haolin Chen
  2. Yihao Feng
  3. Zuxin Liu
  4. Weiran Yao
  5. Akshara Prabhakar
  6. Shelby Heinecke
  7. Ricky Ho
  8. Phil Mui
  9. Silvio Savarese
  10. Caiming Xiong
  11. Huan Wang