
Variational Reasoning for Language Models (2509.22637v1)

Published 26 Sep 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce a variational reasoning framework for LLMs that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of LLMs. Our code is available at https://github.com/sail-sg/variational-reasoning.

Summary

  • The paper introduces a unified probabilistic framework that models reasoning as latent 'thinking traces' to improve final answer accuracy.
  • It leverages an IWAE-style multi-trace objective and forward KL training to stabilize and debias reasoning optimization.
  • Empirical results show improved performance and stability across diverse benchmarks, with up to 160% improvement over base models and an 8.5% gain over the best baseline in average accuracy.

Variational Reasoning for LLMs: A Probabilistic Framework for Reasoning Optimization

Introduction

This paper presents a unified probabilistic framework for training LLMs to perform complex reasoning, leveraging variational inference to treat "thinking traces" as latent variables. The approach formalizes reasoning as a joint generative process over both the intermediate thinking steps and the final answer, and introduces a variational posterior to efficiently sample high-quality reasoning paths. The framework generalizes and connects existing supervised finetuning (SFT), rejection sampling finetuning (RFT), and reinforcement learning (RL) methods, revealing implicit biases and providing tighter, more stable training objectives.

Probabilistic Formulation and Variational Inference

The reasoning process is decomposed into a thinking trace $z$ and an answer $y$, with the model $\pi_\theta(z, y \mid x)$ generating both given a question $x$. The marginal probability of producing a correct answer is $P_\theta(\mathcal{Y} \mid x) = \sum_z \pi_\theta(\mathcal{Y} \mid z, x)\,\pi_\theta(z \mid x)$, where $\mathcal{Y}$ is the set of correct answers. Direct maximization of $\log P_\theta(\mathcal{Y} \mid x)$ is intractable due to the sum over all possible traces.

To address this, the paper derives an evidence lower bound (ELBO) using a variational posterior $q_\phi(z \mid x, y')$ conditioned on the question and an answer hint $y'$. The ELBO is:

$$\log P_\theta(\mathcal{Y} \mid x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x, y')}\!\left[\log \pi_\theta(\mathcal{Y} \mid z, x)\right] - \mathbb{D}_{\mathrm{KL}}\big(q_\phi(z \mid x, y') \,\|\, \pi_\theta(z \mid x)\big)$$

This formulation enables tractable optimization and allows the variational posterior to focus on reasoning paths likely to yield correct answers.
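
For completeness, the bound follows from an importance-sampling rewrite of the marginal and Jensen's inequality; this is a standard derivation restated here in the paper's notation, not an addition to the method:

$$\log P_\theta(\mathcal{Y} \mid x) = \log \mathbb{E}_{q_\phi(z \mid x, y')}\!\left[\frac{\pi_\theta(\mathcal{Y} \mid z, x)\,\pi_\theta(z \mid x)}{q_\phi(z \mid x, y')}\right] \;\geq\; \mathbb{E}_{q_\phi(z \mid x, y')}\!\left[\log \pi_\theta(\mathcal{Y} \mid z, x) + \log \frac{\pi_\theta(z \mid x)}{q_\phi(z \mid x, y')}\right],$$

where the second term inside the expectation becomes exactly $-\mathbb{D}_{\mathrm{KL}}(q_\phi(z \mid x, y') \,\|\, \pi_\theta(z \mid x))$ once the expectation is taken. The multi-trace objective in the next section tightens this bound by averaging the importance ratio over $K$ samples before applying the logarithm.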

IWAE-Style Multi-Trace Extension

The framework is extended to an IWAE-style multi-trace objective, sampling $K$ traces per question to tighten the lower bound:

$$\mathcal{L}_{\mathrm{ELBO}^K} = \mathbb{E}_{z_{1:K} \sim q_\phi(z \mid x, y')}\!\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{\pi_\theta(z_k, \mathcal{Y} \mid x)}{q_\phi(z_k \mid x, y')}\right]$$

Gradients are computed using normalized importance weights $\widetilde{\rho}_k$, with practical estimators for $\pi_\theta(\mathcal{Y} \mid z_k, x)$ based on either answer accuracy or the geometric mean of token likelihoods to mitigate length bias and variance.
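
To illustrate how the normalized weights might be computed in practice, here is a minimal sketch (not the authors' implementation; the names `log_prior`, `log_q`, and `log_answer_given_trace` are hypothetical placeholders for per-trace log-probabilities obtained from $\pi_\theta$ and $q_\phi$):

```python
import numpy as np

def iwae_weights(log_prior, log_q, log_answer_given_trace):
    """Normalized importance weights for K sampled thinking traces.

    log_prior[k]              ~ log pi_theta(z_k | x)
    log_q[k]                  ~ log q_phi(z_k | x, y')
    log_answer_given_trace[k] ~ log-estimate of pi_theta(Y | z_k, x), e.g. the log
                                of answer accuracy or of the geometric-mean token
                                likelihood, per the estimators described above.
    """
    log_rho = (np.asarray(log_answer_given_trace)
               + np.asarray(log_prior)
               - np.asarray(log_q))
    # Normalize in log-space for numerical stability (softmax over the K traces).
    log_rho -= log_rho.max()
    rho = np.exp(log_rho)
    return rho / rho.sum()  # \tilde{rho}_k, used to weight per-trace gradients

# Example: 4 traces sampled for one question.
w = iwae_weights(
    log_prior=[-35.2, -40.1, -33.8, -38.0],
    log_q=[-30.0, -31.5, -29.7, -30.8],
    log_answer_given_trace=[np.log(0.9), np.log(0.1), np.log(0.7), np.log(0.4)],
)
print(w)  # normalized weights summing to 1
```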

Forward KL Training for Variational Posterior

Empirical observations show that reverse-KL optimization of $q_\phi$ can lead to collapse or shortcut reasoning. The paper proposes optimizing $q_\phi$ via the forward KL divergence:

$$\mathbb{D}_{\mathrm{KL}}\big(P_\theta(z \mid \mathcal{Y}, x) \,\|\, q_\phi(z \mid x, y')\big)$$

This is implemented as weighted SFT, sampling traces from $\pi_\theta$ and weighting by answer accuracy, which stabilizes training and prevents collapse.
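
A minimal sketch of this weighted-SFT interpretation, assuming per-trace negative log-likelihoods under the posterior model and a scalar accuracy per sampled trace (the names `nll_under_q` and `accuracies` are illustrative, not the paper's code):

```python
import torch

def weighted_sft_loss(nll_under_q: torch.Tensor, accuracies: torch.Tensor) -> torch.Tensor:
    """Accuracy-weighted SFT objective for the variational posterior q_phi.

    nll_under_q[i]: negative log-likelihood of trace i under q_phi(z | x, y'),
                    where the traces were sampled from pi_theta (not from q_phi).
    accuracies[i]:  fraction of correct final answers produced from trace i.

    Weighting traces by accuracy acts as a self-normalized importance-sampling
    approximation of the forward KL D_KL(P_theta(z | Y, x) || q_phi(z | x, y')),
    up to an additive constant, since P_theta(z | Y, x) is proportional to
    pi_theta(Y | z, x) * pi_theta(z | x).
    """
    weights = accuracies / accuracies.sum().clamp_min(1e-8)  # normalize over the batch
    return (weights * nll_under_q).sum()

# Example with 3 sampled traces.
loss = weighted_sft_loss(
    nll_under_q=torch.tensor([120.0, 95.0, 140.0]),
    accuracies=torch.tensor([1.0, 0.5, 0.0]),
)
print(loss)
```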

Connections to SFT, RFT, and RL

The framework reveals that RFT and binary-reward RL objectives are equivalent to forward KL optimization weighted by model accuracy, which biases training toward easier questions. In contrast, the variational reasoning objective treats all questions more evenly, reducing this bias. The analysis extends to GRPO and general RL reward shaping, showing that reward normalization further amplifies the bias toward high-accuracy (easy) questions.
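
One way to see the implicit accuracy weighting (an informal sketch of the argument, stated here in outline rather than reproducing the paper's derivation): under a binary correctness reward, whose conditional expectation given a trace $z$ is $\pi_\theta(\mathcal{Y} \mid z, x)$, the per-question policy-gradient term rearranges as

$$\mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[\pi_\theta(\mathcal{Y} \mid z, x)\,\nabla_\theta \log \pi_\theta(z \mid x)\big] = P_\theta(\mathcal{Y} \mid x)\; \mathbb{E}_{z \sim P_\theta(\cdot \mid \mathcal{Y}, x)}\big[\nabla_\theta \log \pi_\theta(z \mid x)\big].$$

Treating the trace posterior $P_\theta(z \mid \mathcal{Y}, x)$ as fixed, the right-hand expectation is (up to sign) the gradient of a local forward KL toward that posterior, and the per-question factor $P_\theta(\mathcal{Y} \mid x)$ means that questions the model already answers correctly receive proportionally larger updates, which is the bias toward easier questions noted above.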

Implementation Details

  • Model Architecture: The framework is implemented on the Qwen2.5 and Qwen3 model families, with separate models for the reasoning policy $\pi_\theta$ and the variational posterior $q_\phi$.
  • Training Pipeline: Initial models are trained via SFT, followed by variational posterior training with forward KL, and final reasoning model training using the IWAE-style objective and weighted SFT.
  • Sampling and Weighting: For each question, multiple traces are sampled from $q_\phi$, and importance weights are computed using accuracy-based or geometric-mean estimators (a sketch of both estimators appears after this list).
  • Prompt Engineering: Robustness to prompt templates is demonstrated, with both "Solution/Explanation" and "Solution/Thought" formats yielding similar results.
  • Computational Considerations: Scaling the number of traces $K$ improves performance but increases computational cost, requiring trade-offs in practice.
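
Below is a minimal sketch of the two weighting estimators referenced in the Sampling and Weighting item above (accuracy-based and geometric-mean token likelihood); the function names and inputs are illustrative assumptions, not the repository's API:

```python
import math

def accuracy_estimate(num_correct: int, num_answer_samples: int) -> float:
    """Estimate pi_theta(Y | z, x) as the fraction of answers sampled
    from trace z that are graded correct."""
    return num_correct / max(num_answer_samples, 1)

def geometric_mean_likelihood(answer_token_logprobs: list[float]) -> float:
    """Length-normalized likelihood of the correct answer given trace z:
    exp of the mean token log-probability, which avoids the length bias
    of a raw product of per-token probabilities."""
    if not answer_token_logprobs:
        return 0.0
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

# Example usage.
print(accuracy_estimate(num_correct=3, num_answer_samples=4))   # 0.75
print(geometric_mean_likelihood([-0.2, -0.1, -0.4, -0.3]))      # ~0.78
```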

Empirical Results

  • Performance: The variational reasoning framework consistently outperforms strong baselines (Bespoke-Stratos, General-Reasoner, RLT) across math, code, and general reasoning benchmarks, with up to 160% improvement over base models and 8.5% over the best baseline on average accuracy.
  • Generalization: Gains extend to out-of-distribution tasks (GPQA-Diamond, MMLU-Pro), indicating robust reasoning improvements.
  • Stability: Training loss and gradient norms are more stable than for the baselines, attributed to adaptive importance weighting (Figure 1).

    Figure 1: Training loss and gradient norm of different methods during Qwen3-Base model training, showing improved stability for variational reasoning.

  • Pass@K Analysis: The advantage of variational reasoning increases with larger $K$ on complex tasks, while diminishing on simpler or multiple-choice tasks (Figure 2); the standard Pass@K estimator is sketched after this list.

    Figure 2: Pass@K comparison of baselines versus variational reasoning, highlighting superior performance on complex benchmarks.

  • Scaling Effects: Increasing the number of sampled traces $K$ yields further accuracy improvements, with diminishing returns at high $K$ (Figure 3).

    Figure 3: Effects of scaling up the number of thinking traces ($K$) on final model performance.

  • Estimator Ablations: Accuracy-based and geometric mean estimators outperform naive likelihood, validating theoretical analysis.
  • Length Bias: Density maps reveal a strong correlation between trace length and likelihood ratios, justifying the estimator choices (Figure 4).

    Figure 4: Density maps of thinking token length versus log-likelihood ratio, and answer token length versus log-likelihood, illustrating length bias.
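
For reference, Pass@K numbers of the kind discussed above are typically computed with the standard unbiased estimator of Chen et al. (2021); the following is a generic sketch, not the paper's evaluation code:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled solutions of which c are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any draw of k contains a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 16 samples per problem, 5 correct, evaluate pass@4.
print(round(pass_at_k(n=16, c=5, k=4), 3))  # 0.819
```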

Theoretical and Practical Implications

The framework provides a principled probabilistic perspective that unifies variational inference and RL-style methods for reasoning. It clarifies the implicit biases in existing approaches and offers stable, scalable objectives for improving reasoning ability. The analysis of estimator variance and bias informs practical implementation choices, and the connection to reward shaping in RL suggests avenues for debiasing and more equitable training.

Future Directions

Potential extensions include multi-round training (beyond $T=1$), richer posterior designs for answer hints, and application to broader domains. The framework's generality suggests applicability to agentic reasoning, program synthesis, and scientific discovery tasks.

Conclusion

The variational reasoning framework advances the training of reasoning LLMs by formalizing thinking traces as latent variables and optimizing via variational inference. It achieves strong empirical gains, improved stability, and theoretical clarity, while subsuming and debiasing existing SFT, RFT, and RL methods. This work lays the foundation for principled, scalable reasoning optimization in LLMs.
