
Latent-Variable Inference for Chain-of-Thought

Updated 8 February 2026
  • The paper formalizes LLM reasoning as latent-variable inference for chain-of-thought, enabling principled marginalization and robust fine-tuning.
  • It details EM, variational, and RL algorithms that reduce gradient variance and improve sample efficiency in both language and vision-language models.
  • Empirical results on benchmarks like GSM8K, BBH, and Math500 demonstrate significant accuracy gains and efficiency improvements over baseline methods.

Training chain-of-thought (CoT) via latent-variable inference refers to the probabilistic modeling paradigm in which the intermediate reasoning processes of LLMs are formalized as latent variables. This approach enables marginalization over unobserved rationales, principled amortized inference, and a range of EM, variational, and RL-based fine-tuning strategies for both language models and vision-language models. Below, key theoretical and methodological facets are detailed, with representative models and empirical results.

1. Probabilistic Foundations: CoT as Latent-Variable Modeling

The central premise is to model the answer $y$ to an input $x$ as being mediated by a stochastic reasoning chain or rationale $z$. Formally, the generative model factorizes as

$$p_\theta(y, z \mid x) = p_\theta(z \mid x)\, p_\theta(y \mid x, z)$$

with $z$ ranging over all possible textual or structural rationales consistent with $x$. In practice, the log-marginal likelihood objective,

$$\mathcal{L}(\theta) = \log p_\theta(y \mid x) = \log \sum_z p_\theta(z \mid x)\, p_\theta(y \mid x, z),$$

is intractable, prompting the use of approximate inference—typically expectation-maximization (EM), variational inference, or their amortized, RL-augmented, or multi-sample variants (Phan et al., 2023, Tang et al., 25 Mar 2025, Xu et al., 2023, Sun et al., 27 Oct 2025, Yao et al., 5 May 2025, Wu et al., 10 Jul 2025, Liu et al., 29 Sep 2025).
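The intractable sum above is typically replaced by a sampled estimate. Below is a minimal sketch of a Monte Carlo estimator of $\log p_\theta(y \mid x)$ that samples rationales from the prior and uses log-sum-exp for numerical stability; the two-valued toy model is hypothetical, standing in for an LLM's rationale distribution.

```python
import math
import random

def log_marginal_mc(sample_z, log_p_y_given_xz, n=1000):
    """Monte Carlo estimate of log p(y|x) = log E_{z~p(z|x)}[p(y|x,z)].

    sample_z:           draws a rationale z from the prior p(z|x).
    log_p_y_given_xz:   evaluates log p(y|x,z) for a sampled z.
    Uses log-sum-exp for numerical stability.
    """
    logs = [log_p_y_given_xz(sample_z()) for _ in range(n)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs)) - math.log(n)

# Toy model: z in {0,1} with p(z=1|x)=0.3; p(y|x,z) = 0.9 if z==1 else 0.1.
# Exact marginal: 0.3*0.9 + 0.7*0.1 = 0.34.
random.seed(0)
est = log_marginal_mc(lambda: random.random() < 0.3,
                      lambda z: math.log(0.9 if z else 0.1),
                      n=20000)
print(math.exp(est))  # close to 0.34
```

With a discrete toy rationale the estimator can be checked against the exact marginal; with textual rationales the same estimator applies, but each sample is a decoded chain.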

2. Training Algorithms: EM, Variational, and RL Approaches

Latent-variable CoT training divides into several algorithmic strategies:

EM-based Algorithms

Classic EM-style solutions, as in TRICE (Phan et al., 2023), maintain per-example latent rationale chains and alternate between:

  • E-step: Approximate $p_\theta(z \mid x, y)$ via MCMC (e.g., Metropolis–Hastings using $p_\theta(z \mid x)$ as the proposal);
  • M-step: Update $\theta$ using expected gradients under the imputed $z$.

Variance reduction is achieved by control-variate techniques, with the variance of the gradient estimator provably vanishing as the model improves (Phan et al., 2023).
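The E-step's independence Metropolis–Hastings sampler can be sketched on a toy model. With the prior as proposal, the prior terms cancel and the acceptance ratio reduces to a likelihood ratio; all distributions below are hypothetical stand-ins, not TRICE's actual LLM components.

```python
import random

def mh_posterior_samples(sample_prior, lik, z0, steps=5000):
    """Independence Metropolis-Hastings targeting p(z|x,y), proposing
    from the prior p(z|x); with this proposal the acceptance ratio
    reduces to the likelihood ratio p(y|x,z') / p(y|x,z)."""
    z, samples = z0, []
    for _ in range(steps):
        zp = sample_prior()
        if random.random() < min(1.0, lik(zp) / lik(z)):
            z = zp                     # accept the proposed rationale
        samples.append(z)
    return samples

# Toy model (hypothetical): z in {0,1}, prior p(z=1)=0.3,
# likelihood p(y|z) = 0.9 if z == 1 else 0.1.
# Exact posterior: p(z=1|y) = 0.3*0.9 / 0.34 ~ 0.794.
random.seed(1)
samps = mh_posterior_samples(lambda: int(random.random() < 0.3),
                             lambda z: 0.9 if z else 0.1, z0=0)
print(sum(samps) / len(samps))  # close to 0.794
```

The empirical frequency of $z=1$ along the chain approximates the exact posterior, illustrating why a well-matched proposal makes the E-step cheap.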

Variational Inference

Variational techniques introduce a tractable inference distribution $q_\phi(z \mid x, y)$ to lower-bound $\log p_\theta(y \mid x)$:

$$\mathcal{L}_{\rm ELBO} = \mathbb{E}_{q_\phi(z \mid x, y)} \left[\log p_\theta(z, y \mid x) - \log q_\phi(z \mid x, y)\right].$$

JEPO (Tang et al., 25 Mar 2025) simplifies this to Jensen's lower bound by setting $q_\phi(z \mid x, y) = p_\theta(z \mid x)$, yielding a pragmatic, scalable estimator without auxiliary inference networks. GFlowNet-based amortized inference for sequential or multimodal CoT has also been adopted (Sun et al., 27 Oct 2025).
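A small numerical illustration of the Jensen bound obtained by setting $q_\phi(z \mid x, y) = p_\theta(z \mid x)$, on a hypothetical Bernoulli-rationale model: the bound is computed exactly and by Monte Carlo, and sits below the exact log-marginal.

```python
import math
import random

# Toy latent-variable model (hypothetical numbers): rationale
# z ~ Bernoulli(0.3); answer likelihood p(y|x,z) = 0.9 if z == 1 else 0.1.
prior_z1, lik = 0.3, {1: 0.9, 0: 0.1}

# Exact log-marginal and exact Jensen bound, for reference.
log_marginal = math.log(prior_z1 * lik[1] + (1 - prior_z1) * lik[0])
jensen_exact = prior_z1 * math.log(lik[1]) + (1 - prior_z1) * math.log(lik[0])

# Monte Carlo estimate of the Jensen bound, sampling z from the prior
# (the JEPO-style choice q(z|x,y) = p(z|x): no inference network needed).
random.seed(0)
n = 20000
jensen_mc = sum(math.log(lik[int(random.random() < prior_z1)])
                for _ in range(n)) / n

print(jensen_mc, jensen_exact, log_marginal)  # bound < log-marginal
```

The gap between bound and log-marginal is the KL divergence from the prior to the true posterior, which shrinks as the prior concentrates on answer-consistent rationales.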

Distributional and Reinforcement-Learning Extensions

CoT is further cast as a Markov Decision Process (MDP) with latent states (CTRLS) (Wu et al., 10 Jul 2025), or as a Markov chain of continuous thoughts (MARCOS) (Liu et al., 29 Sep 2025). On-policy RL and distributional RL (e.g., entropy-regularized policy gradient, epsilon-greedy exploration) are used to refine the latent transition dynamics, supporting diverse trajectory exploration and robust reasoning.
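The entropy-regularized policy-gradient machinery used to refine latent transitions can be shown on a two-action toy policy, where the regularized optimum is available in closed form, $\pi^* \propto \exp(r/\tau)$. The rewards and temperature here are hypothetical, not CTRLS's actual setup.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_reg_ascent(rewards, tau=0.5, lr=0.1, steps=2000):
    """Gradient ascent on J(theta) = sum_a pi(a) r(a) + tau * H(pi),
    with pi = softmax(theta). The optimum is pi* proportional to exp(r/tau)."""
    n = len(rewards)
    theta = [0.0] * n
    for _ in range(steps):
        pi = softmax(theta)
        # Per-action "advantage" r_a - tau * log pi_a, baselined by its mean.
        adv = [rewards[a] - tau * math.log(pi[a]) for a in range(n)]
        base = sum(pi[a] * adv[a] for a in range(n))
        theta = [theta[a] + lr * pi[a] * (adv[a] - base) for a in range(n)]
    return softmax(theta)

pi = entropy_reg_ascent([1.0, 0.0], tau=0.5)
target = softmax([1.0 / 0.5, 0.0 / 0.5])  # closed-form optimum exp(r/tau)
```

The entropy bonus keeps probability mass on the lower-reward action, which is exactly the mechanism that preserves diverse latent trajectories during exploration.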

3. Model Architectures and Inference

Amortized Encoders and Latent-Skill Modeling

In LaRS (Xu et al., 2023), rationales are embedded into a latent skill space $\mathbf{z} \in \mathbb{R}^d$ via a conditional variational autoencoder (CVAE):

  • Encoder $q_\omega(\mathbf{z} \mid Q, R)$ maps question–rationale pairs to latent skill vectors.
  • Prior/policy $\pi_\phi(\mathbf{z} \mid Q)$ predicts the skill distribution conditioned on the new input.
  • Decoder $p_\psi(R \mid \mathbf{z}, Q)$ reconstructs rationales from $(Q, \mathbf{z})$.

At inference, latent skills inferred from the test question steer example selection by matching latent skills to the example bank (cosine similarity), leading to highly efficient and theory-grounded in-context learning pipelines.
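The skill-matching retrieval step can be sketched directly; the skill vectors and example IDs below are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two skill vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_examples(query_skill, bank, k=2):
    """Rank the example bank by cosine similarity between the query's
    inferred skill vector and each example's skill vector (LaRS-style)."""
    ranked = sorted(bank, key=lambda ex: cosine(query_skill, ex["skill"]),
                    reverse=True)
    return ranked[:k]

# Hypothetical example bank with precomputed latent skill vectors.
bank = [
    {"id": "algebra-1",  "skill": [0.9, 0.1, 0.0]},
    {"id": "geometry-7", "skill": [0.0, 1.0, 0.2]},
    {"id": "algebra-3",  "skill": [0.8, 0.2, 0.1]},
]
picked = select_examples([1.0, 0.0, 0.0], bank)
print([ex["id"] for ex in picked])  # ['algebra-1', 'algebra-3']
```

Because skill vectors are computed once per banked example, retrieval at test time is a single encoder call plus nearest-neighbor search, with no extra LLM queries.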

Latent Markov and Continuous-State Models

MARCOS (Liu et al., 29 Sep 2025) employs continuous latent thoughts $\mathbf{z}_{1:T}$ transitioned via a “Thinker” transformer, with each emission of observable CoT steps mediated by a “Speaker” model. The approximate posterior is parameterized as $q_\phi(R_t \mid x_t)$, with training consisting of both local pretraining (omitting the prior KL term) and global ELBO maximization.

Parallel architectures, such as PCCoT (Wu et al., 23 Jun 2025), further decouple sequential dependency via Jacobi iteration, yielding order-of-magnitude efficiency gains while maintaining accuracy.
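A scalar caricature of the Jacobi idea, assuming a chain of thoughts $z_t = f(z_{t-1})$: every position is refreshed in parallel from the previous sweep, and for a pure chain the parallel iterates match the sequential computation within $T$ sweeps. The contraction $f$ here is made up, not PCCoT's actual update.

```python
def jacobi_thoughts(f, z0, T, tol=1e-12):
    """Jacobi-style parallel refinement of a chain of latent thoughts
    z_t = f(z_{t-1}). All T positions are updated simultaneously from
    the previous iterate; for a pure chain the fixed point is reached
    in at most T sweeps, with early exit once the iterate stabilizes."""
    z = [z0] * T                      # initial guess for every position
    for _ in range(T):
        z_new = [f(z0 if t == 0 else z[t - 1]) for t in range(T)]
        if all(abs(a - b) < tol for a, b in zip(z, z_new)):
            break                     # converged early
        z = z_new
    return z

# Hypothetical scalar "thought" update: a contraction toward 2.0.
f = lambda z: 0.5 * z + 1.0

# Reference: strictly sequential left-to-right computation.
seq, z = [], 0.0
for _ in range(5):
    z = f(z)
    seq.append(z)
```

The efficiency gain comes from the early exit: when later positions stabilize before $T$ sweeps, the parallel schedule does less total sequential work than a strict left-to-right pass.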

Manifold and Quality-Based Latent Steering

GeoSteer (Kazama et al., 15 Jan 2026) maps hidden states of CoT segments into a low-dimensional VAE manifold, learns a differentiable segment-quality function, and applies latent gradient-based steering at each inference step. This approach aligns internal activations with regions of high-quality reasoning, preserving both geometric coherence and quantitative answer accuracy.
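The steering step can be sketched as gradient ascent of a latent code on a differentiable quality function; the quadratic quality surface below is hypothetical, whereas GeoSteer learns its quality function from data.

```python
def steer_latent(z, quality_grad, step=0.1, n_steps=50):
    """Gradient-based steering in latent space: nudge the latent code z
    uphill on a (learned) segment-quality function, one small step per
    inference iteration. GeoSteer-style sketch only."""
    for _ in range(n_steps):
        z = [zi + step * gi for zi, gi in zip(z, quality_grad(z))]
    return z

# Hypothetical quality function peaked at z* = (1.0, -2.0):
# q(z) = -||z - z*||^2, so grad q(z) = 2 * (z* - z).
z_star = (1.0, -2.0)
grad_q = lambda z: [2 * (s - zi) for s, zi in zip(z_star, z)]
steered = steer_latent([0.0, 0.0], grad_q)
```

In the actual system the step is applied per reasoning segment, so the latent trajectory stays close to the manifold of high-quality intermediate states rather than jumping straight to a fixed optimum.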

4. Evidence Lower Bounds, Gradient Estimation, and Variance Reduction

A central technical feature is the use of evidence lower bounds (ELBOs) and their optimization via stochastic gradient estimators. For instance:

  • EM/ELBO Gradient: The key estimator computes

$$\nabla_\theta \log p_\theta(y \mid x) = \mathbb{E}_{z \sim p_\theta(z \mid x, y)} \left[ \nabla_\theta \log p_\theta(z \mid x) \right],$$

with practical algorithms approximating this via MCMC or variational sampling (Phan et al., 2023, Tang et al., 25 Mar 2025).

  • Variance Reduction: Control-variate (score function) methods use baseline terms (e.g., acceptance rates or leave-one-out averages) to minimize estimator variance (Phan et al., 2023, Yao et al., 5 May 2025).
  • Dynamic Sampling: GVM-RAFT (Yao et al., 5 May 2025) allocates sampling budgets adaptively using gradient-norm and acceptance-rate analysis, provably accelerating convergence for gradient-based EM and RL.
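One way to make the adaptive-allocation idea concrete is a proportional rule that gives harder prompts more of a fixed sampling budget; this is an illustrative stand-in, not GVM-RAFT's exact criterion.

```python
def allocate_budget(scores, total_budget):
    """Split a fixed sampling budget across prompts in proportion to a
    per-prompt difficulty score (e.g. estimated gradient norm divided by
    acceptance rate). Simple proportional rule; assumes
    total_budget >= len(scores) so every prompt keeps at least 1 sample."""
    s = sum(scores)
    alloc = [max(1, int(total_budget * x / s)) for x in scores]
    while sum(alloc) < total_budget:   # hand out the rounding remainder
        i = max(range(len(scores)), key=lambda i: scores[i])
        alloc[i] += 1
    return alloc

# Hypothetical difficulty scores for three prompts.
alloc = allocate_budget([4.0, 1.0, 1.0], total_budget=12)
print(alloc)  # [8, 2, 2]
```

The intuition matches the paper's analysis: prompts whose gradient estimates are noisiest dominate the overall estimator variance, so they receive the largest share of new samples.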

5. Inference-Time Strategies and Prompt Selection

In models such as LaRS (Xu et al., 2023), the amortized inference network assigns latent skill vectors to queries and to banked examples, enabling similarity-based retrieval of supporting rationales without laborious hand-labeling or repeated LLM calls. Bayesian inference-scaling and best-of-$n$ marginal-likelihood approximations further enable cost-effective, robust answer selection in multimodal and autoregressive settings (Sun et al., 27 Oct 2025).
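Best-of-$n$ marginalization over rationales can be sketched as pooling answer mass across sampled chains; the sampler and answers below are hypothetical.

```python
import random
from collections import defaultdict

def best_of_n_answer(sample_chain, n=200):
    """Approximate marginalization over rationales: draw n CoT chains,
    pool the (unnormalized) weight of each final answer, and return the
    answer with the most mass. With uniform weights this reduces to
    self-consistency voting; weights could instead be chain likelihoods."""
    mass = defaultdict(float)
    for _ in range(n):
        answer, weight = sample_chain()
        mass[answer] += weight
    return max(mass, key=mass.get)

# Hypothetical sampler: 70% of chains reach "42", 30% a distractor "41".
random.seed(0)
chain = lambda: ("42", 1.0) if random.random() < 0.7 else ("41", 1.0)
print(best_of_n_answer(chain))  # 42
```

Selecting the answer with the largest pooled mass approximates $\arg\max_y p(y \mid x)$ even when no single rationale is reliable on its own.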

GeoSteer (Kazama et al., 15 Jan 2026) demonstrates that steering model activations in the latent manifold space increases both the coherence of intermediate steps and the exact-match metric on math reasoning benchmarks.

6. Empirical Results and Applications

A variety of datasets and models have validated these latent-variable techniques:

| Model | Dataset(s) | Relative Gains | Notable Features |
|---|---|---|---|
| TRICE | GSM8K, BBH | +5–10 pt over baselines | MCMC-EM, CV variance reduction |
| LaRS | TabMWP, GSM8K, Spider, COGS | Up to +6 pt over non-latent retrieval | CVAE skill modeling, fast ICL |
| MARCOS | GSM8K, SVAMP, MultiArith | +4.7 pt over strong SFT, 15× speedup | Continuous Markov chain of thoughts |
| GeoSteer | GSM8K | +2.6 pt exact match, +5.3 pairwise win | VAE manifold steering |
| GVM-RAFT | Math500, Minerva | 2–4× faster, +2–3 pt accuracy | Dynamic sample allocation |
| CTRLS | GSM8K, MATH | +10% in exploration/sample efficiency | Transition-aware latent RL |

Performance figures are quoted as reported in the cited sources; statistical confidence intervals can be found in the referenced works.

7. Significance and Outlook

Latent-variable-based CoT training provides a rigorous, flexible foundation for reasoning in LLMs, supporting principled marginalization, efficient fine-tuning absent gold rationales, and robust exploration of diverse reasoning trajectories. The framework accommodates a wide spectrum of practical instantiations—EM-MCMC, amortized inference, variational and GFlowNet losses, continuous-state Markov models, manifold steering, and on-policy RL. Recent advances resolve sampling and variance bottlenecks and yield statistically significant improvements in accuracy and efficiency across math, text, SQL, and multimodal reasoning benchmarks (Phan et al., 2023, Xu et al., 2023, Liu et al., 29 Sep 2025, Kazama et al., 15 Jan 2026, Yao et al., 5 May 2025, Sun et al., 27 Oct 2025, Wu et al., 10 Jul 2025, Tang et al., 25 Mar 2025, Wu et al., 23 Jun 2025).

A plausible implication is the emergence of new algorithmic paradigms unifying planning, inference, and learning across symbolic, continuous, and black-box latent spaces—potentially facilitating further advances in scalable, interpretable, and controllable multi-step reasoning under uncertainty.
