Latent-Variable Inference for Chain-of-Thought
- The paper formalizes LLM reasoning as latent-variable inference for chain-of-thought, enabling principled marginalization and robust fine-tuning.
- It details EM, variational, and RL algorithms that reduce gradient variance and improve sample efficiency in both language and vision-language models.
- Empirical results on benchmarks like GSM8K, BBH, and Math500 demonstrate significant accuracy gains and efficiency improvements over baseline methods.
Training chain-of-thought (CoT) via latent-variable inference refers to the probabilistic modeling paradigm in which the intermediate reasoning processes of LLMs are formalized as latent variables. This approach enables marginalization over unobserved rationales, principled amortized inference, and a range of EM, variational, and RL-based fine-tuning strategies for both language and vision-LLMs. Below, key theoretical and methodological facets are detailed, with representative models and empirical results.
1. Probabilistic Foundations: CoT as Latent-Variable Modeling
The central premise is to model the answer $y$ to an input $x$ as being mediated by a stochastic reasoning chain, or rationale, $z$. Formally, the generative model factorizes as
$$
p_\theta(y \mid x) = \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z),
$$
with $z$ ranging over all possible textual or structural rationales consistent with $x$. In practice, the log-marginal likelihood objective,
$$
\log p_\theta(y \mid x) = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z),
$$
is intractable, prompting the use of approximate inference: typically expectation-maximization (EM), variational inference, or their amortized, RL-augmented, or multi-sample variants (Phan et al., 2023, Tang et al., 25 Mar 2025, Xu et al., 2023, Sun et al., 27 Oct 2025, Yao et al., 5 May 2025, Wu et al., 10 Jul 2025, Liu et al., 29 Sep 2025).
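As a concrete illustration, the marginalization above can be simulated on a toy discrete rationale space. All numbers below are invented for illustration: `p_prior` plays the role of $p_\theta(z \mid x)$ and `p_ans` of $p_\theta(y \mid x, z)$.

```python
import math
import random

# Toy latent-variable CoT model with three discrete rationales.
p_prior = {"z1": 0.5, "z2": 0.3, "z3": 0.2}  # ~ p(z | x)
p_ans   = {"z1": 0.9, "z2": 0.4, "z3": 0.1}  # ~ p(y | x, z)

def exact_log_marginal():
    """log p(y | x) = log sum_z p(z | x) p(y | x, z)."""
    return math.log(sum(p_prior[z] * p_ans[z] for z in p_prior))

def mc_log_marginal(n_samples=50_000, seed=0):
    """Monte Carlo estimate: sample z ~ p(z | x), average p(y | x, z)."""
    rng = random.Random(seed)
    zs = rng.choices(list(p_prior), weights=list(p_prior.values()), k=n_samples)
    return math.log(sum(p_ans[z] for z in zs) / n_samples)

exact = exact_log_marginal()
approx = mc_log_marginal()
```

The gap between the exact and sampled estimates shrinks as the sample count grows; real models face the same estimator, but over an exponentially large space of token sequences, which is what makes the approximate-inference machinery below necessary.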
2. Training Algorithms: EM, Variational, and RL Approaches
Latent-variable CoT training bifurcates into several algorithmic strategies:
EM-based Algorithms
Classic EM-style solutions, as in TRICE (Phan et al., 2023), maintain per-example latent rationale chains and alternate between:
- E-step: approximate the posterior $p_\theta(z \mid x, y)$ via MCMC (e.g., Metropolis–Hastings using $p_\theta(z \mid x)$ as the proposal),
- M-step: update $\theta$ using expected gradients under the imputed rationales $z$.
Variance reduction is achieved by control-variate techniques, with the variance of the gradient estimator provably vanishing as the model improves (Phan et al., 2023).
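A minimal sketch of such an MH E-step on a toy discrete model (all probabilities invented; the fixed prior here mimics the role of the model's own rationale distribution as proposal):

```python
import random

p_prior = {"z1": 0.5, "z2": 0.3, "z3": 0.2}  # proposal ~ p(z | x)
p_lik   = {"z1": 0.9, "z2": 0.4, "z3": 0.1}  # likelihood ~ p(y | x, z)

def mh_posterior_samples(n_steps=20_000, seed=1):
    """Independence Metropolis-Hastings targeting p(z | x, y) with
    p(z | x) as the proposal; the prior cancels in the acceptance
    ratio, leaving only a likelihood ratio."""
    rng = random.Random(seed)
    zs = list(p_prior)
    ws = list(p_prior.values())
    z = rng.choices(zs, weights=ws, k=1)[0]
    counts = {k: 0 for k in zs}
    for _ in range(n_steps):
        z_new = rng.choices(zs, weights=ws, k=1)[0]
        accept = min(1.0, p_lik[z_new] / p_lik[z])
        if rng.random() < accept:
            z = z_new
        counts[z] += 1
    return {k: c / n_steps for k, c in counts.items()}

post = mh_posterior_samples()
```

Because proposal equals prior, the chain concentrates on rationales that explain the observed answer, which is exactly the behavior the E-step needs.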
Variational Inference
Variational techniques introduce a tractable inference distribution $q(z \mid x, y)$ to lower-bound $\log p_\theta(y \mid x)$: JEPO (Tang et al., 25 Mar 2025) simplifies this to Jensen's lower bound by setting $q(z \mid x, y) = p_\theta(z \mid x)$, yielding a pragmatic, scalable estimator without auxiliary inference networks. GFlowNet-based amortized inference for sequential or multimodal CoT is also adopted (Sun et al., 27 Oct 2025).
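The gap between Jensen's bound and the true log-marginal is easy to see numerically on a toy discrete model (values invented for illustration):

```python
import math

p_prior = {"z1": 0.5, "z2": 0.3, "z3": 0.2}  # ~ p(z | x)
p_lik   = {"z1": 0.9, "z2": 0.4, "z3": 0.1}  # ~ p(y | x, z)

def jensen_lower_bound():
    """E_{z ~ p(z|x)}[log p(y|x,z)] <= log p(y|x), by Jensen's inequality."""
    return sum(p_prior[z] * math.log(p_lik[z]) for z in p_prior)

def log_marginal():
    """Exact log p(y|x) via full marginalization over z."""
    return math.log(sum(p_prior[z] * p_lik[z] for z in p_prior))

lb, lm = jensen_lower_bound(), log_marginal()
```

The bound is loose when the likelihood varies sharply across rationales, which is one motivation for multi-sample and amortized refinements.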
Distributional and Reinforcement-Learning Extensions
CoT is further cast as a Markov Decision Process (MDP) with latent states (CTRLS) (Wu et al., 10 Jul 2025), or as a Markov chain of continuous thoughts (MARCOS) (Liu et al., 29 Sep 2025). On-policy RL and distributional RL (e.g., entropy-regularized policy gradient, epsilon-greedy exploration) are used to refine the latent transition dynamics, supporting diverse trajectory exploration and robust reasoning.
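A stripped-down version of an entropy-regularized policy-gradient update, on a two-arm "latent mode" bandit rather than a full MDP (rewards and temperature are invented; this illustrates only the update rule, not CTRLS itself):

```python
import math
import random

# Hypothetical rewards for two latent reasoning modes.
REWARDS = {0: 1.0, 1: 0.2}
TAU = 0.1  # entropy temperature

def softmax(logits):
    m = max(logits)
    exp = [math.exp(l - m) for l in logits]
    s = sum(exp)
    return [e / s for e in exp]

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices([0, 1], weights=probs, k=1)[0]
        # Entropy-regularized return: reward minus tau * log pi(a).
        g = REWARDS[a] - TAU * math.log(probs[a])
        # REINFORCE gradient of log pi(a) w.r.t. each logit: 1[a=i] - pi(i).
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * g * grad
    return softmax(logits)

probs = train()
```

The entropy term keeps a floor of probability on the weaker mode, which is the mechanism behind the "diverse trajectory exploration" claim: the policy converges toward a softened, rather than deterministic, optimum.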
3. Model Architectures and Inference
Amortized Encoders and Latent-Skill Modeling
In LaRS (Xu et al., 2023), rationales are embedded into a latent skill space via a conditional variational autoencoder (CVAE):
- Encoder: maps question–rationale pairs (Q, R) to latent skill vectors z.
- Prior/policy: predicts the distribution over z conditioned on the new input.
- Decoder: reconstructs rationales from (Q, z).
At inference time, the skill vector inferred from the test question steers example selection: it is matched against the skill vectors of the example bank via cosine similarity, yielding highly efficient and theory-grounded in-context learning pipelines.
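The retrieval step reduces to a nearest-neighbor search in skill space; a schematic version with hypothetical skill vectors (names and values invented, not LaRS outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical latent-skill vectors for a small example bank.
bank = {
    "ex_algebra":  [0.9, 0.1, 0.0],
    "ex_geometry": [0.1, 0.8, 0.2],
    "ex_counting": [0.0, 0.2, 0.9],
}

def select_examples(query_skill, k=2):
    """Rank banked examples by cosine similarity to the query's skill vector."""
    ranked = sorted(bank, key=lambda n: cosine(query_skill, bank[n]), reverse=True)
    return ranked[:k]

picked = select_examples([0.85, 0.15, 0.05])
```

Because the bank embeddings are computed once, each query costs only one encoder pass plus a similarity scan, avoiding repeated LLM calls.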
Latent Markov and Continuous-State Models
MARCOS (Liu et al., 29 Sep 2025) employs continuous latent thoughts transitioned via a “Thinker” transformer, with each emission of observable CoT steps mediated by a “Speaker” model. The approximate posterior over the latent chain is parameterized separately, with training consisting of both local pretraining (omitting the prior KL term) and global ELBO maximization.
Parallel architectures, such as PCCoT (Wu et al., 23 Jun 2025), further decouple sequential dependency via Jacobi iteration, yielding order-of-magnitude efficiency gains while maintaining accuracy.
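The Jacobi idea can be shown on a scalar recurrence standing in for the chain of latent thoughts: a sequential pass computes each position from its predecessor, while Jacobi sweeps update every position in parallel from the previous iterate and reach the same fixed point (the recurrence `g` is a toy stand-in, not the PCCoT model):

```python
def g(x):
    # Toy stand-in for one latent-thought update; the real model is a transformer.
    return x / 2 + 1

def sequential(n):
    """Left-to-right pass: each position depends on the one before it."""
    xs = [0.0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

def jacobi(n, sweeps):
    """Each sweep updates all positions at once from the previous iterate."""
    xs = [0.0] * (n + 1)
    for _ in range(sweeps):
        xs = [0.0] + [g(xs[i]) for i in range(n)]
    return xs

seq_out = sequential(4)
par_out = jacobi(4, 4)
```

Each parallel sweep propagates information one position forward, so n sweeps recover the n-step sequential result; the efficiency gain comes when convergence is reached in far fewer than n sweeps, with each sweep fully parallel across positions.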
Manifold and Quality-Based Latent Steering
GeoSteer (Kazama et al., 15 Jan 2026) maps hidden states of CoT segments into a low-dimensional VAE manifold, learns a differentiable segment-quality function, and applies latent gradient-based steering at each inference step. This approach aligns internal activations with regions of high-quality reasoning, preserving both geometric coherence and quantitative answer accuracy.
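The steering step amounts to gradient ascent on the learned quality function in latent space; a self-contained sketch with a hypothetical quadratic quality surrogate and finite-difference gradients (GeoSteer itself backpropagates through a learned network):

```python
def quality(z):
    # Hypothetical stand-in for the learned segment-quality function:
    # quality peaks at an invented "high-quality" latent point.
    target = [1.0, -0.5]
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))

def steer(z, steps=100, lr=0.1, eps=1e-5):
    """Move a latent vector toward higher quality via finite-difference ascent."""
    z = list(z)
    for _ in range(steps):
        for i in range(len(z)):
            zp = list(z); zp[i] += eps
            zm = list(z); zm[i] -= eps
            grad = (quality(zp) - quality(zm)) / (2 * eps)
            z[i] += lr * grad
    return z

steered = steer([0.0, 0.0])
```

Small step sizes keep the steered latent close to the VAE manifold, which is what preserves geometric coherence while nudging activations toward higher-quality regions.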
4. Evidence Lower Bounds, Gradient Estimation, and Variance Reduction
A central technical feature is the use of evidence lower bounds (ELBOs) and their optimization via stochastic gradient estimators. For instance:
- EM/ELBO gradient: by Fisher's identity, the key estimator computes
$$
\nabla_\theta \log p_\theta(y \mid x) = \mathbb{E}_{p_\theta(z \mid x, y)}\!\left[\nabla_\theta \log p_\theta(z, y \mid x)\right],
$$
with practical algorithms approximating this expectation via MCMC or variational sampling (Phan et al., 2023, Tang et al., 25 Mar 2025).
- Variance Reduction: Control-variate (score function) methods use baseline terms (e.g., acceptance rates or leave-one-out averages) to minimize estimator variance (Phan et al., 2023, Yao et al., 5 May 2025).
- Dynamic Sampling: GVM-RAFT (Yao et al., 5 May 2025) allocates sampling budgets adaptively using gradient-norm and acceptance-rate analysis, provably accelerating convergence for gradient-based EM and RL.
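The budget-allocation idea behind such dynamic sampling can be sketched independently of any specific gradient analysis: given a per-prompt difficulty score (e.g., an estimated gradient norm divided by acceptance rate), split a fixed sampling budget accordingly. The proportional rule below is a simplification for illustration, not the GVM-RAFT formula:

```python
def allocate_budget(scores, total):
    """Split a total sampling budget across prompts in proportion to a
    per-prompt difficulty score, guaranteeing at least one sample each."""
    s = sum(scores)
    raw = [total * x / s for x in scores]
    alloc = [max(1, int(r)) for r in raw]
    # Hand out any remainder to the largest fractional parts.
    while sum(alloc) < total:
        i = max(range(len(raw)), key=lambda j: raw[j] - alloc[j])
        alloc[i] += 1
    return alloc

# Hypothetical difficulty scores for four prompts and a budget of 16 samples.
budget = allocate_budget([4.0, 1.0, 1.0, 2.0], total=16)
```

Hard prompts receive more rollouts and easy ones fewer, concentrating compute where the gradient estimator is noisiest.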
5. Inference-Time Strategies and Prompt Selection
In models such as LaRS (Xu et al., 2023), the amortized inference network assigns latent skill vectors to queries and to banked examples, enabling similarity-based retrieval of supporting rationales without laborious hand-labeling or repeated LLM calls. Bayesian inference-scaling and best-of-$n$ marginal-likelihood approximations further enable cost-effective, robust answer selection in multimodal and autoregressive settings (Sun et al., 27 Oct 2025).
GeoSteer (Kazama et al., 15 Jan 2026) demonstrates that steering model activations in the latent manifold space increases both the coherence of intermediate steps and the exact-match metric on math reasoning benchmarks.
6. Empirical Results and Applications
A variety of datasets and models have validated these latent-variable techniques:
| Model | Dataset(s) | Relative Gains | Notable Features |
|---|---|---|---|
| TRICE | GSM8K, BBH | +5–10 pt over baselines | MCMC-EM, CV variance reduction |
| LaRS | TabMWP, GSM8K, Spider, COGS | Up to +6 pt over non-latent retrieval | CVAE skill modeling, fast ICL |
| MARCOS | GSM8K, SVAMP, MultiArith | +4.7 pt over strong SFT, 15× speedup | Continuous Markov chain of thoughts |
| GeoSteer | GSM8K | +2.6 pt exact match, +5.3 pairwise win | VAE manifold steering |
| GVM-RAFT | Math500, Minerva | 2–4× faster, +2–3 pt accuracy | Dynamic sample allocation |
| CTRLS | GSM8K, MATH | +10% in exploration/sample efficiency | Transition-aware latent RL |
All performance figures are quoted as reported in the cited sources; statistical confidence intervals can be found in the referenced works.
7. Significance and Outlook
Latent-variable-based CoT training provides a rigorous, flexible foundation for reasoning in LLMs, supporting principled marginalization, efficient fine-tuning absent gold rationales, and robust exploration of diverse reasoning trajectories. The framework accommodates a wide spectrum of practical instantiations—EM-MCMC, amortized inference, variational and GFlowNet losses, continuous-state Markov models, manifold steering, and on-policy RL. Recent advances resolve sampling and variance bottlenecks and yield statistically significant improvements in accuracy and efficiency across math, text, SQL, and multimodal reasoning benchmarks (Phan et al., 2023, Xu et al., 2023, Liu et al., 29 Sep 2025, Kazama et al., 15 Jan 2026, Yao et al., 5 May 2025, Sun et al., 27 Oct 2025, Wu et al., 10 Jul 2025, Tang et al., 25 Mar 2025, Wu et al., 23 Jun 2025).
A plausible implication is the emergence of new algorithmic paradigms unifying planning, inference, and learning across symbolic, continuous, and black-box latent spaces—potentially facilitating further advances in scalable, interpretable, and controllable multi-step reasoning under uncertainty.