Continuous CoT with Soft/Parallel Sampling
- Continuous chain-of-thought with soft/parallel sampling is a reasoning paradigm in LLMs where dense vectors represent parallel mixtures of multiple computation paths.
- It leverages superposition and parallel updates in transformer architectures to achieve exponential speedups in combinatorial search and graph traversal tasks.
- The approach integrates variational training, contrastive objectives, and reinforcement learning to enhance exploration, efficiency, and accuracy.
Continuous Chain-of-Thought with Soft/Parallel Sampling (Continuous CoT with Superposition)
Continuous chain-of-thought with soft or parallel sampling refers to a class of reasoning algorithms for LLMs in which each step of intermediate reasoning is represented as a dense, high-dimensional vector—not a discrete token. Instead of advancing a single, sampled reasoning trace at each step (as in classical chain-of-thought or CoT), the model evolves a “thought vector” representing a soft superposition or mixture over many candidate traces. This mechanism enables the model to explore multiple reasoning paths implicitly and in parallel, yielding large gains in efficiency, expressivity, and robustness. Rigorous theoretical analysis and empirical results demonstrate that such continuous superpositions confer exponential speedups in combinatorial search problems and unlock new scaling behaviors not achievable by discrete CoT methods (Zhu et al., 18 May 2025, Zhu et al., 27 Sep 2025, Gozeten et al., 29 May 2025).
1. Theoretical Foundation: Superposition and Information Packing
Continuous CoT fundamentally leverages the property that high-dimensional vector spaces can encode convex combinations of many discrete states simultaneously. Formally, at each reasoning step $t$, the model computes a distribution $\alpha_t$ over vocabulary items and forms a continuous token:

$$z_t = \sum_{i} \alpha_t(i)\, e_i,$$

where $e_i \in \mathbb{R}^d$ are the word embeddings. In this "superposition state," $z_t$ encodes, up to the information capacity of $\mathbb{R}^d$ and the geometry of $\{e_i\}$, a mixture of distinct reasoning paths (Gozeten et al., 29 May 2025).
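The convex-combination step above can be sketched in a few lines of NumPy (an illustrative toy, not any paper's released code): a softmax distribution over the vocabulary weights the embedding rows into a single continuous token.

```python
import numpy as np

# Illustrative sketch: form a continuous "superposition token" as the
# distribution-weighted mixture of word embeddings.
rng = np.random.default_rng(0)

V, d = 8, 16                      # vocabulary size, embedding dimension
E = rng.standard_normal((V, d))   # word-embedding matrix (rows = e_i)

logits = rng.standard_normal(V)
alpha = np.exp(logits) / np.exp(logits).sum()   # distribution over tokens

z = alpha @ E   # continuous token: convex combination of embeddings

# z lies in the convex hull of the embeddings, so its norm is bounded
# by the largest embedding norm.
assert np.isclose(alpha.sum(), 1.0)
assert np.linalg.norm(z) <= np.linalg.norm(E, axis=1).max() + 1e-9
```

Because `z` is a point in the convex hull of the rows of `E`, it can stand for many discrete tokens at once, which is the "information packing" the theory quantifies.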
A key theoretical result is that, given sufficient embedding dimension $d$, continuous CoT can robustly represent and expand all members of an exponentially large frontier, such as all vertices within distance $t$ in a graph (BFS frontiers) or all partial sums in subset-sum problems (Zhu et al., 18 May 2025, Gozeten et al., 29 May 2025). This is impossible for discrete CoT, which can only emit one path per time step.
For example, in directed graph reachability, the $t$-th continuous thought vector is provably equal to the normalized superposition state over all nodes reachable within $t$ steps:

$$z_t = \frac{1}{|R_t|} \sum_{v \in R_t} e_v,$$

with $R_t$ the set of vertices reachable within $t$ steps and $e_v$ the vertex embedding (Zhu et al., 18 May 2025). This allows the model to propagate all search frontiers in lockstep using parallel computation.
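The claimed superposition state can be computed directly for a small graph. The sketch below (our toy, assuming idealized one-hot vertex embeddings) builds the BFS reachable set and checks that the normalized superposition places equal weight on exactly the vertices reachable within $t$ steps.

```python
import numpy as np
from collections import deque

# Sketch: with one-hot vertex embeddings, the t-th continuous thought of an
# idealized model equals the normalized superposition of all vertices
# reachable within t steps. Here we compute that target state directly.
edges = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: [5], 5: []}
n = 6
E = np.eye(n)  # e_v = one-hot embedding of vertex v (idealized assumption)

def reachable_within(src, t):
    """Vertices reachable from src in at most t edge traversals (BFS)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for u in edges[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return {v for v, dv in dist.items() if dv <= t}

def superposition(src, t):
    R = reachable_within(src, t)
    return sum(E[v] for v in R) / len(R)

z2 = superposition(0, 2)
R2 = reachable_within(0, 2)
assert R2 == {0, 1, 2, 3, 4}
# Every reachable vertex carries equal weight 1/|R_2| in the thought vector.
assert np.allclose(z2[sorted(R2)], 1 / len(R2))
assert np.isclose(z2[5], 0.0)  # vertex 5 needs 3 steps, so weight 0
```

A single such vector therefore tracks the entire frontier; a discrete trace would need one emission per explored path.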
2. Algorithmic Realizations and Model Architectures
Autoregressive Superposition in Transformers
Canonical constructions involve two-layer transformers where, at every thought step, attention and MLP mechanisms are orchestrated to aggregate and propagate the desired superpositions (Zhu et al., 18 May 2025). The update rule at step $t$ is:
- Use attention to collect all outgoing edges/statuses from the current frontier $V_t$.
- Use the MLP to normalize and denoise, forming the next superposed frontier $V_{t+1} = V_t \cup \{u : (v, u) \in E,\ v \in V_t\}$, with thought vector $z_{t+1} \propto \sum_{v \in V_{t+1}} e_v$.
This mechanism is instantiated in models such as Coconut, where each continuous thought vector tracks the entire search frontier, enabling a parallel BFS that can solve reachability in $D$ steps for graph diameter $D$, instead of the $O(n^2)$ steps required by discrete CoT (Zhu et al., 18 May 2025, Zhu et al., 27 Sep 2025).
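The attention-then-MLP update can be emulated with plain linear algebra (a sketch under our own simplifications, not the paper's construction): multiplying the current superposition by the adjacency matrix plays the role of attention gathering out-neighbors, and a threshold-plus-renormalize step plays the role of the denoising MLP.

```python
import numpy as np

# Sketch of the two-layer update as linear algebra: "attention" gathers
# outgoing edges by multiplying with the adjacency matrix; the "MLP"
# denoises by thresholding and renormalizing the frontier indicator.
A = np.array([[0, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0]], dtype=float)

z = np.zeros(6)
z[0] = 1.0                     # start superposed on the source vertex only

for step in range(3):          # D parallel-BFS steps, D = graph diameter
    expanded = z + z @ A       # keep current frontier, add all out-neighbors
    frontier = (expanded > 0).astype(float)   # MLP-like denoise/threshold
    z = frontier / frontier.sum()             # renormalize the superposition

# After D = 3 steps, all 6 vertices are reached with equal weight.
assert np.allclose(z, 1 / 6)
```

Each loop iteration advances every path in the frontier simultaneously, which is why the serial step count scales with the diameter rather than the number of paths.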
Latent Markov Chains and Variational Training
MARCOS (Liu et al., 29 Sep 2025) introduces a Markov chain of continuous "thoughts" $z_1, z_2, \ldots$, each updated via a transition function $z_{t+1} = f(z_t, \epsilon_t)$ with randomness $\epsilon_t$. This framework decouples the reasoning process from token emission, eschewing the autoregressive token bottleneck, and enables parallel chains (batching over different initial random seeds) for soft/parallel sampling at the thought-vector level.
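A minimal sketch of this idea, with our own stand-in transition (the names and the `tanh` map are ours, not the MARCOS architecture): a shared transition function is applied to a batch of chains, and per-chain noise draws make the latent trajectories diverge without emitting any intermediate tokens.

```python
import numpy as np

# Hypothetical sketch of a latent Markov chain z_{t+1} = f(z_t, eps_t):
# batching over random draws yields parallel soft-sampled reasoning chains.
rng = np.random.default_rng(0)
d, T, n_chains = 8, 5, 4

W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in transition weights

def transition(z, eps):
    """One latent reasoning step: deterministic map plus injected noise."""
    return np.tanh(z @ W + eps)

z = np.zeros((n_chains, d))                   # n_chains parallel thoughts
for t in range(T):
    eps = 0.1 * rng.standard_normal((n_chains, d))  # per-chain randomness
    z = transition(z, eps)

# Chains share parameters but diverge through their noise draws.
assert z.shape == (n_chains, d)
assert not np.allclose(z[0], z[1])
```

Only after the latent chain terminates would a decoder map the final thought to discrete answer tokens, which is what removes the per-step autoregressive bottleneck.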
3. Training Paradigms for Soft/Parallel Sampling
Supervision and Reinforcement Learning
Continuous CoT models can be trained in several ways:
- Continuous Supervision: Match the model's internal distributions at each step to empirical distributions over states visited by the top-$k$ trajectories (CSFT), directly supervising soft exploration (Gozeten et al., 29 May 2025).
- Contrastive Objectives: Use InfoNCE or margin-based losses to maximize the diversity of parallel latent thoughts (e.g., SoftCoT++ (Xu et al., 16 May 2025)).
- Variational/ELBO: Incorporate per-step variational posteriors over the randomness controlling transitions in Markovian continuous chains (e.g., MARCOS (Liu et al., 29 Sep 2025)).
- Reinforcement Learning: Employ policy gradient over trajectories defined by continuous or noisy embeddings ("soft tokens"), using Gaussian noise for exploration. Reward can be applied to the final answer only or to intermediate reasoning steps with learned reward aggregators (Butt et al., 23 Sep 2025, You et al., 9 Oct 2025).
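The reinforcement-learning variant above can be illustrated with a deliberately tiny toy (our construction, not the cited setups): a continuous "soft token" is perturbed with Gaussian noise for exploration, and its mean is nudged toward perturbations that earn higher terminal reward, REINFORCE-style.

```python
import numpy as np

# Toy REINFORCE-style sketch for soft-token RL: Gaussian noise explores
# around a learnable continuous thought; reward applies to the final state
# only. Antithetic pairs (eps, -eps) reduce gradient-estimate variance.
rng = np.random.default_rng(0)
d, sigma, lr = 4, 0.5, 0.05
target = np.ones(d)                   # stand-in "correct answer" direction

def reward(z):
    return -float(np.sum((z - target) ** 2))   # terminal reward only

mu = np.zeros(d)                      # learnable mean of the soft token
for _ in range(500):
    eps = sigma * rng.standard_normal(d)
    g = (reward(mu + eps) - reward(mu - eps)) / 2 * eps / sigma**2
    mu += lr * g                      # stochastic policy-gradient step

assert reward(mu) > -0.5              # exploration improved the thought
```

The same scheme extends to intermediate-step rewards by replacing the terminal `reward` with a learned aggregator over per-step scores, as the cited work does.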
Soft and Parallel Sampling Mechanisms
- Sampling in Continuous Space: At each step, sample from the distribution over discrete tokens multiple times, combine them into superposed embeddings (multi-token sampling), or sample directly from a Dirichlet over logits for maximum entropy (Gozeten et al., 29 May 2025).
- Jacobi Iteration/Parallel Updates: Use Jacobi-style parallel updates of latent thought tokens, allowing all thoughts to be refined in parallel for improved efficiency and stability (PCCoT; (Wu et al., 23 Jun 2025)).
- Stochastic Forward Passes: Apply dropout (epistemic noise) or additive Gaussian noise (aleatoric noise) in the latent models at test time to sample diverse trajectories; aggregate via voting or learned reward models (LatentRM) (You et al., 9 Oct 2025).
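The stochastic-forward-pass recipe reduces, in miniature, to "noise in, many samples out, aggregate." The sketch below (a toy model of ours, not LatentRM) injects additive Gaussian noise into a latent, reads out a discrete answer, and aggregates hundreds of samples by majority vote.

```python
import numpy as np
from collections import Counter

# Sketch of test-time soft sampling with aggregation: noisy forward passes
# produce diverse answers; majority voting picks the consensus.
rng = np.random.default_rng(1)

def noisy_answer(x, sigma=0.4):
    """Stand-in for a stochastic forward pass: additive Gaussian noise on a
    latent, followed by a discrete readout."""
    latent = x + sigma * rng.standard_normal(x.shape)
    return int(round(latent.sum()))

x = np.array([1.0, 2.0, 0.9])        # noiseless readout: round(3.9) = 4
samples = [noisy_answer(x) for _ in range(200)]
answer, count = Counter(samples).most_common(1)[0]

assert answer == 4                   # the vote recovers the correct answer
```

Replacing the majority vote with a learned scorer over the sampled latents is the step that turns this into a reward-model-based aggregator.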
4. Practical Algorithms and Scaling Behavior
The table below summarizes core approaches to soft/parallel sampling in continuous CoT, their sampling strategies, and characteristic empirical outcomes.
| Model/Approach | Sampling Strategy | Scaling & Efficiency |
|---|---|---|
| Coconut (Zhu et al., 18 May 2025) | Superposed autoregressive | Exponential speedup in BFS; D steps for diameter D |
| CoT² (Gozeten et al., 29 May 2025) | Multi-token/Dirichlet | Many parallel paths; theory-guided tradeoff between embedding dimension and path count |
| PCCoT (Wu et al., 23 Jun 2025) | Jacobi parallel iteration | 2x faster training/inference, equal or better accuracy |
| MARCOS (Liu et al., 29 Sep 2025) | Latent Markov chain w/ batch sampling | 4.7% better accuracy, up to 15.7x faster vs discrete CoT |
| SoftCoT++ (Xu et al., 16 May 2025) | Multiple initial tokens + contrastive | +2.1pp accuracy over discrete TTS (100 chains) |
| LatentRM (You et al., 9 Oct 2025) | MC-dropout / AGN + learned scorer | ~2pt coverage improvement vs non-parametric voting |
Soft/parallel sampling generally reduces the number of serial reasoning steps required (from $O(n^2)$ to $D$ in graph tasks, and from hundreds to tens in complex arithmetic), and increases robustness and diversity as measured by pass@k or coverage@N metrics (Gozeten et al., 29 May 2025, You et al., 9 Oct 2025, Butt et al., 23 Sep 2025).
5. Empirical Findings and Scaling Laws
Empirical analyses reveal:
- Higher Pass@k and Diversity: Soft/fuzzy RL produces higher pass@32 than hard RL or discrete CoT. For instance, on GSM8K, soft training yields up to +3.8 points in pass@32 over discrete CoT (Butt et al., 23 Sep 2025).
- Efficiency in Step Count: Coconut solves directed graph reachability in $D$ steps (the graph diameter) rather than $O(n^2)$ (Zhu et al., 18 May 2025).
- Capacity-Dimension Tradeoff: For multi-token tasks, a sufficiently large embedding dimension achieves full path parallelism, matching the theoretical packing bound (Gozeten et al., 29 May 2025).
- Speed-Accuracy Tradeoff: MARCOS achieves up to 15.7x inference speedup compared to discrete CoT at comparable or better accuracy (Liu et al., 29 Sep 2025).
- Combining with Self-Consistency: SoftCoT++ demonstrates that “thinking-stage” diversity (in soft latent space) offers stronger gains when combined with discrete self-consistency at the reasoning/answering stage (Xu et al., 16 May 2025).
6. Training Dynamics and Natural Emergence of Superposition
Gradient-based training of continuous CoT models exhibits dynamics that favor the emergence of well-calibrated superpositions over plausible reasoning paths. Analytical results show that the “index-matching logit” governing the sharpness of the distribution over candidate reasoning traces naturally stabilizes at a positive, bounded value (unless forced to become deterministic by the loss), balancing exploration and exploitation (Zhu et al., 27 Sep 2025). This bounded logit produces distributed attention across multiple plausible next-steps, enabling robust superposition in the continuous thoughts even when trained to supervise only single paths.
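The effect of a bounded index-matching logit is easy to see numerically (a toy illustration of ours, not the paper's analysis): when $k$ plausible next-steps share a finite logit value $c$ against a background of zeros, softmax attention concentrates most of its mass on the matches while still spreading it across all of them.

```python
import numpy as np

# Numeric illustration: a bounded logit c on k matching candidates keeps
# softmax attention spread over all k matches instead of collapsing to one.
def attention_over_matches(c, k, n):
    logits = np.zeros(n)
    logits[:k] = c                    # k plausible next-steps share logit c
    return np.exp(logits) / np.exp(logits).sum()

w = attention_over_matches(c=3.0, k=4, n=20)
assert np.allclose(w[:4], w[0])       # equal weight on every match
assert w[:4].sum() > 0.8              # most mass lands on the matches...
assert w[0] < 0.5                     # ...but no single path dominates
```

Only as $c \to \infty$ would the distribution collapse onto a single path, which is exactly the deterministic regime the training dynamics avoid unless the loss forces it.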
Empirically, trained models exhibit sharp separation between unreachable, reachable, and optimal nodes in their internal representations, and attention maps display focused yet non-singular concentration on active search fronts (Zhu et al., 18 May 2025, Zhu et al., 27 Sep 2025).
7. Integration with Parallelization and Modern Inference Techniques
The parallel nature of soft sampling in continuous CoT naturally lends itself to test-time scaling and massive throughput. Methods such as PCCoT leverage Jacobi iteration to update all latent thoughts simultaneously, reducing training and inference time by nearly 50% at matched accuracy (Wu et al., 23 Jun 2025). Stochastic sampling (dropout, noise) is batched for hundreds of diverse trajectories per prompt, and aggregation is performed by learned reward models or majority voting (You et al., 9 Oct 2025, Butt et al., 23 Sep 2025).
Further, soft/fuzzy RL training preserves base-model performance on out-of-domain tasks, minimizing catastrophic forgetting—a challenge in full-model fine-tuning regimes (Butt et al., 23 Sep 2025). In deployment, models trained with soft/parallel sampling are compatible with standard LLM infrastructures since final answer emission can be performed discretely.
Continuous CoT with soft/parallel sampling generalizes and strictly dominates the classical discrete CoT paradigm on both theoretical and empirical grounds. By encoding and propagating reasoning as superpositions in continuous space, these models facilitate efficient implicit parallel search, gain substantial inference and coverage improvements, and provide a foundation for new scalable, hybrid, and RL-driven reasoning algorithms in LLMs (Zhu et al., 18 May 2025, Gozeten et al., 29 May 2025, Zhu et al., 27 Sep 2025, Liu et al., 29 Sep 2025, Wu et al., 23 Jun 2025, You et al., 9 Oct 2025, Xu et al., 16 May 2025, Butt et al., 23 Sep 2025).