Chain of Continuous Thought (Coconut)

Updated 15 March 2026

Chain of Continuous Thought (Coconut) is a continuous reasoning paradigm that replaces discrete token generation with a sequence of learned latent vectors.
It supports implicit parallelism by encoding superpositions of multiple hypotheses, enhancing efficiency in tasks like search and planning.
Variants such as PCCoT and SoftCoT++ leverage advanced training regimes to achieve faster inference and competitive accuracy versus traditional chain-of-thought methods.

Chain of Continuous Thought (Coconut) is a paradigm in LLMs and vision-LLMs that replaces explicit, autoregressive reasoning steps in the vocabulary space (“discrete chain-of-thought,” CoT) with a compact sequence of continuous latent vectors, called “continuous thought tokens.” This approach supports implicit parallelism, improves the efficiency of multi-step reasoning, and has theoretical and empirical advantages in tasks requiring search, planning, or inference beyond the language modality.

1. Definition and Foundational Mechanism

Chain of Continuous Thought (Coconut) refers to a reasoning architecture in which the LLM executes intermediate computation via latent “thought” states in the model’s continuous embedding space, instead of generating stepwise natural-language tokens. Given an input $x = (x_1, \ldots, x_n)$, Coconut appends a trainable begin-of-thought token <bot> and sequentially evolves $c$ continuous latent thought vectors, denoted $h_{n+1}, \ldots, h_{n+c}$:

$h_{n+1} = f([E_{x_1}; \ldots; E_{x_n}; E_{<bot>}])$

$h_{n+i+1} = f([E_{x_1}; \ldots; E_{x_n}; E_{<bot>}; h_{n+1}; \ldots; h_{n+i}])\,, \quad i = 1, \ldots, c-1$

where $f(\cdot)$ is the transformer body up to but not including the LM head, read out at the last position, and $E_{x_j}\in \mathbb{R}^d$ are learned embeddings. Each latent thought is fed back as the next input embedding rather than being decoded into a token. After constructing the latent chain, an <eot> token is appended, and answer tokens are generated autoregressively, conditioned on the input and the latent chain (Wu et al., 23 Jun 2025, Hao et al., 2024).
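The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `toy_body` is a hypothetical stand-in for the transformer body $f$, the vectors are tiny hand-picked lists, and keeping the <bot> embedding in the prefix at every step is an assumption.

```python
import math

def toy_body(seq):
    """Hypothetical stand-in for the transformer body f.

    Takes a list of d-dimensional vectors and returns the hidden state
    at the last position. It is NOT a real model, only a deterministic
    causal map that lets the feedback loop be run end to end.
    """
    d = len(seq[0])
    # Crude "attention": uniform average over every position seen so far.
    mean = [sum(v[k] for v in seq) / len(seq) for k in range(d)]
    # Mix in the most recent vector so successive thoughts differ.
    return [math.tanh(mean[k] + 0.5 * seq[-1][(k + 1) % d]) for k in range(d)]

def continuous_thoughts(input_embeds, bot_embed, c, body=toy_body):
    """Return the c latent thoughts h_{n+1}, ..., h_{n+c}."""
    seq = list(input_embeds) + [bot_embed]
    thoughts = []
    for _ in range(c):
        h = body(seq)       # last hidden state, never mapped to the vocabulary
        thoughts.append(h)
        seq.append(h)       # fed back as the next input embedding
    return thoughts

# Illustrative inputs: two "token embeddings" and a <bot> embedding.
E_x = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]
E_bot = [1.0, 0.0, 0.0, 0.0]
hs = continuous_thoughts(E_x, E_bot, c=3)
```

The point of the sketch is the loop structure: each thought depends on all earlier thoughts, so the chain is inherently sequential unless a scheme such as Jacobi iteration is used.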

By never mapping intermediate states back to the vocabulary, Coconut enables the model to internally represent complex, multi-path, or parallel reasoning steps within a small number of continuous tokens, avoiding the inefficiency and information loss inherent in discrete CoT (Gozeten et al., 29 May 2025).

2. Theoretical Properties and Superposition

The principal theoretical advance of Coconut is that each continuous thought vector can encode a superposition over many possible next-step hypotheses, enabling implicit parallel search. In directed graph reachability, this superposition allows a two-layer transformer to solve the problem in $O(D)$ continuous steps, where $D$ is the graph diameter, whereas the best known constructions with discrete CoT require $O(n^2)$ steps for $n$ vertices (Zhu et al., 18 May 2025).

Concretely, for a reasoning frontier $\mathcal{V}_i$ (all vertices reachable within $i$ hops), the latent token at step $i$ is

$h_{n+i} = \frac{1}{\sqrt{|\mathcal{V}_i|}} \sum_{v \in \mathcal{V}_i} u_v$

where $u_v$ are orthonormal content embeddings. This vector encodes every vertex of the current frontier at once, supporting implicit parallel BFS. Empirical probes show that trained Coconut models realize such superpositional encodings without explicit supervision: inner products $\langle h_{n+i}, u_v \rangle$ are highest for nodes at the current search frontier, with a hierarchy for nodes on optimal or merely reachable paths (Zhu et al., 18 May 2025, Zhu et al., 27 Sep 2025).
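To make the frontier superposition concrete, the following sketch runs the idea on a small directed graph. It is a hand-written update rule under stated assumptions, not the learned transformer from the cited work: vertex embeddings are taken to be one-hot (hence orthonormal), the superposition is uniform over the frontier, and the function names are hypothetical.

```python
import math

def superposition_step(h, edges, n):
    """One continuous-thought step with one-hot vertex embeddings.

    The support of h is the current frontier. The step keeps the
    frontier, adds every out-neighbour, and renormalises to a uniform
    unit-norm superposition over the enlarged frontier.
    """
    support = {v for v in range(n) if h[v] > 0}
    support |= {t for (s, t) in edges if s in support}
    w = 1.0 / math.sqrt(len(support))
    return [w if v in support else 0.0 for v in range(n)]

def reachable(edges, n, root, target, steps):
    """Run `steps` superposition updates from `root`, then probe `target`."""
    h = [1.0 if v == root else 0.0 for v in range(n)]
    for _ in range(steps):
        h = superposition_step(h, edges, n)
    # With one-hot embeddings the inner product <h, u_target> is h[target].
    return h[target] > 0, h

# Vertex 4 is three hops from vertex 0; vertex 5 is isolated.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
found, h = reachable(edges, n=6, root=0, target=4, steps=3)
```

Each step expands all frontier vertices simultaneously, so the number of steps needed is bounded by the distance to the target rather than by the number of edges explored one at a time.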

3. Training Regimes, Optimization, and Variants

Coconut is typically instantiated via a staged curriculum: start with standard CoT supervised fine-tuning, then progressively replace language CoT steps with continuous thoughts, masking the loss on the question and latent steps and computing cross-entropy only over the remaining language tokens (any reasoning steps not yet replaced, plus the answer). This stabilizes training and allows the model to bootstrap from explicit traces (Hao et al., 2024, Wu et al., 23 Jun 2025).
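The staged curriculum can be illustrated by how a single training sequence and its loss mask change with the stage. The sketch below is an assumption-laden illustration: the function name, the `<latent>` placeholder, the `thoughts_per_step` parameter, and the choice to leave `<bot>` and `<eot>` unsupervised are all hypothetical rather than taken from the cited papers.

```python
def build_stage_example(question, cot_steps, answer, stage, thoughts_per_step=1):
    """Build one training sequence for a given curriculum stage.

    At stage k the first k language reasoning steps are replaced by
    k * thoughts_per_step latent slots; later steps stay as text.
    Returns (tokens, loss_mask), where loss_mask[i] is True only for
    positions that contribute to the cross-entropy loss.
    """
    k = min(stage, len(cot_steps))
    latent = ["<latent>"] * (k * thoughts_per_step)
    # Flatten the reasoning steps that are still expressed in language.
    remaining = [tok for step in cot_steps[k:] for tok in step]
    tokens = question + ["<bot>"] + latent + ["<eot>"] + remaining + answer
    # Question, <bot>, latent slots and <eot> are masked out (assumption).
    n_masked = len(question) + 1 + len(latent) + 1
    loss_mask = [False] * n_masked + [True] * (len(remaining) + len(answer))
    return tokens, loss_mask

question = ["What", "is", "2+3*4", "?"]
cot_steps = [["3*4", "=", "12"], ["2+12", "=", "14"]]
answer = ["14"]

stage0 = build_stage_example(question, cot_steps, answer, stage=0)  # plain CoT
stage1 = build_stage_example(question, cot_steps, answer, stage=1)  # one step latent
stage2 = build_stage_example(question, cot_steps, answer, stage=2)  # fully latent
```

At stage 0 the example is ordinary CoT supervision; by the final stage only the answer tokens carry loss, which is what lets the model bootstrap from explicit traces before relying on latent thoughts alone.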

Alternatively, self-distillation approaches such as CODI align the hidden state associated with a special token (e.g., the colon before the answer) between an explicit CoT (teacher) and the continuous CoT (student), using an L1 loss in feature space across layers. Because continuous CoT needs only a small number of latent steps, this substantially compresses the reasoning trace while preserving accuracy and improving robustness on both in-distribution and OOD benchmarks (Shen et al., 28 Feb 2025).
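The feature-space alignment can be written as a small loss function. The sketch below assumes the loss is a mean absolute difference at one aligned token position, averaged over layers; the exact normalization and any per-layer weighting used by CODI are not given here, and the function name is hypothetical.

```python
def l1_distill_loss(teacher_layers, student_layers):
    """Layer-averaged L1 distance between teacher and student hidden states.

    Each argument is a list with one d-dimensional vector per layer,
    taken at the aligned special-token position. The teacher values are
    treated as fixed targets.
    """
    if len(teacher_layers) != len(student_layers):
        raise ValueError("teacher and student must expose the same layers")
    per_layer = []
    for t_vec, s_vec in zip(teacher_layers, student_layers):
        # Mean absolute difference over the hidden dimension.
        per_layer.append(sum(abs(t - s) for t, s in zip(t_vec, s_vec)) / len(t_vec))
    return sum(per_layer) / len(per_layer)

# Two layers, hidden size 2 (illustrative numbers only).
teacher = [[1.0, 2.0], [3.0, 4.0]]
student = [[1.0, 1.0], [2.0, 6.0]]
loss = l1_distill_loss(teacher, student)
```

In this toy case the per-layer distances are 0.5 and 1.5, so the loss is 1.0; it reaches zero only when the student reproduces the teacher's hidden state at every layer.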

Further, SoftCoT++ extends continuous CoT by learning diversified latent reasoning chains at test-time: by using specialized initial tokens and a contrastive loss, SoftCoT++ simulates the diversity of self-consistency in discrete CoT, outperforming both single-sample and test-time scaled variants (Xu et al., 16 May 2025).

4. Parallelization: Jacobi Iteration and Efficiency

A major computational bottleneck of basic Coconut is its sequential decoding of latent thought vectors, which prevents parallelism across the latent chain during training and inference. Parallel Continuous CoT (PCCoT) circumvents this by jointly updating all $c$ latent tokens using Jacobi iteration:

$H^{(t+1)} = F(x, H^{(t)})$

where $H^{(t)} = (h^{(t)}_{n+1}, \ldots, h^{(t)}_{n+c})$ is the block of latent tokens after $t$ iterations and $F$ is the transformer applied to the input and the current block of latent tokens. Empirically, $T = 3$ Jacobi iterations suffice to match or exceed sequential Coconut’s accuracy while nearly halving training and inference time (e.g., 13.7 h vs. 24.9 h of training on GSM8K-Aug, and 49.5% accuracy for PCCoT vs. 48.2% for sequential Coconut), with improved stability and lower run-to-run variance (Wu et al., 23 Jun 2025).

Table: Computational Cost and Accuracy (GSM8K-Aug, GPT-2 Small)

Method	Train Time (h)	Inference Time (s/batch)	Accuracy (%)
Discrete CoT	—	—	44.1
Continuous CoT (c=24)	24.9	0.443	48.2
PCCoT (c=24, T=3)	13.7	0.199	49.5
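The difference between sequential decoding and Jacobi iteration can be seen on a toy causal map. This is a sketch under stated assumptions: `g` is a hypothetical scalar stand-in for one position of a transformer forward pass, the zero initialization of the latent block is an arbitrary choice, and nothing here reproduces the accuracy or timing figures in the table.

```python
import math

def g(prefix, prev):
    """Hypothetical causal map: next latent from the prefix and earlier latents."""
    ctx = list(prefix) + list(prev)
    return math.tanh(sum(ctx) / len(ctx) + 0.3 * ctx[-1])

def sequential(prefix, c):
    """Baseline: c dependent steps, each waiting for the previous one."""
    h = []
    for _ in range(c):
        h.append(g(prefix, h))
    return h

def jacobi(prefix, c, T):
    """T Jacobi sweeps over the whole block of c latents."""
    h = [0.0] * c  # initial guess for every latent (assumption)
    for _ in range(T):
        # Every position reads only the previous iterate, so in a real
        # model all c updates would come from one parallel forward pass.
        h = [g(prefix, h[:i]) for i in range(c)]
    return h

prefix = [0.2, -0.1, 0.4]
exact = sequential(prefix, c=6)
approx = jacobi(prefix, c=6, T=3)   # 3 parallel sweeps instead of 6 steps
full = jacobi(prefix, c=6, T=6)     # c sweeps recover the sequential chain
```

After $t$ sweeps the first $t$ latents already equal their sequential values, and $c$ sweeps reproduce the sequential chain exactly; the practical gain comes from stopping at a small $T$ and accepting approximate values for the later latents.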

5. Causal Analysis and Limitations

Recent adversarial and causal studies challenge the assumption that latent tokens always capture explicit multi-step reasoning (Zhang et al., 25 Dec 2025, Li et al., 9 Feb 2026). Steering (causal) interventions—perturbing or swapping latent tokens—have minimal effect on the final answer compared to explicit CoT, with perturbation success rates (PSR) of 0–10% for Coconut versus up to 50–60% for discrete CoT. Shortcut tests (inducing option bias or spurious context) show that Coconut-trained models can exploit dataset artifacts, inflating benchmark performance without performing genuine stepwise reasoning.
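A perturbation test of this kind can be expressed as a short measurement loop. The sketch assumes that PSR is the fraction of examples whose final answer changes after the latent tokens are perturbed; the cited studies may define it more narrowly (for example, requiring a change toward a specific target), and both toy models below are hypothetical.

```python
def perturbation_success_rate(model, examples, perturb):
    """Fraction of examples whose answer changes when latents are perturbed.

    `model(question, latents)` returns a final answer, and `perturb`
    maps a latent chain to an intervened latent chain.
    """
    changed = 0
    for question, latents in examples:
        baseline = model(question, latents)
        if model(question, perturb(latents)) != baseline:
            changed += 1
    return changed / len(examples)

def ignores_latents(question, latents):
    """Toy model whose answer never depends on the latent chain."""
    return question % 3

def uses_latents(question, latents):
    """Toy model whose answer is causally driven by the latent chain."""
    return (question + round(sum(latents))) % 3

examples = [(q, [1.0, 2.0]) for q in range(10)]
shift = lambda latents: [v + 1.0 for v in latents]

psr_ignoring = perturbation_success_rate(ignores_latents, examples, shift)
psr_using = perturbation_success_rate(uses_latents, examples, shift)
```

A rate near zero, as for `ignores_latents`, indicates that the intervened tokens are not on the causal path to the answer, which is the pattern the studies above report for Coconut.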

Causal-structure studies using step-wise do-interventions reveal that only a subset of latent steps are causally necessary, with influence routing non-locally; often, early steps directly affect the final answer, creating “skip links” instead of uniform depth. Latent chains also preserve a superposition of competing answer modes up to the final step—output-level commitment can occur earlier than representational commitment (Li et al., 9 Feb 2026).

6. Extensions: Markov, Multimodal, and Policy-Optimized Architectures

MarCos generalizes Coconut by integrating a hidden Markov chain structure: latent thoughts transition via learned stochastic dynamics, with explicit emissions as observable rationales. This decouples token-level emission from “deep thought” evolution, enabling step-level control of randomness and a substantial inference speedup, while matching or surpassing discrete CoT accuracy on GSM8K (+4.7%) (Liu et al., 29 Sep 2025).

In vision-LLMs, the MCOUT framework enables iterative reasoning in a multimodal latent space by updating a continuous thought vector via multimodal attention. This approach achieves significant gains on MMMU, ScienceQA, and MMStar (up to +8.23% accuracy; +8.27% BLEU), outperforming larger discrete CoT-based VLMs on these benchmarks, and is robust to open-ended and multi-step inference requirements (Pham et al., 18 Aug 2025).

Continuous CoT can further be enhanced by direct continuous supervision and policy optimization strategies (CoT2). By matching intermediate continuous outputs to the token distribution of a set of top-ranked target traces, and by using sampling schemes (MTS, Dirichlet), models achieve provable parallelism, superior sample efficiency (each CoT2-MTS rollout is equivalent to multiple discrete rollouts), and accuracy improvements for tasks with substantial combinatorial search (Gozeten et al., 29 May 2025).
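The multi-token sampling (MTS) idea can be sketched as follows, assuming it draws several discrete tokens from the model's next-token distribution and averages their embeddings into one continuous token. The function name, the sample count `K`, and the toy vocabulary are hypothetical, and the exact composition rule in CoT2 may differ.

```python
import random

def mts_token(probs, embeddings, K, rng):
    """Compose one continuous token from K sampled discrete tokens.

    `probs` is a next-token distribution over the vocabulary and
    `embeddings[i]` is the embedding of token i. Returns the averaged
    embedding together with the sampled token ids.
    """
    ids = rng.choices(range(len(probs)), weights=probs, k=K)
    d = len(embeddings[0])
    vec = [sum(embeddings[i][j] for i in ids) / K for j in range(d)]
    return vec, ids

# Toy vocabulary of 4 tokens with one-hot embeddings.
embeddings = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
probs = [0.4, 0.3, 0.2, 0.1]
rng = random.Random(0)  # seeded so the rollout is reproducible

soft_vec, soft_ids = mts_token(probs, embeddings, K=4, rng=rng)
hard_vec, hard_ids = mts_token(probs, embeddings, K=1, rng=rng)
```

With `K=1` the result is a single token embedding, i.e. ordinary discrete sampling, while larger `K` lets one rollout carry several candidate tokens at once, which is the intuition behind the sample-efficiency claim.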

7. Broader Implications and Open Questions

Coconut and its extensions suggest that continuous latent chains enable LLMs to represent and explore multiple reasoning hypotheses in parallel, obviating the inefficiency of discrete token-generation for long chains or for multi-modal reasoning (Zhu et al., 18 May 2025, Wu et al., 23 Jun 2025, Pham et al., 18 Aug 2025). However, pseudo-reasoning and shortcut exploitation remain risks; diagnostic protocols integrating causal intervention and mode-conditional stability analyses are advocated to assess interpretability and faithful reasoning (Zhang et al., 25 Dec 2025, Li et al., 9 Feb 2026).

Open problems include designing supervision and bottleneck strategies to guarantee that latent chains encode truly multi-step logical computation, extending continuous CoT to open-domain and multi-agent planning tasks, and developing robust mechanisms for dynamic allocation of reasoning depth and parallelism within the latent space (Hao et al., 2024, Liu et al., 29 Sep 2025, Gozeten et al., 29 May 2025).

In summary, Chain of Continuous Thought marks a class of frameworks in which LLMs reason via low-bandwidth, expressive latent trajectories, theoretically supporting superposition, parallelism, and efficient policy optimization. As both theoretical and empirical investigations advance, understanding the causal and representational dynamics within such chains remains a critical frontier for trustworthy, scalable reasoning in foundation models.