The paper introduces CODI (Continuous Chain-of-Thought via Self-Distillation), a framework designed to compress the chain-of-thought reasoning process into a continuous space, with the goal of improving efficiency while maintaining performance. CODI employs a self-distillation approach, where a shared model acts as both teacher and student, learning explicit and implicit CoT jointly, and aligning their hidden activations during the generation of the final answer.
The central idea is to move away from discrete, natural language representations of CoT, which may not be optimal for reasoning, towards dense, continuous representations. The paper highlights that while implicit CoT methods exist, they generally underperform explicit CoT. CODI addresses this gap by distilling knowledge from explicit CoT (teacher) to implicit CoT (student) within the same model.
The CODI framework involves two main tasks:
- Teacher Task: The teacher learns to generate explicit CoTs with a standard language-modeling objective, exposing the model to structured reasoning patterns. The loss function is defined as:

  $$\mathcal{L}_{\text{teacher}} = -\sum_{t} \log P_{\theta}\!\left(y_t \mid y_{<t}, Q\right)$$

  where:
  - $P_{\theta}$ is the probability distribution of the LLM,
  - $y$ refers to both the CoT and the answer labels,
  - $Q$ refers to the question tokens.
- Student Task: The student learns to generate continuous thoughts by autoregressively propagating hidden states and then predicts the final answer. The loss function is defined as:

  $$\mathcal{L}_{\text{student}} = -\sum_{t} \log P_{\theta}\!\left(y_t \mid y_{<t}, Z, Q\right)$$

  where:
  - $P_{\theta}$ is the probability distribution of the LLM,
  - $y$ refers to the answer label,
  - $Q$ refers to the question tokens,
  - $Z$ refers to the continuous thoughts.
The student task uses special tokens, `bot` and `eot`, to mark the start and end of continuous reasoning. A two-layer multi-layer perceptron (MLP) with layer normalization transforms the hidden representation of each continuous-thought token before it is fed back as the next input embedding; a minimal sketch of this loop is given below.
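To make the student's rollout concrete, the following PyTorch-style sketch shows how continuous thoughts could be generated: the last hidden state of each step is projected by a two-layer MLP with layer normalization and appended as the next input embedding. The class and function names (`ThoughtProjector`, `generate_continuous_thoughts`), the GELU activation, and the argument names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ThoughtProjector(nn.Module):
    """Two-layer MLP with layer normalization that maps a hidden state back into an
    input embedding. The paper specifies a two-layer MLP + layer norm; the GELU
    activation and equal hidden sizes here are assumptions."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
            nn.LayerNorm(hidden_size),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def generate_continuous_thoughts(model, embed_fn, projector, question_ids, bot_id, num_thoughts=6):
    """Autoregressively roll out `num_thoughts` continuous thoughts after a `bot` token.

    `model` is assumed to be a decoder-only LM that accepts `inputs_embeds` and returns
    hidden states (e.g., a Hugging Face causal LM); `embed_fn` maps token ids to input
    embeddings. The number of thoughts is a hyperparameter.
    """
    # Embed the question followed by the `bot` marker.
    bot = torch.full((question_ids.size(0), 1), bot_id, dtype=torch.long, device=question_ids.device)
    inputs = embed_fn(torch.cat([question_ids, bot], dim=1))

    thoughts = []
    for _ in range(num_thoughts):
        out = model(inputs_embeds=inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # hidden state at the latest position
        thought = projector(last_hidden)                # continuous thought
        thoughts.append(thought)
        inputs = torch.cat([inputs, thought], dim=1)    # feed it back as the next input embedding

    # The caller appends the `eot` marker and the answer prompt, then computes the answer loss.
    return inputs, thoughts
```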
Knowledge distillation is achieved by aligning the hidden activations of a key token between the teacher and student tasks. Specifically, the hidden activation of the token immediately preceding the answer (e.g., the colon in "The answer is:") is used, because this token is believed to encode the crucial reasoning information. The alignment is enforced using an L1 loss:

$$\mathcal{L}_{\text{KD}} = \frac{1}{N} \sum_{l=1}^{N} \left\lVert \operatorname{sg}\!\left[h^{l}_{\text{teacher}}\right] - h^{l}_{\text{student}} \right\rVert_{1}$$

where:
- $N$ is the number of layers in the LLM,
- $\operatorname{sg}$ denotes stop gradient,
- $h^{l}$ is the hidden activation of the LLM's $l$-th layer at the selected token.
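A minimal sketch of this distillation term, assuming `teacher_hiddens` and `student_hiddens` are per-layer hidden states already gathered at the position of the token preceding the answer in the teacher and student passes (the function name and the mean-style L1 reduction are assumptions):

```python
import torch

def codi_kd_loss(teacher_hiddens, student_hiddens):
    """L1 alignment of per-layer hidden activations at the selected token.

    Each list element is a tensor of shape (batch, hidden_size) taken at the position
    of the token right before the answer. The teacher side is detached (stop-gradient)
    so that only the student is pulled toward the teacher.
    """
    num_layers = len(teacher_hiddens)
    loss = 0.0
    for h_teacher, h_student in zip(teacher_hiddens, student_hiddens):
        loss = loss + torch.abs(h_teacher.detach() - h_student).mean()
    return loss / num_layers
```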
The overall training objective is a weighted sum of the teacher loss, student loss, and knowledge distillation loss:

$$\mathcal{L} = \alpha \, \mathcal{L}_{\text{teacher}} + \beta \, \mathcal{L}_{\text{student}} + \gamma \, \mathcal{L}_{\text{KD}}$$

where $\alpha$, $\beta$, and $\gamma$ are hyperparameters.
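Glue code for the combined objective might look like the following, reusing `codi_kd_loss` from the sketch above; the helper name, the assumption that both forward passes return Hugging Face-style outputs with `.loss` and `.hidden_states`, and the default weights are all illustrative:

```python
def combine_codi_losses(teacher_out, student_out, teacher_pos, student_pos,
                        alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the teacher, student, and distillation losses.

    `teacher_pos` / `student_pos` index the token immediately preceding the answer in
    each pass; `teacher_out.loss` and `student_out.loss` are the cross-entropy losses
    of the two tasks (labels are assumed to have been supplied to the forward passes).
    """
    teacher_h = [h[:, teacher_pos, :] for h in teacher_out.hidden_states]
    student_h = [h[:, student_pos, :] for h in student_out.hidden_states]
    kd = codi_kd_loss(teacher_h, student_h)
    return alpha * teacher_out.loss + beta * student_out.loss + gamma * kd
```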
The paper provides a theoretical justification for aligning the hidden activations of the selected token. Drawing upon observations from in-context learning, the authors posit that CoT tokens induce a shift in the hidden activation values of the target token. This "CoT shift" is formalized as:

$$h^{l}_{\text{cot}} = h^{l} + W_V R \, f\!\left((W_K R)^{\top} q\right)$$

where:
- $q$ is the query of this target token,
- $h^{l}_{\text{cot}}$ is the hidden activation at layer $l$ with CoT (equivalent to $h^{l}_{\text{teacher}}$),
- $h^{l}$ is the corresponding activation without CoT,
- $R$ is the CoT rationale,
- $W_V$ is the model's value parameters,
- $W_K$ is the model's key parameters,
- $f$ is a non-linear function.
This suggests that the target token's hidden activation encodes the influence of preceding reasoning steps, and the student can learn this shift by minimizing the L1 distance with the teacher's hidden activation.
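One way to see where this form comes from (a sketch following the linear-attention view of in-context learning that the paper draws on, not the paper's exact derivation): the attention output at the target token over the concatenation of the question tokens $X$ and the rationale $R$ approximately decomposes into a question-only term and a CoT term,

$$W_V [X; R]\, f\!\big((W_K [X; R])^{\top} q\big) \;\approx\; \underbrace{W_V X\, f\!\big((W_K X)^{\top} q\big)}_{\text{no-CoT activation } h^{l}} \;+\; \underbrace{W_V R\, f\!\big((W_K R)^{\top} q\big)}_{\text{CoT shift}}$$

where the approximation treats $f$ as roughly additive over the two blocks of keys, the same relaxation used in the in-context-learning analysis.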
The paper details experiments conducted on mathematical reasoning tasks, using the GSM8k dataset and its variants. The results show that CODI achieves performance comparable to explicit CoT methods while also demonstrating efficiency gains through compression. CODI achieves a 3.1x compression ratio and is the first implicit CoT method to match explicit CoT's performance on GSM8k, surpassing the previous state of the art by 28.2% in accuracy. The method is also shown to be robust, scalable, and generalizable to more complex CoT datasets.
Furthermore, the paper explores the interpretability of CODI by decoding its continuous thoughts and analyzing the attended tokens. The authors find that CODI can produce observable intermediate results within its continuous thoughts.
Ablation studies validate the design choices in CODI, including the use of a shared model for the teacher and student tasks, the importance of the distillation loss, and the impact of excluding the final step of the CoT chain during training.