- The paper introduces LightThinker, which compresses LLM thought chains to reduce token usage while maintaining competitive accuracy.
- It employs hidden state compression and specialized attention masks to decouple the compression of prior thoughts from the generation of new content.
- Experiments demonstrate up to a 70% reduction in peak token usage and 26% faster inference, with only a marginal drop in accuracy.
The paper introduces LightThinker, a method designed to improve the efficiency of LLMs in complex reasoning tasks by dynamically compressing intermediate thought chains. The approach is motivated by the observation that LLM-generated tokens serve dual purposes: ensuring linguistic fluency and facilitating actual reasoning. LightThinker compresses verbose thought steps into compact representations, discarding the original reasoning chains to reduce the number of tokens stored in the context window, thereby lowering memory overhead and computational costs.
The method involves training the LLM to learn when and how to compress. This is achieved through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. The paper introduces the Dependency (Dep) metric to quantify compression by measuring the reliance on historical tokens during generation.
Here's a breakdown of the key components and findings:
- Background: The paper discusses the evolution of LLM reasoning from "fast thinking" to "slow thinking," exemplified by Chain-of-Thought (CoT) prompting and o1-like thinking modes. It highlights the computational challenges posed by the Transformer architecture, where the attention mechanism's complexity grows quadratically with context length, and the KV Cache's storage overhead increases linearly.
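To make the KV Cache overhead concrete, here is a back-of-the-envelope estimate; the model dimensions below (32 layers, 8 grouped-query KV heads, head dimension 128, fp16) are illustrative assumptions, not figures from the paper.

```python
# Rough KV cache size for a Llama-3.1-8B-like configuration (assumed: 32 layers,
# 8 KV heads via grouped-query attention, head_dim 128, fp16 storage).
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

def kv_cache_bytes(context_len: int) -> int:
    # 2x for keys and values, stored for every layer and KV head at every position.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (2_000, 8_000, 32_000):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
# The cache grows linearly with context length, while attention FLOPs grow quadratically,
# which is the overhead LightThinker targets by keeping the live context short.
```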
- Methodology - LightThinker: The core idea is to train LLMs to dynamically compress the current thought during reasoning, enabling subsequent generation to be based on the compressed content rather than the original long thought. This involves two key questions:
- When to compress? The paper explores token-level (compressing after a fixed number of tokens) and thought-level compression (compressing after a complete "thought").
- How to compress? The paper investigates text compression (encoding the current thought as shorter text) and hidden state compression (compressing the hidden states of the current thought into the hidden states of a few special tokens, i.e., gist tokens). LightThinker adopts hidden state compression because it requires no auxiliary model. (A minimal segmentation sketch for the two trigger strategies follows below.)
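As a minimal illustration of the two trigger strategies, here is a sketch of token-level versus thought-level segmentation; the chunk size and the blank-line delimiter are assumptions for illustration, not the paper's actual Seg() implementation.

```python
def seg_token_level(token_ids: list[int], chunk: int = 64) -> list[list[int]]:
    """Token-level trigger: compress after every fixed number of generated tokens."""
    return [token_ids[i:i + chunk] for i in range(0, len(token_ids), chunk)]

def seg_thought_level(text: str) -> list[str]:
    """Thought-level trigger: compress after each complete thought; here a 'thought'
    is approximated as a paragraph split on blank lines."""
    return [s.strip() for s in text.split("\n\n") if s.strip()]

# Example: a three-step chain of thought yields three compressible units.
cot = "Step 1: restate the problem.\n\nStep 2: set up the equation.\n\nStep 3: solve it."
print(seg_thought_level(cot))
```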
- Methodology - Data Reconstruction: The original dataset $\mathcal{D}$ is reconstructed by segmenting each output $Y$ into $k$ subsequences $S_i$ with a segmentation function $\mathrm{Seg}(\cdot)$ and inserting the special tokens {<w>, C, [o]} between adjacent subsequences, where <w> is an optional compression trigger, $C = \{[c_i]\}_{i=1}^{|C|}$ consists of $|C|$ special gist tokens that store the compressed content, and [o] is a mandatory output token that enables continued generation based on the compressed content. The enhanced output is $\hat{Y} = \{S_1, \text{<w>}, C, [o], S_2, \text{<w>}, C, [o], \ldots, S_k\}$, and the enhanced dataset is $\hat{\mathcal{D}} = \{(X, \hat{Y})_i\}_{i=1}^{|\hat{\mathcal{D}}|}$.
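The interleaving itself can be sketched as follows, assuming $|C| = 2$ gist tokens and string-level markers (the actual pipeline operates on token IDs):

```python
# Minimal sketch of the enhanced-sequence construction Y_hat = {S1, <w>, C, [o], S2, ...}.
# The token strings and |C| = 2 are illustrative assumptions.
COMPRESS_TRIGGER = "<w>"           # optional trigger signalling "compress now"
GIST_TOKENS = ["[c1]", "[c2]"]     # C: gist tokens that will hold the compressed thought
OUTPUT_TOKEN = "[o]"               # mandatory token from which generation continues

def build_enhanced_output(segments: list[str]) -> list[str]:
    enhanced: list[str] = []
    for i, seg in enumerate(segments):
        enhanced.append(seg)
        if i < len(segments) - 1:  # no compression markers after the final segment S_k
            enhanced += [COMPRESS_TRIGGER, *GIST_TOKENS, OUTPUT_TOKEN]
    return enhanced

print(build_enhanced_output(["S1", "S2", "S3"]))
# ['S1', '<w>', '[c1]', '[c2]', '[o]', 'S2', '<w>', '[c1]', '[c2]', '[o]', 'S3']
```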
- Methodology - Attention Mask: The method constructs a thought-based attention mask. During compression, the gist tokens $C^{(i)}$ can attend only to the question X, the previously compressed content $\{C,[o]\}^{(<i)}$, and the current thought $S_i$, so the LLM compresses the key content of $S_i$ into $C^{(i)}$. During generation, the token $[o]^{(i)}$ can attend only to the question X and the compressed content $\{C,[o]\}^{(\le i)}$, so the LLM continues reasoning from the question and the compressed content rather than the original thoughts.
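A toy sketch of these visibility rules, assuming each token is labeled with a role (question, thought, gist, output) and a thought index; this illustrates the mask logic described above rather than the paper's implementation:

```python
import numpy as np

# Roles: "q" = question token, "s" = thought token, "c" = gist token, "o" = output token.
# blocks[i] gives each token's thought index (question tokens get block -1).
def build_mask(roles: list[str], blocks: list[int]) -> np.ndarray:
    n = len(roles)
    mask = np.zeros((n, n), dtype=bool)            # mask[i, j]: may token i attend to token j?
    for i in range(n):
        for j in range(i + 1):                     # always causal: only look backwards
            r_i, r_j, b_i, b_j = roles[i], roles[j], blocks[i], blocks[j]
            if r_j == "q":
                mask[i, j] = True                  # everyone may attend to the question X
            elif r_i in ("s", "c") and r_j in ("c", "o") and b_j < b_i:
                mask[i, j] = True                  # earlier compressed content {C,[o]}^(<i)
            elif r_i == "c" and r_j in ("s", "c") and b_j == b_i:
                mask[i, j] = True                  # gist tokens also see the current thought S_i
            elif r_i == "o" and r_j in ("c", "o") and b_j <= b_i:
                mask[i, j] = True                  # [o]^(i) sees compressed content up to block i
            elif r_i == "s" and r_j == "s" and b_j == b_i:
                mask[i, j] = True                  # thought tokens attend causally within S_i
    return mask

roles  = ["q", "q", "s", "s", "c", "o", "s", "s"]  # X, X, S1, S1, C(1), [o](1), S2, S2
blocks = [-1, -1,  0,   0,   0,   0,   1,   1]
print(build_mask(roles, blocks).astype(int))
```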
- Methodology - Training Objective: The training objective is to maximize
$P_\theta(S_1 \mid X) \cdot P_\theta(S_2 \mid X, C^{(1)}, [o]^{(1)}) \cdots P_\theta(S_k \mid X, \{C^{(i)}, [o]^{(i)}\}_{i=1}^{k-1})$, where $\theta$ denotes the LLM parameters. During training, the LLM is not trained to predict the input X or the special tokens C and [o]; their positions are excluded from the loss. Training samples are drawn from $\hat{\mathcal{D}}$, and the attention mask above encourages the LLM to learn both to compress and to reason over compressed content. The entire training process remains standard next-token prediction.
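One common way to realize "not trained to predict X, C, and [o]" is label masking with the ignore index used by PyTorch's cross-entropy loss; the role labels below are assumptions reused from the toy mask example, not the paper's code:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by torch.nn.CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, roles: list[str]) -> torch.Tensor:
    """Keep the loss only on thought tokens S_i; mask out X, C, and [o]."""
    labels = input_ids.clone()
    for pos, role in enumerate(roles):
        if role in ("q", "c", "o"):   # question X, gist tokens C, output token [o]
            labels[pos] = IGNORE_INDEX
    return labels

# Toy example reusing the role layout from the mask sketch: X X S1 S1 C(1) [o](1) S2 S2.
input_ids = torch.tensor([11, 12, 21, 22, 31, 32, 41, 42])
roles = ["q", "q", "s", "s", "c", "o", "s", "s"]
print(build_labels(input_ids, roles))
# tensor([-100, -100,   21,   22, -100, -100,   41,   42])
# Training then proceeds as ordinary next-token prediction under the custom attention mask.
```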
- Methodology - Differences Between LightThinker and AnLLM: In AnLLM, the $[c_i]$ token is tasked with both compressing historical information and generating subsequent content, which tightly couples compression and generation; LightThinker decouples these roles via the separate [o] token. In addition, LightThinker allows the gist tokens to attend to X, the historical compressed content, and the current thought during compression, enhancing contextual understanding.
- Experiments: The method is evaluated on four datasets (GSM8K, MMLU, GPQA, and BBH) using the Qwen2.5-7B series and the Llama3.1-8B series. Evaluation metrics include accuracy (Acc), inference time (Time), peak number of tokens (Peak), and Dependency (Dep).
- Results: The results indicate that LightThinker reduces peak token usage and inference time while maintaining competitive accuracy. For instance, with the Qwen model, LightThinker reduces peak token usage by 70% and decreases inference time by 26% compared to the Vanilla model, with only a 1% accuracy drop. The Dep metric shows a significant reduction, indicating effective compression.
- Ablation Studies: Ablation experiments validate the effectiveness of the decoupled token design and the attention mask. Varying the cache size ($|C|$) shows that increasing it improves accuracy and reduces inference time, while also lowering the compression frequency and the number of generated tokens.
- Dependency Metric: The paper introduces the Dependency (Dep) metric, defined as the total number of historical tokens that the generated tokens depend on, summed over the whole generation. A lower Dep value indicates less reliance on the original long context and therefore stronger compression. Closed-form expressions for two baselines are given below, followed by a short worked example.
- Dependency for Vanilla: $\mathrm{Dependency} = \frac{L_O^2}{2} + L_P \times L_O$, where $L_P$ is the initial prompt length and $L_O$ is the model's output length.
- Dependency for H2O: $\mathrm{Dependency} = \frac{2 L_P L_C + 2 L_O L_C - L_P^2 - L_C^2}{2}$, where $L_C$ is the maximum context length set by KV cache compression methods.
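A quick numeric illustration of the two formulas; the lengths $L_P$, $L_O$, and $L_C$ below are arbitrary assumptions chosen only to show the relative scale:

```python
# Worked example of the Dep formulas above; L_P, L_O, L_C values are illustrative assumptions.
def dep_vanilla(l_p: int, l_o: int) -> float:
    # Each generated token attends to the full prompt plus all previously generated tokens.
    return l_o**2 / 2 + l_p * l_o

def dep_h2o(l_p: int, l_o: int, l_c: int) -> float:
    # With an eviction-based cache capped at L_C tokens, dependence stops growing once the cap is hit.
    return (2 * l_p * l_c + 2 * l_o * l_c - l_p**2 - l_c**2) / 2

l_p, l_o, l_c = 200, 2_000, 512
print(f"Vanilla Dep: {dep_vanilla(l_p, l_o):,.0f}")   # 2,400,000 token dependencies
print(f"H2O Dep:     {dep_h2o(l_p, l_o, l_c):,.0f}")  # 975,328, bounded by the cache cap
```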
- Efficiency: LightThinker is the only method that reduces the number of generated tokens compared to Vanilla, with an average reduction of 15% on Qwen and 13% on Llama. LightThinker significantly reduces inference time for long-text generation; for example, when generating 32K tokens, the inference time is reduced by 44%. The distribution of compressed token counts follows a long-tail pattern.
- Limitations: Parameter-efficient fine-tuning methods were not explored; the potential benefit of larger training datasets is unclear; performance degrades significantly on the Llama series models; memory usage can occasionally spike; the number of cache (gist) tokens is fixed; the segmentation function is simplistic; and performance on a broader range of tasks has not been assessed.
In summary, LightThinker presents a method for dynamically compressing thought chains during LLM reasoning, offering a balance between reasoning efficiency and accuracy. The Dep metric provides a means to quantify compression, and experiments demonstrate reduced memory overhead and inference time.