LightThinker: Thinking Step-by-Step Compression (2502.15589v1)

Published 21 Feb 2025 in cs.CL, cs.AI, cs.IR, cs.LG, and cs.MM

Abstract: LLMs have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code will be released at https://github.com/zjunlp/LightThinker.

Summary

  • The paper introduces LightThinker, which compresses LLM thought chains to reduce token usage while maintaining competitive accuracy.
  • It employs hidden state compression and specialized attention masks to decouple the tasks of reasoning and generation.
  • Experiments demonstrate up to a 70% reduction in peak token usage and 26% faster inference with only a marginal drop in accuracy.

The paper introduces LightThinker, a method designed to improve the efficiency of LLMs in complex reasoning tasks by dynamically compressing intermediate thought chains. The approach is motivated by the observation that LLM-generated tokens serve dual purposes: ensuring linguistic fluency and facilitating actual reasoning. LightThinker compresses verbose thought steps into compact representations, discarding the original reasoning chains to reduce the number of tokens stored in the context window, thereby lowering memory overhead and computational costs.

The method involves training the LLM to learn when and how to compress. This is achieved through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. The paper introduces the Dependency (Dep) metric to quantify compression by measuring the reliance on historical tokens during generation.

Here's a breakdown of the key components and findings:

  • Background: The paper discusses the evolution of LLM reasoning from "fast thinking" to "slow thinking," exemplified by Chain-of-Thought (CoT) prompting and o1-like thinking modes. It highlights the computational challenges posed by the Transformer architecture, where the attention mechanism's complexity grows quadratically with context length, and the KV Cache's storage overhead increases linearly.
  • Methodology - LightThinker: The core idea is to train LLMs to dynamically compress the current thought during reasoning, enabling subsequent generation to be based on the compressed content rather than the original long thought. This involves two key questions:
    • When to compress? The paper explores token-level (compressing after a fixed number of tokens) and thought-level compression (compressing after a complete "thought").
    • How to compress? The paper investigates text compression (encoding the current thought into a shorter text) and hidden state compression (compressing the hidden state of the current thought into the hidden states of a few special tokens, i.e., gist tokens). The paper uses hidden state compression because it does not require additional models.
  • Methodology - Data Reconstruction: The original dataset $\mathcal{D}$ is reconstructed by segmenting each output $Y$ into $k$ subsequences $S$ with a segmentation function $Seg(\cdot)$ and inserting the special tokens $\{<w>, C, [o]\}$ between adjacent subsequences $S_i$, where $<w>$ is an optional compression trigger, $C=\{[c_i]\}_{i=1}^{|C|}$ consists of $|C|$ special tokens serving as gist tokens to store compressed content, and $[o]$ is a mandatory output token enabling continued generation based on the compressed content. The enhanced output is $\hat{Y}=\{S_1, <w>, C, [o], S_2, <w>, C, [o], \dots, S_k\}$, and the enhanced dataset is $\hat{\mathcal{D}}=\{(X,\hat{Y})_i\}_{i=1}^{|\hat{\mathcal{D}}|}$ (a construction sketch is given after this list).
  • Methodology - Attention Mask: The method uses thought-based mask construction. During compression, the tokens $C^{(i)}$ can only attend to the question $X$, the previously compressed content $\{C,[o]\}^{(<i)}$, and the current thought $S_i$, allowing the LLM to compress the key content of $S_i$ into $C^{(i)}$. During generation, the token $[o]^{(i)}$ can only attend to the question $X$ and the compressed content $\{C,[o]\}^{(\le i)}$, enabling the LLM to continue reasoning based on the question and the compressed content alone (an illustrative mask construction follows this list).
  • Methodology - Training Objective: The training objective maximizes $P_\theta(S_1 \mid X) \cdot P_\theta(S_2 \mid X, C^{(1)}, [o]^{(1)}) \cdots P_\theta(S_k \mid X, \{C^{(i)}, [o]^{(i)}\}_{i=1}^{k-1})$, where $\theta$ denotes the LLM parameters. During training, the LLM is not trained to predict the input $X$ or the special tokens $C$ and $[o]$. Training samples are drawn from $\hat{\mathcal{D}}$, and the attention mask is employed to encourage the LLM to learn both to compress and to comprehend the compressed content. The entire training process remains standard next-token prediction (a label-masking sketch follows this list).
  • Methodology - Differences Between LightThinker and AnLLM: In AnLLM, the $[c_i]$ token is tasked with both compressing historical information and generating subsequent content, which tightly couples generation and compression; LightThinker decouples these tasks. LightThinker also allows access to $X$, the historical compressed content, and the current thought during compression, enhancing contextual understanding.
  • Experiments: The method is evaluated on four datasets (GSM8K, MMLU, GPQA, and BBH) using the Qwen2.5-7B series and the Llama3.1-8B series. Evaluation metrics include accuracy (Acc), inference time (Time), peak number of tokens (Peak), and Dependency (Dep).
  • Results: The results indicate that LightThinker reduces peak token usage and inference time while maintaining competitive accuracy. For instance, with the Qwen model, LightThinker reduces peak token usage by 70% and decreases inference time by 26% compared to the Vanilla model, with only a 1% accuracy drop. The Dep metric shows a significant reduction, indicating effective compression.
  • Ablation Studies: Ablation experiments validate the effectiveness of the decoupled token design and the attention mask. Varying the cache size ($|C|$) shows that increasing cache size improves accuracy and reduces inference time, while also reducing compression frequency and the number of generated tokens.
  • Dependency Metric: The paper introduces the Dependency (Dep) metric, defined as the total number of historical tokens each generated token depends on. This metric effectively measures the degree of compression, with a lower Dep value indicating reduced reliance on the original long context and more significant compression (helper functions transcribing the closed forms below appear after this list).
    • Dependency for Vanilla: $\mathrm{Dependency} = \frac{{L_O}^2}{2} + L_P \times L_O$, where $L_P$ is the initial prompt length and $L_O$ is the model's output length.
    • Dependency for H2O: $\mathrm{Dependency} = \frac{2 L_P L_C + 2 L_O L_C - {L_P}^2 - {L_C}^2}{2}$, where $L_C$ is the maximum context length set by KV cache compression methods.
  • Efficiency: LightThinker is the only method that reduces the number of generated tokens compared to Vanilla, with an average reduction of 15% on Qwen and 13% on Llama. LightThinker significantly reduces inference time for long-text generation; for example, when generating 32K tokens, the inference time is reduced by 44%. The distribution of compressed token counts follows a long-tail pattern.
  • Limitations: The limitations include unexplored parameter-efficient fine-tuning methods, the untested potential of larger training datasets, significant performance degradation on the Llama series models, occasional high memory peaks, a fixed number of cache (gist) tokens, a simplistic segmentation function, and unassessed performance on a broader range of tasks.
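
The following minimal sketch illustrates the data-reconstruction step described above. The concrete special-token strings (`<w>`, `[c1]`, `[c2]`, `[o]`), the choice of $|C|=2$, and the sentence-level segmentation used for $Seg(\cdot)$ are illustrative assumptions, not the paper's exact configuration.

```python
from typing import List

COMPRESS_TRIGGER = "<w>"                        # optional compression trigger token
GIST_TOKENS = [f"[c{i}]" for i in range(1, 3)]  # |C| = 2 gist tokens (assumption)
OUTPUT_TOKEN = "[o]"                            # mandatory continuation token


def seg(output: str) -> List[str]:
    """Toy Seg(.): split the reasoning output into thought-level segments.
    Here we simply split on sentence boundaries."""
    return [s.strip() + "." for s in output.split(".") if s.strip()]


def build_enhanced_output(output: str) -> List[str]:
    """Build Y_hat = {S_1, <w>, C, [o], S_2, ..., S_k}: the special tokens
    are inserted between adjacent thoughts, but not after the final one."""
    segments = seg(output)
    enhanced: List[str] = []
    for i, s in enumerate(segments):
        enhanced.append(s)
        if i < len(segments) - 1:
            enhanced.append(COMPRESS_TRIGGER)
            enhanced.extend(GIST_TOKENS)
            enhanced.append(OUTPUT_TOKEN)
    return enhanced


# -> ['First thought.', '<w>', '[c1]', '[c2]', '[o]', 'Second thought.']
print(build_enhanced_output("First thought. Second thought."))
```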
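The thought-based attention mask can be reconstructed from the description above roughly as follows. The role encoding, the NumPy representation, and the assumption that gist tokens also attend causally to other gist tokens in the same group are illustrative choices, not the authors' implementation.

```python
import numpy as np


def build_thought_mask(roles):
    """roles: one (kind, idx) pair per position, in sequence order, where kind is
    'x' (question X), 's' (thought S_idx), 'c' (gist C^(idx)), or 'o' ([o]^(idx)).
    Returns a boolean matrix: mask[q, k] is True iff query q may attend to key k."""
    n = len(roles)
    mask = np.zeros((n, n), dtype=bool)
    for q, (qk, qi) in enumerate(roles):
        for k in range(q + 1):                          # causal upper bound
            kk, ki = roles[k]
            if kk == "x":                               # every token may see the question
                mask[q, k] = True
            elif qk == "s":                             # S_i: earlier {C, [o]} and itself
                mask[q, k] = (kk in ("c", "o") and ki < qi) or (kk == "s" and ki == qi)
            elif qk == "c":                             # C^(i): earlier {C, [o]}, current S_i
                mask[q, k] = (kk in ("c", "o") and ki < qi) or (kk in ("s", "c") and ki == qi)
            elif qk == "o":                             # [o]^(i): compressed content up to i
                mask[q, k] = kk in ("c", "o") and ki <= qi
    return mask


# Toy layout: 3 question tokens, thought S_1 (4 tokens), 2 gist tokens, [o]^(1), thought S_2
roles = [("x", 0)] * 3 + [("s", 1)] * 4 + [("c", 1)] * 2 + [("o", 1)] + [("s", 2)] * 4
print(build_thought_mask(roles).astype(int))
```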
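Since training remains plain next-token prediction, the objective above amounts to taking the loss only on the thought tokens $S_i$, while the question $X$ and the special tokens $C$ and $[o]$ are excluded. A minimal label-masking sketch, assuming the common Hugging Face convention of `-100` as the ignore index (an assumption, not necessarily the authors' exact setup):

```python
IGNORE_INDEX = -100  # assumed convention for tokens excluded from the loss


def build_labels(input_ids, roles):
    """input_ids: token ids of the enhanced sequence (X followed by Y_hat);
    roles: matching (kind, idx) pairs as in the mask sketch above.
    Only thought tokens ('s') contribute to the next-token-prediction loss."""
    return [tid if kind == "s" else IGNORE_INDEX
            for tid, (kind, _) in zip(input_ids, roles)]
```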
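For concreteness, the closed-form Dependency values quoted above can be transcribed as plain functions; these simply restate the formulas, and the worked numbers are illustrative.

```python
def dep_vanilla(L_P: int, L_O: int) -> float:
    """Dep for Vanilla decoding: L_O^2 / 2 + L_P * L_O."""
    return L_O ** 2 / 2 + L_P * L_O


def dep_h2o(L_P: int, L_O: int, L_C: int) -> float:
    """Dep for H2O: (2*L_P*L_C + 2*L_O*L_C - L_P^2 - L_C^2) / 2."""
    return (2 * L_P * L_C + 2 * L_O * L_C - L_P ** 2 - L_C ** 2) / 2


# Example: a 100-token prompt with a 1,000-token reasoning trace
print(dep_vanilla(100, 1000))       # 600000.0
print(dep_h2o(100, 1000, 512))      # 427128.0 with a 512-token KV-cache budget
```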

In summary, LightThinker presents a method for dynamically compressing thought chains during LLM reasoning, offering a balance between reasoning efficiency and accuracy. The Dep metric provides a means to quantify compression, and experiments demonstrate reduced memory overhead and inference time.