
CoT-Valve: Tunable Chain-of-Thought Compression

Updated 16 December 2025
  • The paper introduces CoT-Valve, a method that tunes the chain-of-thought length by manipulating a single parameter direction in model space.
  • It employs a lightweight LoRA branch to adjust reasoning granularity continuously, via interpolation (gradual shortening) and extrapolation (ultra-short, distilled chains), without extra classifiers.
  • Experimental benchmarks show that CoT-Valve reduces token usage by up to 70% while maintaining high accuracy, offering enhanced computational efficiency.

CoT-Valve is a parameter-space tuning and inference strategy enabling LLMs to generate chain-of-thought (CoT) reasoning paths of controlled length, with fine-grained tradeoff between inference cost and solution accuracy. It introduces a single direction in model parameter space that, when manipulated, elastically compresses or expands the model’s reasoning chain during inference, with applications to arithmetic and mathematical reasoning benchmarks. The mechanism operates via a lightweight LoRA branch, making it possible to modulate reasoning granularity in one fine-tuned model without prompt engineering, auxiliary classifiers, or multi-model ensembles (Ma et al., 13 Feb 2025). The following sections provide a detailed exposition of conceptual foundation, direction identification, dataset construction, quantitative benchmarking, comparative ablation, and identified limitations.

1. Definition and Formal Objectives

CoT-Valve is constructed around a central objective: dynamic control of CoT length via a single model parameter update. Let $\theta$ denote the original model weights trained for CoT reasoning. Given an input $q$, the model generates a reasoning chain of tokens $\{t_1,\dots,t_n\}$ and a final answer $a$ governed by the joint probability:

$$p(a, t_1,\dots,t_n \mid q; \theta) = p(a \mid t_1,\dots,t_n, q;\theta) \cdot \prod_{i} p(t_i \mid t_{<i}, q;\theta)$$

CoT-Valve identifies a direction $\Delta\theta$ such that applying $\theta' = \theta + \Delta\theta$ systematically yields shorter chains $\{t_1,\dots,t_m\}$ with $m<n$, aiming to preserve ground-truth accuracy. At inference, model parameters are set to

$$\theta(\alpha) = \theta + \alpha\,\Delta\theta,\quad \alpha\in \mathbb{R}^+,$$

where varying $\alpha$ interpolates between longer and shorter reasoning chains. For $\alpha\in(0,1]$, chains are gradually compressed; for $\alpha>1$, extrapolation produces ultra-short, distilled chains.
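
The parameter-space steering above reduces, mechanically, to a single vector operation. A minimal sketch (with numpy arrays standing in for flattened model weights, which are an assumption for illustration, not the paper's implementation):

```python
import numpy as np

# Sketch of theta(alpha) = theta + alpha * delta_theta.
# theta and delta_theta are hypothetical stand-ins for flattened weights.
rng = np.random.default_rng(0)
theta = rng.normal(size=1000)        # base CoT-tuned weights
delta_theta = rng.normal(size=1000)  # learned length-control direction

def valve(theta, delta_theta, alpha):
    """Steer weights toward shorter chains as alpha grows."""
    return theta + alpha * delta_theta

theta_interp = valve(theta, delta_theta, 0.5)  # interpolation: milder compression
theta_extrap = valve(theta, delta_theta, 1.5)  # extrapolation: ultra-short chains
```

Setting $\alpha=0$ recovers the original model exactly, so the control is strictly additive over the frozen base weights.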

2. Identification and Manipulation of the Control Direction

The parameter update direction $\Delta\theta$ is obtained by fine-tuning on pairs of reasoning chains of different lengths for the same $(q, a)$ pair. Specifically, the optimization seeks:

$$\max_{\Delta\theta}\ \mathbb{E}_{(q,a)} \bigg[ p(a \mid t_1,\dots,t_m, q; \theta+\Delta\theta) \cdot \prod_{i=1}^m p(t_i \mid t_{<i}, q; \theta+\Delta\theta) \bigg]$$

with $m<n$ for shortened CoTs. Practically, $\Delta\theta$ is implemented via a low-rank adaptation (LoRA) branch inserted into selected linear layers. At inference, scaling $\Delta\theta$ by $\alpha$ is operationally equivalent to scaling the LoRA branch, facilitating real-time control without any change to prompts or tokenization routines.

Interpolation ($0<\alpha<1$) yields reasoning chains of intermediate length; extrapolation ($\alpha>1$) further distills reasoning. This operation is continuous and model-native, requiring no downstream classifiers.
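
The claimed equivalence between scaling $\Delta\theta$ and scaling the LoRA branch can be checked directly. A small numpy sketch (shapes and the plain $W + \alpha BA$ parameterization are assumptions for illustration):

```python
import numpy as np

# A LoRA update delta_W = B @ A added to a frozen weight W.
# Scaling the branch output by alpha equals applying W + alpha * B @ A.
rng = np.random.default_rng(1)
d, r = 8, 2
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection
x = rng.normal(size=d)

def forward(x, alpha):
    # y = (W + alpha * B @ A) @ x, computed via the low-rank branch
    return W @ x + alpha * (B @ (A @ x))

y_merged = (W + 0.7 * B @ A) @ x  # explicit merged weights
y_branch = forward(x, 0.7)        # scaled LoRA branch
assert np.allclose(y_merged, y_branch)
```

This is why no prompt or tokenizer change is needed: the valve is a single scalar multiplier on an adapter that is already in the forward pass.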

3. Length-Compressible Tuning and Progressive Compression Procedures

There are two principal variants extending CoT-Valve's compression mechanism:

  • CoT-Valve++ (Precise Length-Compressible Tuning): Given a MixChain dataset $\mathcal{D}' = \{(q,a,t_1,\dots,t_m,\beta)\}$, initialize a LoRA branch $\Delta\theta'$ and iteratively update via

    θ̂ ← θ + β·Δθ′
    L ← − [ log p(a | t₁…tₘ, q; θ̂) + ∑_{i=1}^m log p(t_i | t_{<i}, q; θ̂) ]
    backpropagate ∇_{Δθ′} L

    with $\beta$ encoding normalized chain length ($\beta\in[0,1]$).
  • CoT-Valve+P (Progressive Chain Length Compression): Sort MixChain levels $S = [L_0,\dots,L_K]$ by descending chain length. Initialize $\Delta\theta$ coarsely, then fine-tune sequentially across $\{L_k\}$ using standard objectives, yielding gradually more compressed reasoning.
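
The progressive schedule in CoT-Valve+P is a curriculum over chain lengths. A toy sketch of the control flow (the `finetune` callable and the level tuples are hypothetical stand-ins, not the paper's training code):

```python
# CoT-Valve+P control flow: visit MixChain levels from longest to
# shortest chains, fine-tuning the adapter on each level in turn.
def progressive_compress(levels, adapter, finetune):
    """levels: list of (mean_chain_length, dataset) pairs."""
    for _, dataset in sorted(levels, key=lambda lv: -lv[0]):
        adapter = finetune(adapter, dataset)
    return adapter

# Toy usage: each "finetune" step just records which level it saw.
order = []
levels = [(120, "L2"), (480, "L0"), (240, "L1")]
progressive_compress(levels, None, lambda a, d: order.append(d) or a)
print(order)  # → ['L0', 'L1', 'L2'] (longest chains first)
```

The ordering is the substantive choice: the adapter never jumps straight to the shortest chains, which is what the ablation in Section 5 credits for the accuracy retention.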

MixChain datasets are constructed either via human annotation and interpolation (MixChain-C, “cold start”) or by zero-shot interpolation across model checkpoints (MixChain-Z), using $\Delta=\theta_2-\theta_1$ and varying $\alpha$ to produce multiple chain-length variants.
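
The MixChain-Z construction is again a one-line interpolation per variant. A hedged sketch (numpy vectors stand in for the two checkpoints; sampling the resulting models to label data is elided):

```python
import numpy as np

# MixChain-Z: interpolate between a long-chain checkpoint theta_1 and a
# short-chain checkpoint theta_2 along delta = theta_2 - theta_1, producing
# one model per alpha; each would label the same questions zero-shot with
# chains of a different length.
rng = np.random.default_rng(2)
theta_1 = rng.normal(size=1000)  # checkpoint producing long CoTs
theta_2 = rng.normal(size=1000)  # checkpoint producing short CoTs
delta = theta_2 - theta_1

def mixchain_checkpoints(alphas):
    return [theta_1 + a * delta for a in alphas]

variants = mixchain_checkpoints([0.0, 0.25, 0.5, 0.75, 1.0])
```

The endpoints recover the two source checkpoints exactly, so the graded dataset brackets the original long and short behaviors.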

4. Experimental Results and Quantitative Benchmarks

CoT-Valve and its improved variants have been empirically validated on leading mathematical reasoning benchmarks. For QwQ-32B on GSM8K:

| Method | Acc (%) | #Tokens | ACU (×10²) |
|---|---|---|---|
| Original QwQ-32B | 95.07 | 741.1 | 0.40 |
| Prompt-based control | 93.6 | 355.5 | 0.82 |
| CoT-Valve (Ground-Truth) | 94.0 | 352.8 | 0.83 |
| CoT-Valve++ (MixChain-C) | 94.4 | 276.3 | 1.07 |
| CoT-Valve+P (MixChain-Z) | 94.9 | 225.5 | 1.32 |

For QwQ-32B on AIME24:

| Method | Acc/30 | #Tokens | ACU |
|---|---|---|---|
| Original QwQ-32B | 14/30 | 6827.3 | 0.021 |
| Prompt-based control | 13/30 | 6102.5 | 0.022 |
| CoT-Valve+P (MixChain-Z) | 13/30 | 4629.6 | 0.029 |

On GSM8K, CoT-Valve+P reduces chain length by approximately 60–70% with less than 0.2 percentage points of absolute accuracy drop. The accuracy-per-computation-unit (ACU) metric, defined as

$$\text{ACU} = \frac{\text{Accuracy}}{\text{Model Params} \times \text{Token Length}},$$

demonstrates marked computational efficiency improvements over prompt-based approaches.
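
To make the ×10² table scale concrete, a minimal sketch of the ACU formula, assuming accuracy in percent and parameter count in billions (the units are inferred from the table, not stated in the source):

```python
# ACU = accuracy / (model params x token length).
# Assumed units: accuracy in percent, params in billions of parameters.
def acu(accuracy_pct, params_b, tokens):
    return accuracy_pct / (params_b * tokens)

# Reproducing the GSM8K table's ACU (x10^2) for two rows:
print(round(acu(95.07, 32, 741.1) * 100, 2))  # → 0.4  (original QwQ-32B)
print(round(acu(94.9, 32, 225.5) * 100, 2))   # → 1.32 (CoT-Valve+P)
```

The tripling of ACU comes almost entirely from the token-length denominator, since accuracy is nearly unchanged.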

5. Comparison with Prompt-Based Control and Ablation Findings

Prompt-control approaches (“Generate solution in $<N$ tokens”) frequently fail to produce the desired shorter CoTs, with models often exceeding token budgets significantly: a request for fewer than 20 tokens can routinely yield more than 350. CoT-Valve, by directly scaling $\Delta\theta$, achieves smooth control of CoT length and accurate trade-offs: as few as 133 tokens can be generated with 87.5% accuracy on QwQ, versus prompt-based control’s 355 tokens.

Progressive compression schedules in CoT-Valve+P outperform direct supervised fine-tuning on shortest chains, with gradual schedules maintaining higher accuracy for comparable or smaller token counts.

6. Limitations and Prospective Work

CoT-Valve embeds control in a single learned direction $\Delta\theta$ in parameter space, which may limit compressibility for diverse tasks where multiple task-specific directions may be optimal. Current mechanisms provide uniform shortening across the chain; segment-wise and context-dependent compression have not been realized. Extreme extrapolation ($\alpha \gg 1$) risks omitting essential reasoning steps, occasionally yielding under-explained answers.

Optimal scheduling of the compression parameter $\alpha$ for each query—potentially using difficulty estimators—remains an open engineering challenge to balance cost against reliability. Further research may address multi-directional control, finer granularity in chain compression, and adaptive inference pipelines (Ma et al., 13 Feb 2025).
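
One plausible shape for such a scheduler, sketched purely as an illustration of the open problem (the linear mapping and the bounds are assumptions, not something the paper proposes):

```python
# Hypothetical per-query alpha scheduler: easier queries get larger alpha
# (shorter chains), harder queries get smaller alpha (longer chains).
# `difficulty` in [0, 1] would come from an external difficulty estimator.
def schedule_alpha(difficulty, alpha_min=0.2, alpha_max=1.5):
    d = min(max(difficulty, 0.0), 1.0)  # clamp estimator output
    return alpha_max - d * (alpha_max - alpha_min)
```

Any such scheme inherits the estimator's errors: an underestimated difficulty over-compresses exactly the queries that need the longest chains.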

7. Practical Implications and Significance

CoT-Valve establishes a lightweight, model-native framework for elastic reasoning cost management in LLMs. It provides single-model, continuous modulation over the reasoning path’s verbosity and granularity, without reliance on token-level prompt constraints or retraining bespoke models per task length. This property is leveraged to compress reasoning chains in the QwQ-32B-Preview model on GSM8K by over 500 tokens—while maintaining 94.92% accuracy—and on AIME with a single additional error out of 30. These computational gains suggest scalable applicability for resource-constrained or latency-sensitive deployments.

A plausible implication is that parameter-space valves of this form may be generalized for a broader class of generative control problems in neural reasoning systems, enabling post-training adaptation of solution granularity across diverse downstream applications.
