
Thinkless: LLM Learns When to Think (2505.13379v1)

Published 19 May 2025 in cs.CL and cs.AI

Abstract: Reasoning LLMs, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning LLMs. The code is available at https://github.com/VainF/Thinkless

Summary

  • The paper introduces Thinkless, a hybrid framework that trains LLMs to selectively use short-form or detailed chain-of-thought reasoning based on query complexity.
  • The paper employs a two-stage approach using supervised fine-tuning with paired datasets and reinforcement learning with Decoupled Group Relative Policy Optimization to ensure stable and efficient mode selection.
  • The paper demonstrates a 50%-90% reduction in long-chain reasoning usage on benchmarks while largely preserving task performance, highlighting its practical efficiency benefits.

The paper "Thinkless: LLM Learns When to Think" (Thinkless: LLM Learns When to Think, 19 May 2025) introduces a framework designed to make LLMs more computationally efficient by enabling them to adaptively choose between generating concise, short-form answers and detailed, long-form chain-of-thought reasoning. The core idea is to train an LLM to decide when to engage in extensive thinking based on the query's complexity and the model's own capabilities, rather than defaulting to a computationally expensive reasoning process for every input.

The Thinkless framework is implemented in two main stages:

  1. Distillation for Warm-up:
    • Objective: To create a base model capable of generating both short and long-form responses conditioned on specific control tokens.
    • Process:
      • Two expert models are used: $\pi_{\text{think}}$ (a reasoning model for long-form answers) and $\pi_{\text{short}}$ (an instruction-following model for concise answers).
      • A paired dataset $\mathcal{D}_{\text{distill}}$ is generated from a prompt corpus $\mathcal{X}$. For each prompt $x_i$, a long-form answer $a_i^{\text{think}}$ (prefixed with a <think> token) and a short-form answer $a_i^{\text{short}}$ (prefixed with a <short> token) are created.
        Dataset Entry: (x_i, <think> a_i^think, <short> a_i^short)
      • The target LLM ($\pi_\theta$) is then fine-tuned on this dataset using Supervised Fine-Tuning (SFT). This step ensures the model learns to associate the control tokens with the respective response styles and can generate both with comparable likelihood. The paper uses DeepSeek-R1-Distill-Qwen-1.5B as the base model, with DeepSeek-R1-671B as $\pi_{\text{think}}$ and Qwen2.5-Math-1.5B-Instruct as $\pi_{\text{short}}$. Datasets like OpenR1 and OpenThoughts are used for this stage. A minimal sketch of this paired-data formatting follows below.
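
To make the warm-up formatting concrete, here is a minimal Python sketch of how such paired SFT records could be assembled. This is a sketch under stated assumptions: only the <think>/<short> token strings come from the paper, while the record layout, the helper name `build_pair`, and the toy example are illustrative rather than the authors' actual preprocessing code.

```python
# Hypothetical sketch: turn one prompt plus its two expert answers into the
# two supervised fine-tuning (SFT) targets used for the Thinkless warm-up.
# Only the <think>/<short> control tokens come from the paper; the record
# layout and field names are illustrative assumptions.

THINK, SHORT = "<think>", "<short>"

def build_pair(prompt: str, long_answer: str, short_answer: str) -> list[dict]:
    """One distillation entry (x_i, <think> a_i^think, <short> a_i^short),
    expressed as two SFT records that share the same prompt."""
    return [
        {"prompt": prompt, "target": f"{THINK}{long_answer}"},   # from pi_think
        {"prompt": prompt, "target": f"{SHORT}{short_answer}"},  # from pi_short
    ]

# Toy usage with a simple arithmetic question.
records = build_pair(
    prompt="What is 17 * 24?",
    long_answer=" First compute 17 * 20 = 340 and 17 * 4 = 68, so 340 + 68 = 408.",
    short_answer=" 408",
)
for record in records:
    print(record["target"])
```

Training on both targets with comparable weight is what lets the warmed-up model produce either style with similar likelihood, which the subsequent RL stage then exploits.
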
  2. Reinforcement Learning (RL) with Decoupled Group Relative Policy Optimization (DeGRPO):
    • Objective: To train the model to autonomously select the appropriate control token (<think> or <short>) based on the input query, balancing accuracy and efficiency.
    • Challenge with Vanilla GRPO: The paper notes that standard Group Relative Policy Optimization (GRPO) struggles with this task due to:
      • Mode-Accuracy Imbalance: The single control token receives a much weaker gradient signal than the many response tokens.
      • Think-Short Imbalance: Long responses (after <think>) contain many more tokens than short responses, further diluting the gradient for the <think> control token when the loss is normalized by total sequence length. This can lead to "mode collapse," where the model predominantly chooses one mode (typically <short>) early in training.
    • DeGRPO Solution:
      • The learning objective is decomposed to treat the control token and the response tokens separately.
      • The loss for the control token, $\mathcal{L}_{i,0}(\theta)$, is weighted by a coefficient $\alpha$, and the loss for the response tokens, $\sum_{t=1}^{T_i} \mathcal{L}_{i,t}(\theta)$, is normalized by the response length $T_i$:

        $$J_{\text{DeGRPO}}(\theta) = \mathbb{E}_{x,a_i}\Biggl[ \frac{1}{G}\sum_{i=1}^{G} \Bigl( \alpha\, \mathcal{L}_{i,0}(\theta) + \frac{1}{T_i} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}(\theta) - \beta\, \mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr] \Bigr) \Biggr]$$

      • This decoupling and independent normalization ensure the control token receives a consistent gradient scale, preventing mode collapse and allowing more stable training. The paper found $\alpha = 1/1000$ worked well.
    • Reward Design: A simple reward function is used:
      • $1.0$ if $c = \text{<short>}$ and the answer is correct.
      • $1.0 - \gamma$ if $c = \text{<think>}$ and the answer is correct (where $0 < \gamma < 1$ penalizes the longer response).
      • $-1.0$ if the answer is incorrect.
    • RL Training Data: The DeepScaleR dataset (approximately 40K labeled math examples) is primarily used. A sketch of the decoupled loss and reward follows this list.
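
The decoupled objective is straightforward to express in code. The following PyTorch sketch assumes the per-token policy-gradient losses $\mathcal{L}_{i,t}(\theta)$ have already been computed for each rollout in a group (with the control token at index 0) and omits the KL regularizer and clipping; the tensor layout, function names, and the default $\gamma$ are illustrative assumptions, not the released implementation.

```python
# Sketch of a DeGRPO-style loss: the control-token term is scaled by alpha,
# the response-token terms are normalized by their own length T_i, and the
# result is averaged over the group. KL regularization is omitted here.
import torch

def degrpo_loss(per_token_loss: list[torch.Tensor], alpha: float = 1e-3) -> torch.Tensor:
    """per_token_loss[i] has shape (1 + T_i,): index 0 holds the loss of the
    <think>/<short> control token, the rest hold the response-token losses."""
    group_terms = []
    for losses in per_token_loss:
        control = losses[0]                # L_{i,0}: mode-selection token
        response = losses[1:].mean()       # (1 / T_i) * sum_t L_{i,t}
        group_terms.append(alpha * control + response)
    return torch.stack(group_terms).mean() # (1 / G) * sum over the group

def reward(mode: str, correct: bool, gamma: float = 0.5) -> float:
    """Reward scheme from the paper; the gamma value here is only illustrative."""
    if not correct:
        return -1.0
    return 1.0 if mode == "<short>" else 1.0 - gamma

# Toy usage: a group of G = 2 rollouts with different response lengths.
group = [torch.randn(1 + 6), torch.randn(1 + 40)]
print(degrpo_loss(group), reward("<think>", correct=True))
```

Because the control-token term is no longer divided by a possibly very long sequence length, its gradient keeps a consistent scale regardless of which mode was sampled, which is what prevents the early mode collapse seen with vanilla GRPO.
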
### Implementation Considerations and Practical Applications

  • Model Selection: The choice of base LLM and expert models for distillation is crucial. The base model should be capable enough to learn both reasoning styles, and the experts should be proficient in their respective tasks.

  • Dataset Curation: High-quality paired data for the warm-up SFT stage is important. The paper explores different SFT datasets (OpenR1, OpenThoughts) and notes that larger, more comprehensive datasets generally lead to better initial hybrid-model performance, though with diminishing returns.

  • Hyperparameter Tuning: The DeGRPO coefficient $\alpha$ is a key hyperparameter. A larger $\alpha$ accelerates the learning of mode selection, but if it is too large, the model may prematurely assign complex tasks to the long-chain mode before RL has had a chance to improve short-mode performance.

  • Computational Requirements: Training involves both SFT and RL. The paper used a single node with 4 H100 GPUs; SFT ran for 1 epoch and RL for 600 steps, with context lengths of 16K for SFT and 24K for RL.

  • Deployment Strategy: Once trained, the Thinkless model first generates either <short> or <think> and then continues in the corresponding response style, allowing the computational load to be adjusted dynamically at inference time; a minimal routing sketch follows below.
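
As a concrete illustration of this routing behavior, here is a hedged inference sketch using the Hugging Face transformers API: the model's first generated token selects the mode, and the remaining generation budget is chosen accordingly. The checkpoint path, prompt, and token budgets are placeholders, and the exact prompt template of the released model may differ.

```python
# Hypothetical inference-time routing with a Thinkless-style model.
# Checkpoint path, prompt, and budgets are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/thinkless-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

prompt = "What is the sum of the first 10 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# Step 1: the first generated token is expected to be a control token.
first = model.generate(**inputs, max_new_tokens=1, do_sample=False)
mode = tokenizer.decode(first[0, prompt_len:]).strip()

# Step 2: pick a generation budget based on the selected mode and continue.
budget = 8192 if mode == "<think>" else 512  # illustrative budgets
output = model.generate(first, max_new_tokens=budget, do_sample=False)
print(mode, tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True))
```

This in-model mode selection is what the paper contrasts with external router-based pipelines in the comparisons below.
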
### Key Experimental Findings and Practical Implications

  • Efficiency Gains: Thinkless significantly reduces the use of long-chain thinking by 50%-90% on benchmarks such as Minerva Algebra, MATH-500, and GSM8K while largely preserving task performance, translating into substantial savings in token generation, memory, and computational cost.
    • For example, on Minerva Algebra, Thinkless used the <think> mode for only 25.88% of samples, reducing average token usage from roughly 3029 (DeepSeek-R1-1.5B always thinking) to 1144, with a minimal drop in Pass@1 accuracy (0.9577 to 0.9459).

  • Adaptive Behavior: The model learns to use <short> for simpler problems and <think> for more complex ones. On very challenging datasets such as AIME, it naturally adopts a higher proportion of long-form reasoning (100% in the reported table).

  • Overcoming Mode Collapse: DeGRPO effectively addresses the mode collapse observed with vanilla GRPO. The reported training dynamics show a "U-shaped" learning curve for DeGRPO: an initial increase in long-chain outputs (due to their higher initial accuracy) followed by a decrease as short-chain accuracy improves and the model learns better task allocation.

  • Comparison to Alternatives:
    • Fixed Heuristics (e.g., Model Merging, CoT-Valve): These methods require manual tuning of parameters (such as merging ratios or LoRA $\alpha$), and a setting optimal for one dataset can degrade performance on another; Thinkless learns this trade-off automatically.
    • Router-based Models: Some systems use a separate LLM to route queries, but such routers often lack awareness of the downstream model's capabilities; Thinkless integrates the decision into the model itself.

  • Warm-up Stage Effectiveness: The distillation warm-up effectively teaches the model the two response styles. The paper notes that even a reasoning-focused LLM can be distilled to produce good short responses.

### Case Study Insights

A case study on MATH-500 illustrates the model's learned policy:

  • Problems assigned a high probability of <think> are complex, often involving multiple conditions or logical steps.

  • Problems assigned a low probability of <think> (i.e., favoring <short>) are typically simpler arithmetic or direct-application problems.

### Limitations and Future Work

  • The warm-up distillation stage could be further optimized (e.g., via model merging or LoRA to prevent catastrophic forgetting) to improve the initial hybrid model's performance.

  • The current validation is primarily on mathematical reasoning datasets; expanding to more diverse domains would broaden the applicability of Thinkless.

In summary, Thinkless offers a practical and learnable approach to improving LLM inference efficiency. By training a model to decide when to engage in costly chain-of-thought reasoning, it balances performance against computational cost. The DeGRPO algorithm is the key contribution, enabling stable training of this hybrid reasoning capability. This work has significant implications for deploying powerful reasoning LLMs in resource-constrained or latency-sensitive settings.