AdaptThink: Reasoning Models Can Learn When to Think
(2505.13417v1)
Published 19 May 2025 in cs.CL, cs.AI, and cs.LG
Abstract: Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.
Large reasoning models (LRMs) have shown impressive performance by mimicking human-like deep thinking processes, often involving long chains of thought. However, this extensive thinking significantly increases inference overhead and latency, especially for simpler tasks where such detailed reasoning is unnecessary. The paper "AdaptThink: Reasoning Models Can Learn When to Think" (Zhang et al., 19 May 2025) addresses this by proposing a method for models to adaptively choose between a "Thinking" mode (generating a detailed chain of thought) and a "NoThinking" mode (directly generating the final solution).
The authors first demonstrate that a simplified "NoThinking" approach, which prompts the model with an empty thinking segment (i.e., forces the response to begin with </think>), can achieve comparable or even better performance than "Thinking" on relatively simple problems, while significantly reducing token usage. This observation motivates AdaptThink, a reinforcement learning (RL) algorithm designed to teach models when to engage in deep thinking.
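To make the NoThinking setup concrete, here is a minimal sketch of how such prompting could be done with Hugging Face transformers. The model name matches the paper's experiments, but the chat-template behavior and the exact prefill string are assumptions for illustration, not the authors' code:

```python
# Sketch of Thinking vs. NoThinking prompting (illustrative; the prefill convention
# and chat-template details are assumptions, not taken from the paper's code).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

problem = "What is 17 * 24?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": problem}],
    tokenize=False,
    add_generation_prompt=True,
)

# Thinking mode: generate directly from the prompt and let the model produce a full
# chain of thought inside <think> ... </think> before the solution.
# NoThinking mode: prefill an empty thinking segment so generation starts at the solution.
nothinking_prompt = prompt + "</think>"

inputs = tokenizer(nothinking_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```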
AdaptThink has two core components:
Constrained Optimization Objective: The primary goal is to maximize the selection of the "NoThinking" mode without degrading the model's overall performance.
Let $\pi_\theta$ be the policy model, $\pi_{\theta_\text{ref}}$ a reference model (a frozen copy of the initial $\pi_\theta$), $R(x,y)$ the reward (accuracy), and $\mathbbm{1}(y_1\!=\!\text{</think>})$ an indicator function that equals 1 if the model chooses "NoThinking" (i.e., the first generated token $y_1$ is </think>). The optimization problem is to maximize the frequency of NoThinking responses subject to not degrading accuracy relative to the reference model:

$$\max_\theta \ \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot|x)}\big[\mathbbm{1}(y_1\!=\!\text{</think>})\big] \quad \text{s.t.} \quad \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot|x)}\big[R(x,y)\big] \ \ge\ \mathbb{E}_{x\sim\mathcal{D},\, y'\sim\pi_{\theta_\text{ref}}(\cdot|x)}\big[R(x,y')\big].$$
This constrained problem is relaxed by folding the constraint into the reward and is optimized with a PPO-style policy-gradient method, where the advantage is $A(x, y) = \mathbbm{1}(y_1\!=\!\text{</think>})\cdot\delta + R(x, y) - \bar{R}_\text{ref}(x)$; here $\delta$ weights the bonus for choosing NoThinking, and $\bar{R}_\text{ref}(x)$ is the mean reward of responses sampled from $\pi_{\theta_\text{ref}}$ for problem $x$.
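For concreteness, here is a small sketch (not the authors' implementation; function and argument names are illustrative) that computes this advantage for a group of responses sampled for one problem, given their rewards and the precomputed reference mean reward:

```python
from typing import List

def compute_advantages(
    rewards: List[float],       # R(x, y_i) for the K responses sampled for one problem x
    is_nothinking: List[bool],  # True if y_i starts with </think> (NoThinking mode)
    mean_ref_reward: float,     # mean reward of the reference model on x, i.e. \bar{R}_ref(x)
    delta: float = 0.05,        # bonus weight for choosing NoThinking
) -> List[float]:
    """A(x, y) = 1(y_1 = </think>) * delta + R(x, y) - mean_ref_reward."""
    return [
        (delta if nt else 0.0) + r - mean_ref_reward
        for r, nt in zip(rewards, is_nothinking)
    ]

# Example: two NoThinking and two Thinking samples for one problem.
print(compute_advantages(
    rewards=[1.0, 0.0, 1.0, 1.0],
    is_nothinking=[True, True, False, False],
    mean_ref_reward=0.75,
))  # approximately [0.30, -0.70, 0.25, 0.25]
```

A correct NoThinking response thus earns a small extra bonus $\delta$ over an equally correct Thinking response, which is what nudges the model toward skipping thinking whenever doing so does not hurt accuracy.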
Importance Sampling: Since the initial model naturally defaults to "Thinking" mode, it is unlikely to generate any "NoThinking" samples at the start of training (a cold-start problem). To address this, an importance sampling distribution $\pi_\text{IS}(\cdot \mid x)$ is introduced. For the very first generated token, it forces "NoThinking" (first token </think>) and "Thinking" (first token a common thinking-opening word such as "Alright") to be sampled with equal probability (0.5 each). For subsequent tokens ($t > 1$), it follows the old policy $\pi_{\theta_\text{old}}$.
The importance sampling distribution $\pi_\text{IS}(y_t\!=\!a \mid x, y_{<t})$ is therefore:

$$\pi_\text{IS}(y_t\!=\!a \mid x, y_{<t}) = \begin{cases} 0.5 & t = 1,\ a = \text{</think>} \\ 0.5 & t = 1,\ a = \text{"Alright"} \\ 0 & t = 1,\ \text{otherwise} \\ \pi_{\theta_\text{old}}(a \mid x, y_{<t}) & t > 1. \end{cases}$$
The final PPO-style loss $L_\text{AT}(\theta)$ samples responses from $\pi_\text{IS}$ and weights them with the corresponding importance-sampling ratios, which ensures that the model explores and exploits both thinking modes throughout training.
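As a sketch of how this sampling distribution can be realized in practice (the rollout interface and token strings are illustrative assumptions, not the paper's code), one can force the first token of each rollout and evaluate $\pi_\text{IS}$ probabilities when forming the importance ratios:

```python
def pi_is_prob(pi_old_prob: float, t: int, token: str,
               nothink_token: str = "</think>", think_token: str = "Alright") -> float:
    """pi_IS(y_t = a | x, y_<t): a 0.5/0.5 split over the two mode-opening tokens
    at t == 1, and the old policy's probability for every later token."""
    if t == 1:
        return 0.5 if token in (nothink_token, think_token) else 0.0
    return pi_old_prob

def sample_responses(generate_fn, prompt: str, k: int) -> list:
    """Draw k rollouts for one prompt: half forced to start with '</think>' (NoThinking),
    half with 'Alright' (Thinking). `generate_fn(prompt, forced_prefix)` stands in for
    whatever rollout engine continues the generation with the old policy."""
    return [
        generate_fn(prompt, forced_prefix="</think>" if i < k // 2 else "Alright")
        for i in range(k)
    ]
```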
The AdaptThink algorithm can be summarized as follows:
Algorithm AdaptThink:
Input: policy model pi_theta, dataset D, number of training steps M, hyperparameters K, delta, epsilon
Initialize: reference model pi_theta_ref = pi_theta
1. For each x in D:
Sample K responses {y'_i} from pi_theta_ref(.|x)
Calculate mean_R_ref(x) = (1/K) * sum(R(x, y'_i))
2. For step = 1 to M:
Update old policy model pi_theta_old = pi_theta
Define importance sampling distribution pi_IS based on pi_theta_old (Eq. 7 in the paper; see the definition above)
Sample a batch D_b from D
For each x in D_b:
Sample K responses {y_i} from pi_IS(.|x) // Half Thinking, half NoThinking
Estimate the clipped loss L_AT(theta) using these samples and the advantage A(x, y) (Eq. 9 in the paper)
Update pi_theta by minimizing L_AT(theta)
Output: pi_theta
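The loss itself is a standard clipped surrogate, with the ratio taken between the current policy and $\pi_\text{IS}$. A compact PyTorch-style sketch follows (tensor shapes and names are illustrative, not the authors' implementation):

```python
import torch

def adaptthink_loss(logp_theta: torch.Tensor,     # (B, T) token log-probs under pi_theta
                    logp_is: torch.Tensor,        # (B, T) token log-probs under pi_IS
                    advantages: torch.Tensor,     # (B,)   A(x, y) per sampled response
                    response_mask: torch.Tensor,  # (B, T) 1 on response tokens, 0 on padding
                    eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate where the responses were sampled from pi_IS."""
    ratio = torch.exp(logp_theta - logp_is)                  # per-token importance ratio
    adv = advantages.unsqueeze(-1)                           # broadcast over token dimension
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    per_token = -torch.min(unclipped, clipped)               # negate: we maximize the surrogate
    return (per_token * response_mask).sum() / response_mask.sum()
```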
Training Data: DeepScaleR dataset (40K math problems).
Evaluation Data: GSM8K, MATH500, AIME 2024 (math datasets of increasing difficulty), and MMLU (for OOD generalization).
Metrics: Accuracy, average response length, and the ratio of NoThinking responses (RatioNT); a small evaluation sketch follows this setup list.
Hyperparameters: K=16 (samples per prompt), δ=0.05, ϵ=0.2 (PPO clipping). Training was done for 1 epoch.
Baselines: Methods focusing on reducing response length like DPO_Shortest, OverThink, DAST, O1-Pruner, TLMRE, ModelMerging, and RFT_MixThinking.
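These metrics are straightforward to compute from the generated outputs. A small illustrative sketch (the answer checker and the convention that a NoThinking response begins with an empty thinking segment are assumptions, not the authors' evaluation code):

```python
def evaluate(responses, references, is_correct, tokenizer):
    """responses: generated strings; references: gold answers;
    is_correct(response, reference) -> bool: task-specific answer checker."""
    n = len(responses)
    accuracy = sum(is_correct(r, ref) for r, ref in zip(responses, references)) / n
    avg_length = sum(len(tokenizer.encode(r)) for r in responses) / n
    ratio_nt = sum(r.lstrip().startswith("</think>") for r in responses) / n
    return {"accuracy": accuracy, "response_length": avg_length, "RatioNT": ratio_nt}
```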
Key Results:
AdaptThink significantly reduced average response length while improving accuracy. For DeepSeek-R1-Distill-Qwen-1.5B, it achieved a 53.0% length reduction and a 2.4% accuracy increase on average across three math datasets. For the 7B model, it was a 40.1% length reduction and a 2.3% accuracy increase.
The model learned to adapt its thinking mode based on problem difficulty: it used "NoThinking" more for easier datasets (e.g., GSM8K: 86.9% RatioNT for 1.5B model) and "Thinking" more for harder ones (e.g., AIME 2024: 40.4% RatioNT for 1.5B model). This trend was also observed within difficulty levels of the MATH500 dataset.
The hyperparameter δ controls the trade-off: higher δ encourages more "NoThinking" responses (shorter length) but can lead to smaller accuracy gains. Even with δ=0, the model chose "NoThinking" frequently for simpler problems.
Importance sampling was crucial. Without it (naive GRPO), the model failed to learn to use "NoThinking" mode and focused only on improving accuracy within the "Thinking" mode.
Ablation studies showed that AdaptThink only slightly increased "implicit thinking" (thinking-like keywords appearing in "NoThinking" responses) for the 7B model, suggesting it doesn't simply hide thinking within the solution part.
AdaptThink showed good generalization to the MMLU dataset (OOD), reducing length by over 30% and improving accuracy.
Practical Implications:
Efficiency: AdaptThink offers a practical way to reduce the computational cost and latency of LRMs by avoiding unnecessary thinking for simpler queries. This is crucial for real-world applications where users expect fast responses.
Performance: Beyond efficiency, AdaptThink can also improve model accuracy, possibly by allowing the model to allocate its "thinking budget" more effectively to problems that actually require it.
Deployment: Models trained with AdaptThink can be deployed without changes to the inference pipeline, as the decision to think or not is made internally by the model at the beginning of generation.
Training: The RL-based approach requires a dataset with verifiable rewards (e.g., math problems with correct answers). The computational cost for training involves RL iterations, but the authors provide details (e.g., 32 hours on 8xH800 for the 1.5B model).
Limitations:
Experiments were limited to 1.5B and 7B models due to resource constraints.
Training was primarily on mathematical datasets due to the ease of obtaining verifiable rewards. Generalization to other domains might require more diverse training data with reward signals.
In conclusion, AdaptThink presents a novel and effective RL-based method for LRMs to dynamically decide whether to engage in a lengthy thinking process or provide a direct solution, optimizing the balance between reasoning quality and inference efficiency. Its practical benefits include reduced latency and computational costs, alongside potential accuracy improvements.