Concise Reasoning via Reinforcement Learning (2504.05185v2)

Published 7 Apr 2025 in cs.CL

Abstract: Despite significant advancements in LLMs, a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly reduce chains of thought while maintaining or even enhancing accuracy. Additionally, we demonstrate that, while GRPO shares some interesting properties of PPO, it suffers from collapse modes, which limit its reliability for concise reasoning. Finally, we validate our conclusions through extensive experimental results.

Summary

  • The paper reveals that RL fine-tuning with PPO naturally encourages shorter, accurate reasoning without the need for excessively long responses.
  • The proposed two-phase strategy first builds reasoning capacity on challenging problems, then enforces conciseness, achieving a 40-54% reduction in response length.
  • Minimal data for the conciseness phase proves sufficient to boost model efficiency, reduce computational costs, and improve robustness under deterministic decoding.

This paper investigates the common observation that LLMs fine-tuned with reinforcement learning (RL) for reasoning tasks tend to produce very long chains of thought (CoT). The authors argue that this verbosity is not inherently necessary for accuracy but rather a byproduct of the RL optimization process, specifically Proximal Policy Optimization (PPO). They propose a method to train models for more concise reasoning without sacrificing, and sometimes even improving, accuracy (2504.05185).

Problem: State-of-the-art reasoning models often generate excessively long responses, increasing computational costs, latency, and resource requirements. This is often assumed to be necessary for achieving high accuracy.

Core Argument & Analysis:

  1. Conciseness-Accuracy Correlation: The paper first establishes that, counter-intuitively, correct answers generated by various LLMs (both reasoning-focused and general models) tend to be significantly shorter than incorrect ones across benchmarks like MATH500, AIME'24, and MMLU-STEM (Table 1). This suggests that excessive length is not a prerequisite for correctness.
  2. RL Optimization Drives Length: The authors frame each reasoning problem as a Markov Decision Process (MDP). They analyze the PPO loss function commonly used in RL fine-tuning. Their mathematical analysis (Section 4) shows that when using PPO with Generalized Advantage Estimation (GAE), a discount factor γ = 1, and λ < 1:
    • Positive terminal rewards (correct answers, r > 0) lead to a negative average PPO loss L_avg, which is minimized (made more negative) by shorter response lengths T.
    • Negative terminal rewards (incorrect answers, r < 0) lead to a positive average PPO loss, which is minimized (made closer to zero) by longer response lengths (a minimal numeric sketch of this length effect follows the list).
    • Setting λ = 1 is shown to be problematic, potentially causing instability (overflow/underflow issues) and being highly sensitive to value estimation errors (see the paper's appendix figures).
  3. Minimal Data Sufficiency: The MDP perspective suggests RL training doesn't necessarily require massive datasets. Unlike supervised learning, online RL explores the response space dynamically, mitigating overfitting concerns even with small problem sets.
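
The length effect in item 2 can be checked numerically. The following is a minimal sketch (not the paper's code) under the simplifying assumption that value estimates are near zero, so the only nonzero TD error is the terminal reward; with γ = 1 the GAE advantage at token t is then λ^(T-1-t)·r, and the token-averaged negative advantage serves as a proxy for the PPO policy loss.

```python
def avg_ppo_loss_proxy(terminal_reward: float, T: int, lam: float = 0.95) -> float:
    """Token-averaged negative advantage, a proxy for the PPO policy loss when the
    importance ratio is ~1. With gamma = 1, a terminal-only reward, and value
    estimates near zero, the GAE advantage at token t is lam**(T-1-t) * reward."""
    advantages = [lam ** (T - 1 - t) * terminal_reward for t in range(T)]
    return -sum(advantages) / T

for r in (+1.0, -0.5):            # correct vs. incorrect terminal reward
    for T in (100, 1000, 10000):  # response length in tokens
        print(f"r={r:+.1f}  T={T:5d}  avg loss ~ {avg_ppo_loss_proxy(r, T):+.6f}")
```

For r > 0 the average loss is negative and becomes more negative as T shrinks; for r < 0 it is positive and shrinks toward zero as T grows, so the per-token objective rewards verbosity exactly when the model is failing.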

Proposed Methodology: Two-Phase RL Strategy

Based on the analysis, the paper proposes a two-phase RL fine-tuning approach:

  1. Phase 1: Enhance Reasoning Capacity: Train the model on challenging problems. This phase focuses on improving the model's core problem-solving abilities. As the model frequently fails (receives negative rewards), PPO tends to encourage longer responses during this phase, consistent with observations in models like DeepSeek-R1. This phase can leverage existing RL-trained reasoning models.
  2. Phase 2: Enforce Conciseness: Continue RL training, but focus on a smaller set of problems that the model can solve at least occasionally (non-zero probability p_a of success). Since the model now achieves positive rewards more often, PPO naturally favors shorter responses. This phase aims to reduce verbosity while preserving or improving the accuracy gained in Phase 1. The paper demonstrates this phase can be effective with remarkably few examples (e.g., 8 problems).
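
A minimal sketch of how the small Phase 2 set might be assembled, assuming empirical solve rates (p_a) have already been measured for each candidate problem, e.g. by sampling several responses per problem with the current model. The selection heuristic below is illustrative, not the paper's procedure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Problem:
    text: str
    answer: str
    solve_rate: float  # empirical p_a measured with the current model (assumed given)

def select_phase2_problems(problems: List[Problem], k: int = 8) -> List[Problem]:
    """Keep problems the model solves at least occasionally (p_a > 0), so positive
    terminal rewards occur and PPO's length pressure shifts toward conciseness."""
    occasionally_solvable = [p for p in problems if p.solve_rate > 0.0]
    # Illustrative tie-break: prefer problems that are neither trivial nor unsolved.
    occasionally_solvable.sort(key=lambda p: abs(p.solve_rate - 0.5))
    return occasionally_solvable[:k]
```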

Implementation Details:

  • Models: Experiments used DeepSeek-R1 distilled Qwen models (1.5B, 7B) and Qwen-Math-v2.5 models.
  • RL Algorithm: PPO with GAE (λ = 0.95, γ = 1).
  • Reward Scheme: A simple scheme based on the final answer's correctness and formatting (a minimal implementation sketch follows this list):
    • +1: Correct answer, correctly boxed.
    • -0.5: Incorrect answer, correctly boxed.
    • -1: No boxed answer.
  • Training Data: Small sets of problems (4 or 8) from MATH, AIME, or OlympiadBench datasets were used for different experiments.
  • Sampling: During training, 8 independent responses were generated per example. Evaluation used 4 samples per query at temperature 0.6, top-p 0.95 unless otherwise specified (e.g., temperature 0 experiments).
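
A minimal sketch of the reward scheme above, assuming an exact string match on the \boxed{...} content is an acceptable correctness check (real math-answer grading typically requires normalization, omitted here):

```python
import re

_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def terminal_reward(response: str, gold_answer: str) -> float:
    """+1 for a correct boxed answer, -0.5 for an incorrect boxed answer,
    -1 when no boxed answer is present."""
    boxed = _BOXED.findall(response)
    if not boxed:
        return -1.0                # no boxed answer at all
    predicted = boxed[-1].strip()  # grade the final boxed answer
    return 1.0 if predicted == gold_answer.strip() else -0.5

# Example: terminal_reward(r"... so the answer is \boxed{42}.", "42") -> 1.0
```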

Experimental Results & Key Findings:

  1. Problem Difficulty Matters: Training on harder problems (low p_a) initially increased length, while training on easier problems (higher p_a) led to quicker length reduction alongside accuracy improvements (see the paper's difficulty-comparison figure).
  2. Phase 2 Effectiveness: Applying the second phase (training on 8 MATH problems) to R1-distilled Qwen models significantly reduced average response length (by 40-54%) across MATH, AIME, AMC, and even MMLU-STEM benchmarks, while maintaining or slightly improving accuracy (see the paper's checkpoint table and evaluation figures).
  3. Improved Robustness: The post-trained models showed much less performance degradation when switching from sampling (temperature 0.6, n=4) to greedy decoding (temperature 0, n=1), indicating increased robustness (see the paper's sampling-degradation table; a decoding-setup sketch follows this list). For R1 1.5B on MATH500, the accuracy drop was 16.9% for the baseline but 0% for the concise model.
  4. Boosting Non-RL Models: Applying RL (the Phase 1 concept) to Qwen-Math-v2.5 models (trained via supervised learning, not RL) using just 4 MATH problems yielded substantial accuracy gains (up to 30% absolute improvement on MATH500 for the 1.5B model), demonstrating the power of minimal RL fine-tuning (see the paper's MATH comparison table).
  5. Importance of λ < 1: Experiments confirmed the theoretical analysis, showing unstable training (value overflow/underflow) when λ = 1 (see the paper's appendix figures).
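
The robustness comparison in item 3 contrasts two decoding regimes. The snippet below sketches those settings with Hugging Face transformers; it is an illustrative setup rather than the paper's evaluation harness, using the publicly released R1-distilled 1.5B checkpoint as the model id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # model family used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Solve: what is 17 * 24? Put the final answer in \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampled evaluation: 4 responses per query, temperature 0.6, top-p 0.95.
sampled = model.generate(**inputs, do_sample=True, temperature=0.6, top_p=0.95,
                         num_return_sequences=4, max_new_tokens=512)

# Greedy (deterministic) evaluation: temperature 0, single response.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=512)

for out in tokenizer.batch_decode(sampled, skip_special_tokens=True):
    print(len(out), "chars (sampled)")
print(len(tokenizer.decode(greedy[0], skip_special_tokens=True)), "chars (greedy)")
```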

Practical Implications for Developers:

  • Challenge Length Assumption: Don't assume longer CoT is always better or necessary. Correct reasoning can often be concise.
  • Cost Reduction: Training for conciseness can significantly reduce token usage during inference, leading to lower API costs and faster response times.
  • Efficient Fine-tuning: The second phase of RL tuning requires minimal data (potentially just a handful of diverse, solvable problems) and compute, making it practical even in resource-constrained settings.
  • Improved Reliability: Concise models might be more robust, performing better under deterministic decoding (temperature 0), which is often desirable for production systems.
  • PPO Configuration: When using PPO for LLM fine-tuning, setting λ < 1 (e.g., 0.95) is crucial for stability and predictable behavior regarding response length based on rewards. Avoid λ = 1 (a consolidated hyperparameter sketch follows this list).
  • Dataset Curation: Ensure RL fine-tuning datasets include a mix of problems, including some that the model can solve, to naturally encourage conciseness over time or specifically in a dedicated second phase.
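
For convenience, the hyperparameters quoted in this summary are collected below as a plain dictionary rather than any specific library's config object; the field names are illustrative, not the paper's exact configuration.

```python
# Illustrative hyperparameter summary (field names assumed, values from the summary above).
ppo_config = {
    "gamma": 1.0,               # undiscounted returns over the response
    "gae_lambda": 0.95,         # strictly below 1; lambda = 1 was unstable
    "rollouts_per_example": 8,  # 8 independent responses per training problem
    "phase2_problems": 8,       # conciseness phase can work with as few as 8 problems
    "reward": {
        "correct_boxed": +1.0,
        "incorrect_boxed": -0.5,
        "no_boxed_answer": -1.0,
    },
}
```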

In conclusion, this work provides a theoretical and empirical basis for understanding the relationship between RL optimization (PPO) and response length in reasoning LLMs. It offers a practical two-phase RL strategy to create more concise, efficient, and robust reasoning models, potentially using very small datasets for the conciseness-tuning phase.