Demystifying Long Chain-of-Thought Reasoning in LLMs (2502.03373v1)

Published 5 Feb 2025 in cs.CL and cs.LG

Abstract: Scaling inference compute enhances reasoning in LLMs, with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.

This paper investigates how LLMs learn to perform long chain-of-thought (CoT) reasoning, which involves extended, structured thought processes including backtracking and error correction, often associated with improved performance on complex tasks (Yeo et al., 5 Feb 2025). The paper focuses on practical strategies for training models to generate these long CoTs using supervised fine-tuning (SFT) and reinforcement learning (RL).

Key Findings and Practical Implications:

  1. Impact of SFT on Long CoT:
    • SFT Scaling: SFT using long CoT data (distilled from capable models like QwQ-32B-Preview) scales to significantly higher accuracy on reasoning tasks compared to SFT with short CoT data, which tends to plateau earlier at lower performance levels. For implementation, this suggests prioritizing the collection or generation of high-quality, long reasoning trace examples for SFT if the goal is high performance on complex reasoning.
    • SFT for RL Initialization: Initializing RL training with a model already fine-tuned on long CoT data leads to much larger performance gains from RL compared to initializing with a short CoT SFT model. Models trained with short CoT SFT see minimal improvement from subsequent RL. Practically, performing SFT with long CoT data first appears crucial for unlocking the full potential of RL for reasoning tasks.
    • Source of Long CoT Data: SFT data distilled from models exhibiting emergent long CoT patterns (like QwQ-32B-Preview) yields better generalization and larger RL gains than SFT data based on constructed long CoT patterns (e.g., using predefined action prompts). This implies that mimicking the natural, complex reasoning structures that emerge in advanced models is more effective than trying to engineer them from simpler components.
  2. Reward Design for Stable RL:
    • CoT Length Instability: Standard RL with simple outcome-based rewards (e.g., +1 for correct answer) can lead to unstable CoT length growth. Models might generate excessively long CoTs that exceed context limits, causing performance degradation.
    • Cosine Reward Shaping: To address instability, the paper proposes a "Cosine Reward" function. This function shapes the reward based on both correctness and CoT length using a piecewise cosine curve. It incentivizes shorter CoTs for correct answers (efficiency) and longer CoTs for incorrect answers (encouraging more "thinking time" when needed). It also includes a penalty for exceeding the maximum length. This provides a practical mechanism to control and stabilize CoT length during RL training.
      # Cosine reward: shape the outcome reward by CoT length (runnable Python sketch of the logic).
      import math
      from dataclasses import dataclass

      @dataclass
      class CosineRewardParams:
          correct_reward_len0: float = 2.0      # reward for a correct answer at length 0
          correct_reward_maxlen: float = 1.0    # reward for a correct answer near max_length
          wrong_reward_len0: float = -10.0      # penalty for a wrong answer at length 0
          wrong_reward_maxlen: float = 0.0      # penalty for a wrong answer near max_length
          exceed_length_penalty: float = -10.0  # penalty when generation hits the length cap

      def cosine_schedule(t: int, T: int, value_at_0: float, value_at_T: float) -> float:
          """Cosine interpolation from value_at_0 (at t = 0) to value_at_T (at t = T)."""
          return value_at_T + 0.5 * (value_at_0 - value_at_T) * (1.0 + math.cos(math.pi * t / T))

      def calculate_reward(correct: bool, generation_length: int, max_length: int,
                           params: CosineRewardParams) -> float:
          if generation_length >= max_length:
              return params.exceed_length_penalty
          if correct:
              # Reward decreases with length for correct answers (favors efficiency).
              return cosine_schedule(generation_length, max_length,
                                     params.correct_reward_len0, params.correct_reward_maxlen)
          # Penalty shrinks toward 0 with length for wrong answers (allows more "thinking time").
          return cosine_schedule(generation_length, max_length,
                                 params.wrong_reward_len0, params.wrong_reward_maxlen)
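      # Usage sketch (hypothetical 4096-token budget; expected values assume the defaults above):
      params = CosineRewardParams()
      calculate_reward(correct=True, generation_length=512, max_length=4096, params=params)    # ~ +1.96
      calculate_reward(correct=False, generation_length=3500, max_length=4096, params=params)  # ~ -0.51
      calculate_reward(correct=True, generation_length=4096, max_length=4096, params=params)   # -10.0 (length cap)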
    • Reward Hacking and Repetition Penalty: With sufficient training, models can "hack" length-based rewards by repeating content rather than reasoning further. An N-gram repetition penalty, applied per-token rather than sparsely at the end of the sequence, effectively mitigates this (a per-token sketch appears after this list).
    • Optimal Discount Factors: Different reward components benefit from different discount factors (γ). The correctness reward works better with a high γ (close to 1) to propagate value across long CoTs, while the repetition penalty is more effective with a lower γ (e.g., 0.99) for more localized feedback. This suggests modifying the Generalized Advantage Estimation (GAE) calculation to handle multiple rewards with distinct γ values (a per-channel sketch appears after this list). Setting γ too low for the correctness reward can lead to "short-term thinking," where the model excessively branches or gives up on paths too quickly.
    • Context Window: Simply increasing the context window size doesn't automatically improve performance; models require sufficient training compute (more samples/iterations) to learn to effectively utilize the larger context.
  3. Scaling Verifiable Rewards with Noisy Data:
    • Importance of Verifiable Rewards: Outcome-based rewards that can be verified (e.g., checking math answers against ground truth) are crucial for stable long CoT RL, avoiding reward hacking issues common with learned reward models.
    • Using Web-Extracted Data (WebInstruct): High-quality verifiable data is scarce. The paper explores using noisy, large-scale web-extracted QA data (WebInstruct (Hu et al., 20 May 2024)).
    • SFT with Mixed Data: Adding diverse, noisy WebInstruct data during SFT (e.g., 50% MATH, 50% WebInstruct) improves generalization, especially on out-of-distribution (OOD) benchmarks like MMLU-Pro, compared to using only high-quality MATH data.
    • RL with Noisy Data: For RL using noisy WebInstruct data, filtering the dataset to include only problems with extractable short-form answers and using a rule-based verifier yielded the best results (a filtering and verification sketch appears after this list). If filtering isn't feasible (e.g., free-form answers), a model-based verifier performs better than a rule-based one. This provides a practical pathway to scale RL using readily available web data, significantly boosting OOD performance even compared to models trained only on gold-standard data.
  4. RL from Base Models and Emergence:
    • Nuances in Emergence: The paper cautions against simple interpretations of "emergent" behaviors (like self-correction keywords) during RL from base models. Such behaviors might already exist latently in the base model, and RL might not significantly increase their frequency even while improving accuracy. Length scaling might also be influenced by KL divergence penalties rather than genuine capability gains, especially in smaller models.
    • SFT vs. Base Model RL: On Qwen2.5-Math-7B, initializing RL from a long CoT SFT model significantly outperformed RL applied directly to the base model, suggesting SFT provides a more effective starting point for developing complex reasoning via RL.
    • Origins of Long CoT: Long CoT abilities like branching and error correction might stem from patterns learned during pre-training on human dialogue data (e.g., forum discussions on the web). RL might be guiding the model to recombine these latent skills effectively.
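
As referenced in item 2, a per-token N-gram repetition penalty can be sketched as follows. This is a minimal illustration under assumed conventions (the 4-gram window and fixed per-token penalty value are hypothetical choices), not the paper's exact implementation.

  # Per-token N-gram repetition penalty (illustrative sketch; n and penalty are hypothetical).
  from typing import List

  def ngram_repetition_penalties(token_ids: List[int], n: int = 4,
                                 penalty: float = -0.05) -> List[float]:
      """Return one value per token: `penalty` for each token that completes an n-gram
      already seen earlier in the same generation, 0.0 otherwise."""
      penalties = [0.0] * len(token_ids)
      seen = set()
      for i in range(n - 1, len(token_ids)):
          ngram = tuple(token_ids[i - n + 1 : i + 1])
          if ngram in seen:
              penalties[i] = penalty  # dense per-token signal instead of one end-of-sequence penalty
          seen.add(ngram)
      return penalties

These per-token penalties would be added to the per-token reward stream alongside the sparse correctness reward before advantage estimation.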
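
For the distinct discount factors in item 2, one hedged sketch is to run GAE separately per reward channel, each with its own γ, and sum the per-channel advantages. The assumption of one value estimate per channel is an illustrative choice, not necessarily the authors' exact formulation.

  # GAE computed independently per reward channel with its own gamma, then summed (sketch).
  import numpy as np

  def gae(rewards: np.ndarray, values: np.ndarray, gamma: float, lam: float) -> np.ndarray:
      """Standard GAE over one reward channel; `values` has length T + 1 (bootstrap value last)."""
      T = len(rewards)
      advantages = np.zeros(T)
      running = 0.0
      for t in reversed(range(T)):
          delta = rewards[t] + gamma * values[t + 1] - values[t]
          running = delta + gamma * lam * running
          advantages[t] = running
      return advantages

  def multi_gamma_advantages(reward_channels, value_channels, gammas, lam=0.95):
      """Sum per-channel advantages, e.g. correctness reward with gamma close to 1.0 and the
      repetition penalty with gamma = 0.99 for more localized credit assignment."""
      return sum(gae(r, v, g, lam) for r, v, g in zip(reward_channels, value_channels, gammas))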
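
As referenced in item 3, filtering noisy web-extracted QA pairs for short-form answers and checking model outputs with a rule-based verifier might look like the sketch below; the extraction and normalization heuristics are assumptions for illustration, not the paper's exact rules.

  # Keep only QA pairs with extractable short-form answers; verify by normalized string matching.
  import re
  from typing import List, Optional, Tuple

  def extract_short_answer(text: str) -> Optional[str]:
      """Prefer a \\boxed{...} answer; otherwise accept a short trailing token (e.g., a number)."""
      m = re.search(r"\\boxed\{([^{}]+)\}", text)
      if m:
          return m.group(1).strip()
      tokens = text.strip().split()
      if tokens and len(tokens[-1]) <= 20:
          return tokens[-1]
      return None

  def normalize(answer: str) -> str:
      return re.sub(r"[\s$,]", "", answer).rstrip(".").lower()

  def rule_based_verify(model_output: str, reference_answer: str) -> bool:
      pred = extract_short_answer(model_output)
      return pred is not None and normalize(pred) == normalize(reference_answer)

  def filter_short_form(qa_pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
      """Drop examples whose reference solution has no extractable short-form answer."""
      return [(q, a) for q, a in qa_pairs if extract_short_answer(a) is not None]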

Overall Practical Guidance:

  • To train LLMs for complex reasoning, start with SFT on high-quality, emergent long CoT data distilled from capable models.
  • Follow up with RL using PPO. Implement reward shaping (like the Cosine Reward) to stabilize length and a repetition penalty to prevent hacking.
  • Carefully tune discount factors (γ) for different reward components (high for correctness, low for penalties).
  • Augment limited gold-standard data (like MATH) with large-scale, noisier web data (like WebInstruct) for both SFT and RL. When using noisy data for RL, prefer filtering for verifiable (e.g., short-form) answers combined with rule-based verifiers.
  • Be critical when analyzing emergent behaviors during RL, especially with smaller base models; correlate observations with metrics like KL divergence. SFT initialization generally leads to better RL outcomes than direct RL from the base model for complex reasoning.
Authors (5)
  1. Edward Yeo (1 paper)
  2. Yuxuan Tong (4 papers)
  3. Morry Niu (2 papers)
  4. Graham Neubig (342 papers)
  5. Xiang Yue (72 papers)