
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs (2503.01307v1)

Published 3 Mar 2025 in cs.CL and cs.LG

Abstract: Test-time inference has emerged as a powerful paradigm for enabling LLMs to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in LLMs on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful LLMs employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some LLMs effectively utilize additional computation while others plateau.

Summary

  • The paper demonstrates that baseline cognitive behaviors, like backtracking and verification, are crucial for effective self-improvement via reinforcement learning.
  • It shows that targeted supervised fine-tuning and pretraining data curation significantly amplify these behaviors, leading to marked gains in complex reasoning tasks.
  • The analysis offers actionable insights for model development, emphasizing the importance of inherent reasoning strategies before applying reinforcement learning.

This paper investigates why some language models (LMs) significantly improve their reasoning capabilities through reinforcement learning (RL) while others show minimal gains, even under identical training conditions (2503.01307). The core finding is that the initial presence of specific "cognitive behaviors" in a base model is crucial for enabling effective self-improvement.

The paper focuses on the Countdown mathematical puzzle game as a testbed for reasoning. It compares two similarly sized models, Qwen-2.5-3B and Llama-3.2-3B, trained using Proximal Policy Optimization (PPO). Qwen shows substantial performance gains and generates longer, more complex reasoning traces after RL, whereas Llama plateaus early.
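
The paper frames Countdown as a verifiable task: given a set of numbers and a target, the model must combine the numbers with basic arithmetic to reach the target, so correctness can be checked programmatically and used as an RL reward. The snippet below is a minimal sketch of such a checker, assuming the model's final answer is a bare arithmetic expression and that every given number must be used exactly once; the paper's exact reward implementation may differ.

```python
import ast
import operator

# Allowed binary operators for Countdown expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Recursively evaluate a parsed arithmetic expression, collecting the leaf numbers used."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value, [node.value]
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        lval, lnums = _eval(node.left)
        rval, rnums = _eval(node.right)
        return OPS[type(node.op)](lval, rval), lnums + rnums
    raise ValueError("disallowed expression")

def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Return 1.0 for a valid expression that uses each given number exactly once and hits the target."""
    try:
        value, used = _eval(ast.parse(expression, mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    if sorted(used) != sorted(numbers):  # every provided number used exactly once
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

print(countdown_reward("(30 - 25 + 3) * 4", [25, 30, 3, 4], 32))  # 1.0
```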

To understand this difference, the researchers propose a framework analyzing four key cognitive behaviors in the models' outputs (illustrated in the annotated example after this list):

  1. Verification: Systematically checking intermediate results (e.g., "Let's check: (30-25+3)*4 = (5+3)*4 = 8*4 = 32. Correct.").
  2. Backtracking: Explicitly abandoning failing approaches and trying alternatives (e.g., "Trying 30*3 = 90... too high. Let's try (30-25)...").
  3. Subgoal Setting: Breaking down the problem into smaller, manageable steps (e.g., "First, let's try to get a number close to 32 using 25 and 30...").
  4. Backward Chaining: Reasoning from the desired outcome back towards the inputs (e.g., "To get 32, we could multiply by 4. Can we make 8 from 25, 30, 3?").
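
For concreteness, the snippet below hand-tags the illustrative phrases above with the behavior each exemplifies; the labels and data structure are purely illustrative and are not output of the paper's classifier.

```python
# Hypothetical hand-labelled Countdown trace (numbers 25, 30, 3, 4; target 32),
# pairing each example phrase above with the behavior it exemplifies.
annotated_trace = [
    ("subgoal_setting",   "First, let's try to get a number close to 32 using 25 and 30..."),
    ("backward_chaining", "To get 32, we could multiply by 4. Can we make 8 from 25, 30, 3?"),
    ("backtracking",      "Trying 30*3 = 90... too high. Let's try (30-25)..."),
    ("verification",      "Let's check: (30-25+3)*4 = (5+3)*4 = 8*4 = 32. Correct."),
]

# Per-trace behavior counts of the kind the paper's analysis pipeline produces.
behavior_counts = {
    behavior: sum(1 for tag, _ in annotated_trace if tag == behavior)
    for behavior in ("verification", "backtracking", "subgoal_setting", "backward_chaining")
}
print(behavior_counts)  # each behavior appears once in this toy trace
```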

Initial analysis revealed that the base Qwen model naturally exhibited these behaviors, especially verification and backtracking, at a much higher rate than the base Llama model. This led to the hypothesis that these initial behavioral tendencies are necessary for RL to successfully amplify productive reasoning strategies that utilize increased computation time (longer outputs).

The paper tests this hypothesis through two main interventions:

1. Priming via Supervised Fine-Tuning (SFT):

  • Method: Llama was fine-tuned on small datasets (1000 examples) containing synthetic Countdown solutions generated by Claude-3.5-Sonnet. These datasets were specifically crafted to exhibit target behaviors (e.g., "backtracking only," "backtracking + verification," "all strategies").
  • Implementation: SFT was performed for 5 epochs using AdamW with a learning rate of 1e-5. See Appendix B for details; a minimal training sketch follows this list.
  • Results: Priming Llama, particularly with traces containing backtracking, enabled it to match or exceed Qwen's performance improvement during subsequent RL training. RL then selectively amplified the useful behaviors (backtracking, verification) while suppressing others that were less effective for Countdown (subgoal setting, backward chaining).
  • Crucial Controls:
    • Priming with empty or placeholder chains-of-thought did not improve Llama's RL performance, confirming that the specific behaviors, not just longer context, were necessary.
    • Priming Llama with incorrect solutions that still demonstrated the correct reasoning patterns (e.g., backtracking, verification) yielded RL performance gains comparable to priming with correct solutions. This strongly indicates that the presence of the reasoning behavior itself is the critical factor enabling self-improvement, more so than the correctness of the initial examples.
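
A minimal sketch of the priming step, assuming a standard Hugging Face causal-LM fine-tune. The 5 epochs and AdamW with learning rate 1e-5 follow the paper; the dataset file, text field, batch size, and other settings are illustrative assumptions rather than the paper's exact code.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# Hypothetical priming file: 1000 Countdown traces from a stronger model,
# each exhibiting target behaviors such as backtracking and verification.
dataset = load_dataset("json", data_files="countdown_priming_backtracking.jsonl")["train"]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

def tokenize(example):
    # Assumes each record stores the full prompt plus reasoning trace under "text".
    return tokenizer(example["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="llama3b-countdown-primed",
    num_train_epochs=5,             # 5 epochs, as reported
    learning_rate=1e-5,             # AdamW with lr 1e-5, as reported
    per_device_train_batch_size=8,  # assumption: batch size not taken from the paper
    optim="adamw_torch",
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```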

2. Continued Pretraining on Curated Data:

  • Method: Recognizing that priming might be domain-specific, the researchers investigated modifying the pretraining data. They analyzed OpenWebMath and found cognitive behaviors like backtracking to be infrequent. They used a classifier model (Qwen-2.5-32B) to identify documents in OpenWebMath rich in the target behaviors and created a "behavior-enriched" dataset (8.3M tokens). A control dataset minimized these behaviors. Llama was then subjected to continued pretraining on these datasets before RL.
  • Implementation: Documents were classified for behaviors, rewritten into a question-thought-answer format (preserving or excluding the behaviors), and used for continued pretraining. See Appendix E for details; a curation sketch follows this list.
  • Results: Llama pretrained on the behavior-enriched dataset achieved RL performance similar to Qwen. The control Llama model showed little improvement. This demonstrates that the capacity for self-improvement can be engineered by curating pretraining data to include desired cognitive patterns.
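
A hedged sketch of the curation loop described above: classify OpenWebMath documents for the target behaviors, keep the behavior-rich ones up to roughly the reported 8.3M-token budget, and rewrite them into question-thought-answer form alongside a behavior-stripped control set. The helper functions are placeholders for calls to the Qwen-2.5-32B classifier/rewriter, and the filtering threshold and token counting are illustrative assumptions, not the paper's implementation.

```python
BEHAVIORS = ["verification", "backtracking", "subgoal_setting", "backward_chaining"]

def count_behaviors(document: str) -> dict:
    """Placeholder for a judge-model call (Qwen-2.5-32B in the paper) returning
    per-behavior occurrence counts; see the classifier sketch under
    'Behavioral Analysis Pipeline' below for one way such a call might look."""
    raise NotImplementedError

def rewrite_to_qta(document: str, keep_behaviors: bool) -> str:
    """Placeholder for a rewriting call that turns a raw OpenWebMath document into
    question / thought / answer form, either preserving or stripping the behaviors."""
    raise NotImplementedError

def build_pretraining_sets(openwebmath_docs, token_budget=8_300_000):
    """Assemble a behavior-enriched set and a behavior-minimized control set."""
    enriched, control, enriched_tokens = [], [], 0
    for doc in openwebmath_docs:
        counts = count_behaviors(doc)
        rich = sum(counts.get(b, 0) for b in BEHAVIORS) > 0
        if rich and enriched_tokens < token_budget:
            text = rewrite_to_qta(doc, keep_behaviors=True)
            enriched.append(text)
            enriched_tokens += len(text.split())  # crude whitespace proxy for tokens
        elif not rich:
            control.append(rewrite_to_qta(doc, keep_behaviors=False))
    return enriched, control
```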

Behavioral Analysis Pipeline:

A key part of the methodology is the automated analysis of model outputs for the four cognitive behaviors. This was done with a classification pipeline driven by a capable LLM (GPT-4o-mini or Qwen-2.5-32B): the classifier was prompted with definitions and examples of each behavior and asked to count occurrences in a given reasoning trace. This pipeline was used to analyze base-model tendencies, track behavior emergence during RL, and curate pretraining data. See Appendices D and E for details.
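
A minimal sketch of one such classifier call, using GPT-4o-mini as the judge; the actual prompts and few-shot examples are given in Appendix D and will differ from this simplified version.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the paper also used Qwen-2.5-32B as a judge

BEHAVIOR_DEFINITIONS = """\
verification: the solver explicitly checks an intermediate or final result.
backtracking: the solver abandons a failing approach and tries an alternative.
subgoal_setting: the solver breaks the problem into smaller intermediate steps.
backward_chaining: the solver reasons from the target back toward the inputs."""

def count_behaviors(trace: str) -> dict:
    """Ask the judge model to count occurrences of each behavior in one reasoning trace."""
    prompt = (
        "You are given definitions of four reasoning behaviors:\n"
        f"{BEHAVIOR_DEFINITIONS}\n\n"
        "Count how many times each behavior occurs in the reasoning trace below. "
        "Respond with a JSON object mapping behavior names to integer counts.\n\n"
        f"Reasoning trace:\n{trace}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```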

Practical Implications:

  • Model Selection/Development: When selecting models for tasks requiring complex reasoning and potential self-improvement via RL, assess their inherent tendency to exhibit behaviors like backtracking and verification. Models lacking these may struggle to benefit from RL fine-tuning for reasoning.
  • Inducing Behaviors: If a base model lacks desired reasoning behaviors, they can be induced through:
    • Targeted SFT: Fine-tune the model on a small dataset of examples explicitly demonstrating the desired patterns (e.g., using synthetic data generated by a more capable model). Even incorrect examples can work if they show the right behaviors.
    • Pretraining Data Curation: Enrich the pretraining or continued pretraining data with documents exhibiting the target cognitive behaviors. This might offer better generalization than task-specific SFT.
  • RL for Reasoning: The success of RL methods such as PPO or GRPO for improving reasoning depends heavily on the initial policy's ability to explore relevant behavioral patterns. The RL process acts more as an amplifier of existing (even if infrequent) useful behaviors than as a mechanism that discovers them from scratch.
  • Evaluation: Evaluating models solely on final answer correctness might miss crucial differences in their underlying reasoning processes and potential for future improvement. Analyzing the structure and behaviors within the reasoning trace provides deeper insights.

In conclusion, this work establishes a causal link between specific initial cognitive behaviors in LMs and their ability to self-improve on reasoning tasks via RL. It provides practical strategies—targeted SFT and pretraining data curation—for engineering these behaviors into models, thereby enhancing their capacity to effectively utilize computational resources for complex problem-solving.
