On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (2502.06773v1)

Published 10 Feb 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality outputs. We aim to uncover the algorithmic framework for training LRMs. Methods like self-consistency, PRM, and AlphaZero suggest reasoning as guided search. We ask: what is the simplest, most scalable way to enable search in LLMs? We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple exploration and correctness signals during PPO training, carefully balancing them to improve performance and efficiency. Empirical studies in the math domain show that RLSP improves reasoning. On the Llama-3.1-8B-Instruct model, RLSP can boost performance by 23% in MATH-500 test set; On AIME 2024 math problems, Qwen2.5-32B-Instruct improved by 10% due to RLSP. However, a more important finding of this work is that the models trained using RLSP, even with the simplest exploration reward that encourages the model to take more intermediate steps, showed several emergent behaviors such as backtracking, exploration of ideas, and verification. These findings demonstrate that RLSP framework might be enough to enable emergence of complex reasoning abilities in LLMs when scaled. Lastly, we propose a theory as to why RLSP search strategy is more suitable for LLMs inspired by a remarkable result that says CoT provably increases computational power of LLMs, which grows as the number of steps in CoT \cite{li2024chain,merrill2023expresssive}.

Authors (8)
  1. Guanghao Ye (9 papers)
  2. Khiem Duc Pham (1 paper)
  3. Xinzhi Zhang (12 papers)
  4. Sivakanth Gopi (37 papers)
  5. Baolin Peng (72 papers)
  6. Beibin Li (16 papers)
  7. Janardhan Kulkarni (52 papers)
  8. Huseyin A. Inan (23 papers)

Summary

Reinforcement Learning via Self-Play: Emergence of Thinking in LLMs

The paper "On the Emergence of Thinking in LLMs I: Searching for the Right Intuition" proposes a novel approach to enhance the reasoning capabilities of LLMs by transforming them into Large Reasoning Models (LRMs). The authors introduce a framework known as Reinforcement Learning via Self-Play (RLSP), which builds upon the conceptual foundation that reasoning is effectively a form of guided search. This research marks an advancement in AI by integrating structured reasoning processes into LLMs during their inference phase to achieve better quality outputs, thereby enabling these models to perform high-level cognitive tasks akin to "thinking."

Key Approaches and Contributions

The RLSP framework consists of three steps:

  1. Supervised Fine-Tuning (SFT): Base models are first refined on high-quality demonstrations of the reasoning process, either human-annotated or synthetically generated (e.g., via tree search), equipping them with elementary reasoning templates.
  2. Exploration Reward in Reinforcement Learning: The authors propose an exploration reward model that is distinct from correctness. It penalizes brevity and encourages models to undertake varied reasoning trajectories, exploring alternative solutions or comprehensive intermediate steps. This signal is crucial in guiding models to develop emergent strategic thinking behaviors like self-correction, backtracking, and verification, even with a simple length-based penalty.
  3. Outcome Verification with PPO Training: Using Proximal Policy Optimization (PPO), the framework combines the exploration reward with a binary outcome signal from a verifier, so that responses are not only exploratory but also grounded in correctness; a minimal sketch of this decoupled reward follows the list.
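
The following Python sketch illustrates how such a decoupled reward could be wired together. It is a minimal illustration, not the paper's implementation: the function names, the simple length-based bonus, the string-matching stand-in for a verifier, and the weighting coefficient alpha are all assumptions made here for concreteness.

    # Minimal sketch of a decoupled RLSP-style reward (illustrative only).
    # `exploration_bonus`, `outcome_reward`, `rlsp_reward`, and `alpha` are
    # hypothetical names; in the paper these signals feed into PPO training.

    def exploration_bonus(response: str, cap_tokens: int = 1024) -> float:
        """Length-based exploration signal: more intermediate steps earn a
        larger bonus, capped so the model cannot inflate it without bound."""
        num_tokens = len(response.split())  # crude whitespace token proxy
        return min(num_tokens, cap_tokens) / cap_tokens

    def outcome_reward(response: str, reference_answer: str) -> float:
        """Binary correctness signal from an outcome verifier. A naive
        final-line string match stands in for a real math verifier."""
        lines = [line for line in response.strip().splitlines() if line.strip()]
        final_line = lines[-1] if lines else ""
        return 1.0 if reference_answer.strip() in final_line else 0.0

    def rlsp_reward(response: str, reference_answer: str, alpha: float = 0.2) -> float:
        """Decoupled combination: correctness dominates and the exploration
        bonus is bounded, so longer reasoning is encouraged but cannot
        substitute for a correct answer (guarding against reward hacking)."""
        return outcome_reward(response, reference_answer) + alpha * exploration_bonus(response)

    if __name__ == "__main__":
        trace = "Try factoring.\nBacktrack: use the quadratic formula instead.\nAnswer: 7"
        print(rlsp_reward(trace, "7"))  # 1.0 for correctness plus a small exploration bonus

Keeping the two terms separate, rather than folding length into the verifier score, is what allows the exploration weight to be tuned independently of correctness.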

Empirical Evaluation and Results

The authors evaluated the RLSP framework across several model sizes, documenting notable improvements on mathematical reasoning tasks:

  • Performance Gains: Applying RLSP to Llama-3.1-8B-Instruct yielded a 23% improvement on the MATH-500 test set, and Qwen2.5-32B-Instruct gained 10% on AIME 2024 problems, demonstrating the framework's robustness across different datasets and model architectures.
  • Emergent Behaviors: Training models with RLSP, specifically under simple exploration rewards, led to spontaneous development of complex problem-solving behaviors such as backtracking, exploration of multiple possibilities, and verification processes. These are indicative of enhanced reasoning strategies not typically present in baseline models.
  • Efficiency in Resource Utilization: The paper highlights that RLSP-trained models, operating within a fixed token budget, outperform self-consistency approaches that generate and vote over multiple completion samples (a minimal sketch of that baseline follows this list). This efficiency underscores the framework's scalability and potential for wider deployment without excessive computational overhead.
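
For contrast, here is a minimal sketch of the self-consistency baseline referenced above. `sample_fn` is a hypothetical wrapper around the model's sampling call, and k = 16 is an arbitrary illustrative choice.

    # Self-consistency baseline (illustrative): draw k independent chains of
    # thought and majority-vote over their final answers. This costs roughly
    # k full generations, whereas an RLSP-trained model spends a comparable
    # token budget on a single, longer reasoning trace.
    from collections import Counter
    from typing import Callable, List

    def self_consistency(sample_fn: Callable[[], str], k: int = 16) -> str:
        answers: List[str] = [sample_fn() for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]  # most frequent final answer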

Theoretical Implications and Future Directions

The RLSP framework posits that generating synthetic reasoning trajectories can enhance learning in LLMs, providing an opportunity for continual improvement through self-play. This aligns with recent results showing that chain-of-thought (CoT) provably increases the computational power of transformers, and that this power grows with the number of CoT steps. By promoting exploration and longer reasoning chains, RLSP capitalizes on this principle, potentially paving the way for LLMs to autonomously generate high-quality CoT data.
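
As a rough, hedged paraphrase of the cited expressivity results (writing CoT[t(n)] here for the languages decidable by log-precision decoder-only transformers allowed t(n) intermediate decoding steps on inputs of length n; the exact statements and uniformity assumptions are in li2024chain and merrill2023expresssive):

    \[
      \mathrm{CoT}[0] \;\subseteq\; \mathsf{TC}^0
      \qquad\text{while}\qquad
      \mathrm{CoT}[\mathrm{poly}(n)] \;=\; \mathsf{P},
    \]

so each additional chain-of-thought step buys additional serial computation, which is the regime that RLSP's exploration reward pushes models toward.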

This paper also raises intriguing questions for future research directions, including the scalability of RLSP in even larger models and unexplored domains beyond mathematics and structured reasoning. Moreover, a deeper exploration into the theoretical underpinnings of why certain reasoning behaviors emerge distinctly only under RLSP could further inform improvements in model training paradigms.

In conclusion, the research provides a structured pathway toward imbuing LLMs with enhanced reasoning capabilities, bridging a crucial gap between static model outputs and dynamic cognitive processes emblematic of human reasoning. The RLSP framework not only promises practical improvements in specific domains like mathematics but also lays foundational work for broader applications in decision-making and strategic planning tasks within Artificial Intelligence research.
