
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (2502.02508v3)

Published 4 Feb 2025 in cs.CL and cs.AI

Abstract: LLMs have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models are fully open-sourced.

Summary

  • The paper introduces Satori, a 7B LLM that employs reinforcement learning with a novel Chain-of-Action-Thought mechanism to enhance reasoning.
  • It uses a two-stage training paradigm combining format tuning via imitation learning with reinforcement learning using a restart and explore strategy.
  • Satori achieves state-of-the-art performance on math benchmarks and shows strong transferability to out-of-domain tasks compared to baseline models.

The paper introduces Satori, a 7B LLM trained with reinforcement learning to enhance reasoning via autoregressive search. The core innovation is the Chain-of-Action-Thought (COAT) mechanism, enabling the LLM to take meta-actions during problem-solving. The approach involves a two-stage training paradigm: format tuning and self-improvement.
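
To make the COAT format concrete, the sketch below shows what such a trajectory might look like: ordinary reasoning text interleaved with meta-action markers. The specific token strings (`<|continue|>`, `<|reflect|>`, `<|explore|>`) are illustrative assumptions standing in for the paper's continue, reflect, and explore meta-actions; only the general structure is taken from the paper.

```python
# Illustrative COAT-style trajectory (token names are assumptions, not the
# paper's exact tokens): reasoning text interleaved with meta-action markers.
CONTINUE = "<|continue|>"  # keep extending the current reasoning chain
REFLECT = "<|reflect|>"    # pause and verify the preceding steps
EXPLORE = "<|explore|>"    # abandon the current path and try a different strategy

coat_trajectory = (
    "Problem: If 3x + 5 = 20, what is x?\n"
    f"{CONTINUE} Subtract 5 from both sides: 3x = 15.\n"
    f"{REFLECT} Check: adding 5 back gives 3x + 5 = 20, so the step is consistent.\n"
    f"{CONTINUE} Divide both sides by 3: x = 5.\n"
    "Final answer: 5"
)
print(coat_trajectory)
```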

The first stage, format tuning, internalizes the COAT reasoning format using a small-scale dataset. The second stage uses reinforcement learning for large-scale self-improvement. This approach leads to Satori, which achieves strong performance on mathematical reasoning benchmarks and exhibits generalization to out-of-domain tasks.

The contributions of the paper are:

  • Efficiency: Satori is a single LLM capable of autoregressive search without external guidance.
  • Effectiveness: Satori shows strong performance on mathematical reasoning tasks.
  • Generalizability: Satori exhibits strong transferability to out-of-domain tasks.

The Satori training framework consists of two stages: format tuning and self-improvement. Format tuning trains the LLM to emulate expert COAT trajectories through imitation learning. To synthesize these trajectories, a multi-agent data synthesis framework is proposed that leverages three LLMs: a generator, a critic, and a reward model. The simplest imitation learning approach, behavior cloning, is adopted: the LLM policy is supervised fine-tuned on the expert COAT demonstration trajectories, as sketched below.
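
Behavior cloning here amounts to standard supervised fine-tuning on the synthesized COAT demonstrations. The following is a minimal sketch; the base checkpoint, dataset schema, and hyperparameters are assumptions made for illustration, not the paper's training code.

```python
# Minimal behavior-cloning (SFT) sketch on synthesized COAT demonstrations.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"  # assumed Hugging Face id of the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Tiny in-memory stand-in for the demonstrations produced by the
# generator / critic / reward-model synthesis pipeline (field name assumed).
coat_demonstrations = [
    {"coat_trajectory": "Problem: If 3x + 5 = 20, what is x?\n"
                        "<|continue|> Subtract 5: 3x = 15.\n"
                        "<|reflect|> Check: adding 5 back gives 20.\n"
                        "<|continue|> Divide by 3: x = 5.\nFinal answer: 5"},
]

def collate(batch):
    texts = [ex["coat_trajectory"] for ex in batch]
    enc = tokenizer(texts, padding=True, truncation=True, max_length=4096,
                    return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels                     # standard causal-LM objective
    return enc

loader = DataLoader(coat_demonstrations, batch_size=1, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss  # cross-entropy over the expert COAT tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```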

The second stage is self-improvement via reinforcement learning. The format-tuned LLM is trained with Proximal Policy Optimization (PPO), a widely used RL method, together with a "restart and explore" (RAE) strategy inspired by Go-Explore: the model restarts from intermediate steps, including points where previous reasoning attempts failed, so that it focuses on correcting errors rather than always starting from scratch. Exploration bonuses are added to encourage reflection.
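
One way to read the RAE strategy in code: restart states are built from the original problem plus prefixes of earlier (possibly failed) attempts, and a small bonus rewards trajectories that actually reflect. The helpers below are a hypothetical sketch under those assumptions, not the paper's implementation.

```python
import random

def build_restart_states(problem: str, past_attempts: list[str],
                         max_prefix_steps: int = 3) -> list[str]:
    """Build RAE-style initial states: the original problem plus truncated
    prefixes of earlier (possibly failed) attempts, so a rollout can resume
    from an intermediate step and correct it rather than start from scratch."""
    states = [problem]  # the plain problem is always a valid restart point
    for attempt in past_attempts:
        steps = attempt.split("\n")
        for k in range(1, min(max_prefix_steps, len(steps)) + 1):
            states.append(problem + "\n" + "\n".join(steps[:k]))
    return states

def reflection_bonus(trajectory: str, reflect_token: str = "<|reflect|>",
                     bonus: float = 0.5) -> float:
    """Small exploration bonus for trajectories that contain a reflection
    step (the token string and bonus value are assumptions)."""
    return bonus if reflect_token in trajectory else 0.0

# Example: sample a restart state for the next PPO rollout.
states = build_restart_states(
    "Problem: If 3x + 5 = 20, what is x?",
    past_attempts=["Add 5: 3x = 25\nDivide by 3: x = 25/3",  # a failed attempt
                   "Subtract 5: 3x = 15"],
)
start_state = random.choice(states)
```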

The overall reward function $r(z, \tilde{y})$ is defined as:

$$r(z, \tilde{y}) = r_{rule}(\tilde{y}_L, y^*) + o\big(r_{\psi}(z, \tilde{y})\big) + r_{bonus}(z, \tilde{y})$$

Where:

  • $z \in D_{restart}$ is the initial state
  • $\tilde{y}$ is the sampled trajectory
  • $r_{rule}$ is the rule-based reward
  • $\tilde{y}_L$ is the final answer
  • $y^*$ is the ground truth
  • $o(r_{\psi}(z, \tilde{y}))$ is the output of the outcome reward model (ORM)
  • $r_{bonus}$ is the reflection bonus
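
Putting the pieces together, the total reward for a rollout sums the rule-based correctness check on the final answer, the ORM score, and the reflection bonus. The sketch below mirrors the formula above; the concrete reward values and function signatures are assumptions for illustration.

```python
def rule_reward(final_answer: str, ground_truth: str) -> float:
    """r_rule: rule-based check of the final answer against the ground truth
    (the exact +1 / -1 values are an assumption)."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else -1.0

def total_reward(z: str, trajectory: str, final_answer: str, ground_truth: str,
                 orm_score_fn, bonus_fn) -> float:
    """r(z, y~) = r_rule(y~_L, y*) + o(r_psi(z, y~)) + r_bonus(z, y~)."""
    r = rule_reward(final_answer, ground_truth)   # rule-based reward
    r += orm_score_fn(z, trajectory)              # outcome reward model (ORM) term
    r += bonus_fn(z, trajectory)                  # reflection / exploration bonus
    return r

# Example usage with stand-in reward components.
reward = total_reward(
    z="Problem: If 3x + 5 = 20, what is x?",
    trajectory="Subtract 5: 3x = 15. <|reflect|> Check ok. Divide by 3: x = 5.",
    final_answer="5",
    ground_truth="5",
    orm_score_fn=lambda z, y: 0.8,                            # placeholder ORM score
    bonus_fn=lambda z, y: 0.5 if "<|reflect|>" in y else 0.0,
)
```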

To mitigate the issue of the policy converging to a local sub-optimum, an iterative self-improvement strategy is proposed. After each round of RL training, the knowledge of the current well-optimized policy is distilled into the base model through supervised fine-tuning (SFT).
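
The resulting training loop alternates RL with distillation back into the base model. The following schematic sketch uses placeholder stubs for each stage; it shows the control flow only, not the paper's implementation.

```python
# Schematic control flow of the iterative self-improvement loop. Every helper
# below is a placeholder stub standing in for the real training stage.

def format_tune(base_model):                 # stage 1: small-scale FT on COAT data
    return base_model

def ppo_train(policy, problems):             # stage 2: large-scale RL (PPO + RAE)
    return policy

def collect_high_reward_rollouts(policy, problems):
    return []                                # keep only high-reward trajectories

def sft(base_model, trajectories):           # distill the optimized policy back
    return base_model                        # into the base model via SFT

def iterative_self_improvement(base_model, problems, num_rounds=2):
    policy = format_tune(base_model)
    for _ in range(num_rounds):
        policy = ppo_train(policy, problems)
        rollouts = collect_high_reward_rollouts(policy, problems)
        policy = sft(base_model, rollouts)   # restart RL from a distilled model,
                                             # helping escape local sub-optima
    return policy
```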

Satori-Qwen-7B requires less supervision (small-scale format tuning) and relies more on self-improvement (large-scale RL). It achieves state-of-the-art performance across five benchmarks and outperforms Qwen-2.5-Math-7B-Instruct, which uses the same base model (Qwen-2.5-Math-7B). Satori-Qwen-7B (Round 2) demonstrates even stronger performance on hard tasks. Satori-Qwen-7B also exhibits strong transferability across out-of-domain benchmarks, outperforming Qwen-2.5-Math-7B-Instruct by a large margin.

An ablation study demonstrates the benefits of COAT reasoning over classical CoT reasoning, along with the effects of RL training:

  • Qwen-7B-CoT underperforms Satori-Qwen-7B and fails to surpass Qwen-2.5-Math-7B-Instruct, suggesting the advantages of COAT reasoning over CoT reasoning.
  • Satori-Qwen demonstrates a stronger self-correction capability than Satori-Qwen-FT, which lacks the RL training stage, and this self-correction capability extends to out-of-domain tasks.
  • With more RL training-time compute, Satori improves policy accuracy and increases the average length of generated responses, and it learns to allocate more test-time compute to more challenging problems.
  • Satori-Qwen outperforms the same base model Qwen-2.5-Math-7B trained with 300K FT data (without RL) across all math and out-of-domain benchmarks.
  • Distilling from a stronger model (Satori-Qwen-7B) into weaker base models (Llama-8B and Granite-8B) is more effective than directly applying format tuning to those weaker base models.
