- The paper introduces Satori, a 7B LLM that employs reinforcement learning with a novel Chain-of-Action-Thought mechanism to enhance reasoning.
- It uses a two-stage training paradigm: format tuning via imitation learning, followed by reinforcement learning with a restart-and-explore strategy.
- Satori achieves state-of-the-art performance on math benchmarks and shows strong transferability to out-of-domain tasks compared to baseline models.
The paper introduces Satori, a 7B LLM trained with reinforcement learning to enhance reasoning via autoregressive search. The core innovation is the Chain-of-Action-Thought (COAT) mechanism, enabling the LLM to take meta-actions during problem-solving. The approach involves a two-stage training paradigm: format tuning and self-improvement.
The first stage, format tuning, internalizes the COAT reasoning format using a small-scale dataset. The second stage uses reinforcement learning for large-scale self-improvement. This approach leads to Satori, which achieves strong performance on mathematical reasoning benchmarks and exhibits generalization to out-of-domain tasks.
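To make the COAT idea concrete, here is a minimal sketch of what a COAT-style trajectory could look like as text; the meta-action marker strings and their names are illustrative assumptions, not the paper's exact vocabulary.

```python
# Illustrative COAT-style trajectory with meta-action markers.
# NOTE: the marker strings below are assumptions for illustration only.
META_ACTIONS = {
    "continue": "<|continue|>",  # keep extending the current reasoning chain
    "reflect": "<|reflect|>",    # pause and verify the preceding steps
    "explore": "<|explore|>",    # abandon the current path and try an alternative
}

def format_coat_step(action: str, thought: str) -> str:
    """Prefix a reasoning segment with its meta-action marker."""
    return f"{META_ACTIONS[action]} {thought}"

trajectory = [
    format_coat_step("continue", "Let x be the unknown and set up the equation."),
    format_coat_step("reflect", "Check the algebra in the previous step."),
    format_coat_step("explore", "That path stalls; try substitution instead."),
]
print("\n".join(trajectory))
```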
The contributions of the paper are:
- Efficiency: Satori is a single LLM capable of autoregressive search without external guidance.
- Effectiveness: Satori shows strong performance on mathematical reasoning tasks.
- Generalizability: Satori exhibits strong transferability to out-of-domain tasks.
The Satori training framework consists of two stages: format tuning and self-improvement. Format tuning trains the LLM to emulate expert COAT trajectories through imitation learning. A multi-agent data synthesis framework is proposed that leverages three LLMs: a generator, a critic, and a reward model. The simplest imitation learning approach, behavior cloning, is adopted: the LLM policy is trained via supervised fine-tuning on the expert COAT demonstration trajectories.
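A minimal behavior-cloning sketch, assuming the demonstrations are stored as plain-text COAT trajectories and using the base model named in this summary; the hyperparameters and data handling are assumptions, not the paper's actual training setup.

```python
# Behavior cloning (SFT) sketch on expert COAT demonstrations.
# The model identifier matches the paper's base model; everything else is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(demonstration: str) -> float:
    """One supervised fine-tuning step: maximize likelihood of the expert trajectory."""
    batch = tokenizer(demonstration, return_tensors="pt", truncation=True)
    outputs = model(**batch, labels=batch["input_ids"])  # causal-LM cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```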
The second stage is self-improvement via reinforcement learning. The format-tuned LLM is trained with Proximal Policy Optimization (PPO), a widely used RL method. A "restart and explore" (RAE) strategy, inspired by Go-Explore, is introduced: the model restarts from intermediate steps, including points where previous reasoning attempts failed, so it can focus on correcting errors rather than starting from scratch. Exploration bonuses are added to encourage reflection.
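The sketch below illustrates the restart idea under stated assumptions: restart states are built by truncating a previous (possibly failed) attempt at intermediate steps, and the policy continues from a sampled prefix. The data layout and the `policy.generate` interface are assumptions, not the paper's implementation.

```python
# "Restart and explore" (RAE) sketch: restart rollouts from intermediate steps
# of earlier attempts instead of always starting from the original problem.
import random

def build_restart_states(problem: str, past_steps: list[str]) -> list[str]:
    """Return prefixes of a previous attempt to restart from, plus the bare problem."""
    states = [problem]  # restarting from scratch remains an option
    for k in range(1, len(past_steps)):  # exclude the full (already finished) trajectory
        states.append(problem + "\n" + "\n".join(past_steps[:k]))
    return states

def sample_rollout(policy, restart_states: list[str]) -> str:
    """Pick a restart state and let the policy continue from it (generate() is an assumed interface)."""
    z = random.choice(restart_states)
    return policy.generate(z)
```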
The overall reward function $r(z, \tilde{y})$ is defined as:

$$
r(z, \tilde{y}) = r_{\text{rule}}(\tilde{y}_L, y^*) + o\big(r_\psi(z, \tilde{y})\big) + r_{\text{bonus}}(z, \tilde{y})
$$

Where:
- $z \in \mathcal{D}_{\text{restart}}$ is the initial (restart) state
- $\tilde{y}$ is the sampled trajectory
- $r_{\text{rule}}$ is the rule-based reward
- $\tilde{y}_L$ is the final answer
- $y^*$ is the ground-truth answer
- $o(r_\psi(z, \tilde{y}))$ is the output of the outcome reward model (ORM)
- $r_{\text{bonus}}$ is the reflection bonus
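A small sketch of how this composite reward could be combined in code; the bounding applied by $o(\cdot)$ and the magnitude of the reflection bonus are assumptions, not values from the paper.

```python
# Composite reward sketch: r(z, y~) = r_rule(y~_L, y*) + o(r_psi(z, y~)) + r_bonus(z, y~).
def total_reward(rule_reward: float, orm_score: float, reflection_bonus: float) -> float:
    """Combine the rule-based reward, a bounded ORM score, and the reflection bonus."""
    o = max(-1.0, min(1.0, orm_score))  # assumed bounding for the ORM output o(.)
    return rule_reward + o + reflection_bonus

# Example: correct final answer (rule reward 1.0), mildly positive ORM score, small reflection bonus.
r = total_reward(rule_reward=1.0, orm_score=0.4, reflection_bonus=0.2)  # -> 1.6
```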
To mitigate the policy's tendency to converge to a suboptimal local solution, an iterative self-improvement strategy is proposed: after each round of RL training, the knowledge of the current well-optimized policy is distilled back into the base model through supervised fine-tuning (SFT).
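A hedged skeleton of this round-based loop; the RL, sampling, and SFT steps are passed in as callables because their concrete implementations are not specified here.

```python
# Iterative self-improvement skeleton: RL-optimize, sample trajectories, distill via SFT, repeat.
from typing import Any, Callable

def iterative_self_improvement(
    base_model: Any,
    problems: list[str],
    run_rl_round: Callable[[Any, list[str]], Any],             # PPO with RAE restarts
    generate_trajectories: Callable[[Any, list[str]], list[str]],
    supervised_finetune: Callable[[Any, list[str]], Any],       # distillation via SFT
    num_rounds: int = 2,
) -> Any:
    policy = base_model
    for _ in range(num_rounds):
        policy = run_rl_round(policy, problems)              # large-scale RL self-improvement
        demos = generate_trajectories(policy, problems)      # sample from the optimized policy
        base_model = supervised_finetune(base_model, demos)  # distill knowledge into the base model
        policy = base_model                                  # the next round starts from the distilled model
    return policy
```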
Satori-Qwen-7B requires less supervision (small-scale format tuning) and relies more on self-improvement (large-scale RL). It achieves state-of-the-art performance across five benchmarks and outperforms Qwen-2.5-Math-7B-Instruct, which uses the same base model (Qwen-2.5-Math-7B). Satori-Qwen-7B (Round 2) demonstrates even stronger performance on hard tasks. Satori-Qwen-7B also exhibits strong transferability across out-of-domain benchmarks, outperforming Qwen-2.5-Math-7B-Instruct by a large margin.
The benefits of COAT reasoning over classical CoT reasoning are demonstrated through an ablation study. The performance of Qwen-7B-CoT is suboptimal compared to Satori-Qwen-7B and fails to surpass Qwen-2.5-Math-7B-Instruct, suggesting the advantage of COAT reasoning over CoT reasoning. Satori-Qwen demonstrates a stronger self-correction capability than Satori-Qwen-FT, which lacks the RL training stage, and this self-correction capability extends to out-of-domain tasks. Through RL training, Satori improves policy accuracy and increases the average length of generated responses as RL training-time compute grows, and it allocates more test-time compute to more challenging problems. Satori-Qwen outperforms the same base model Qwen-2.5-Math-7B trained with 300K format-tuning samples (without RL) across all math and out-of-domain benchmarks. Finally, distilling from a stronger model (Satori-Qwen-7B) into weaker base models (Llama-8B and Granite-8B) is more effective than directly applying format tuning to those weaker base models.