- The paper introduces Satori, a 7B LLM that employs reinforcement learning with a novel Chain-of-Action-Thought mechanism to enhance reasoning.
- It uses a two-stage training paradigm: format tuning via imitation learning, followed by reinforcement learning with a restart-and-explore strategy.
- Satori achieves state-of-the-art performance on math benchmarks and shows strong transferability to out-of-domain tasks compared to baseline models.
The paper introduces Satori, a 7B LLM trained with reinforcement learning to enhance reasoning via autoregressive search. The core innovation is the Chain-of-Action-Thought (COAT) mechanism, enabling the LLM to take meta-actions during problem-solving. The approach involves a two-stage training paradigm: format tuning and self-improvement.
The first stage, format tuning, internalizes the COAT reasoning format using a small-scale dataset. The second stage uses reinforcement learning for large-scale self-improvement. This approach leads to Satori, which achieves strong performance on mathematical reasoning benchmarks and exhibits generalization to out-of-domain tasks.
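To make the COAT idea concrete, here is a minimal sketch of what a COAT-style trajectory could look like as text; the meta-action marker strings and their names are illustrative assumptions, not the paper's exact vocabulary.

```python
# Illustrative COAT-style trajectory with meta-action markers.
# NOTE: the marker strings below are assumptions for illustration only.
META_ACTIONS = {
    "continue": "<|continue|>",  # keep extending the current reasoning chain
    "reflect": "<|reflect|>",    # pause and verify the preceding steps
    "explore": "<|explore|>",    # abandon the current path and try an alternative
}

def format_coat_step(action: str, thought: str) -> str:
    """Prefix a reasoning segment with its meta-action marker."""
    return f"{META_ACTIONS[action]} {thought}"

trajectory = [
    format_coat_step("continue", "Let x be the unknown and set up the equation."),
    format_coat_step("reflect", "Check the algebra in the previous step."),
    format_coat_step("explore", "That path stalls; try substitution instead."),
]
print("\n".join(trajectory))
```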
The contributions of the paper are:
- Efficiency: Satori is a single LLM capable of autoregressive search without external guidance.
- Effectiveness: Satori shows strong performance on mathematical reasoning tasks.
- Generalizability: Satori exhibits strong transferability to out-of-domain tasks.
The Satori training framework consists of two stages: format tuning and self-improvement. Format tuning trains the LLM to emulate expert COAT trajectories through imitation learning. A multi-agent data synthesis framework is proposed that leverages three LLMs: a generator, a critic, and a reward model. The simplest imitation learning approach, behavior cloning, is adopted: the LLM policy is trained via supervised fine-tuning on the expert COAT demonstration trajectories.
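A minimal behavior-cloning sketch, assuming the demonstrations are stored as plain-text COAT trajectories and using the base model named in this summary; the hyperparameters and data handling are assumptions, not the paper's actual training setup.

```python
# Behavior cloning (SFT) sketch on expert COAT demonstrations.
# The model identifier matches the paper's base model; everything else is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(demonstration: str) -> float:
    """One supervised fine-tuning step: maximize likelihood of the expert trajectory."""
    batch = tokenizer(demonstration, return_tensors="pt", truncation=True)
    outputs = model(**batch, labels=batch["input_ids"])  # causal-LM cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```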
The second stage is self-improvement via reinforcement learning. The format-tuned LLM is trained with Proximal Policy Optimization (PPO), a widely used RL method. A "restart and explore" (RAE) strategy, inspired by Go-Explore, is introduced: the model restarts from intermediate steps, including points where previous reasoning attempts failed, so it can focus on correcting errors rather than starting from scratch. Exploration bonuses are added to encourage reflection.
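The sketch below illustrates the restart idea under stated assumptions: restart states are built by truncating a previous (possibly failed) attempt at intermediate steps, and the policy continues from a sampled prefix. The data layout and the `policy.generate` interface are assumptions, not the paper's implementation.

```python
# "Restart and explore" (RAE) sketch: restart rollouts from intermediate steps
# of earlier attempts instead of always starting from the original problem.
import random

def build_restart_states(problem: str, past_steps: list[str]) -> list[str]:
    """Return prefixes of a previous attempt to restart from, plus the bare problem."""
    states = [problem]  # restarting from scratch remains an option
    for k in range(1, len(past_steps)):  # exclude the full (already finished) trajectory
        states.append(problem + "\n" + "\n".join(past_steps[:k]))
    return states

def sample_rollout(policy, restart_states: list[str]) -> str:
    """Pick a restart state and let the policy continue from it (generate() is an assumed interface)."""
    z = random.choice(restart_states)
    return policy.generate(z)
```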
The overall reward function $r(z, \tilde{y})$ is defined as:

$$
r(z, \tilde{y}) = r_{\text{rule}}(\tilde{y}_L, y^*) + o\big(r_\psi(z, \tilde{y})\big) + r_{\text{bonus}}(z, \tilde{y})
$$

Where:
- $z \in \mathcal{D}_{\text{restart}}$ is the initial (restart) state
- $\tilde{y}$ is the sampled trajectory
- $r_{\text{rule}}$ is the rule-based reward
- $\tilde{y}_L$ is the final answer
- $y^*$ is the ground-truth answer
- $o(r_\psi(z, \tilde{y}))$ is the output of the outcome reward model (ORM)
- $r_{\text{bonus}}$ is the reflection bonus
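A small sketch of how this composite reward could be combined in code; the bounding applied by $o(\cdot)$ and the magnitude of the reflection bonus are assumptions, not values from the paper.

```python
# Composite reward sketch: r(z, y~) = r_rule(y~_L, y*) + o(r_psi(z, y~)) + r_bonus(z, y~).
def total_reward(rule_reward: float, orm_score: float, reflection_bonus: float) -> float:
    """Combine the rule-based reward, a bounded ORM score, and the reflection bonus."""
    o = max(-1.0, min(1.0, orm_score))  # assumed bounding for the ORM output o(.)
    return rule_reward + o + reflection_bonus

# Example: correct final answer (rule reward 1.0), mildly positive ORM score, small reflection bonus.
r = total_reward(rule_reward=1.0, orm_score=0.4, reflection_bonus=0.2)  # -> 1.6
```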
To mitigate the policy's tendency to converge to a suboptimal local solution, an iterative self-improvement strategy is proposed: after each round of RL training, the knowledge of the current well-optimized policy is distilled back into the base model through supervised fine-tuning (SFT).
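A hedged skeleton of this round-based loop; the RL, sampling, and SFT steps are passed in as callables because their concrete implementations are not specified here.

```python
# Iterative self-improvement skeleton: RL-optimize, sample trajectories, distill via SFT, repeat.
from typing import Any, Callable

def iterative_self_improvement(
    base_model: Any,
    problems: list[str],
    run_rl_round: Callable[[Any, list[str]], Any],             # PPO with RAE restarts
    generate_trajectories: Callable[[Any, list[str]], list[str]],
    supervised_finetune: Callable[[Any, list[str]], Any],       # distillation via SFT
    num_rounds: int = 2,
) -> Any:
    policy = base_model
    for _ in range(num_rounds):
        policy = run_rl_round(policy, problems)              # large-scale RL self-improvement
        demos = generate_trajectories(policy, problems)      # sample from the optimized policy
        base_model = supervised_finetune(base_model, demos)  # distill knowledge into the base model
        policy = base_model                                  # the next round starts from the distilled model
    return policy
```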
Satori-Qwen-7B requires less supervision (small-scale format tuning) and relies more on self-improvement (large-scale RL). It achieves state-of-the-art performance across five benchmarks and outperforms Qwen-2.5-Math-7B-Instruct, which uses the same base model (Qwen-2.5-Math-7B). Satori-Qwen-7B (Round 2) demonstrates even stronger performance on hard tasks. Satori-Qwen-7B also exhibits strong transferability across out-of-domain benchmarks, outperforming Qwen-2.5-Math-7B-Instruct by a large margin.
The benefits of COAT reasoning over classical CoT reasoning are demonstrated through an ablation study. The performance of Qwen-7B-CoT is suboptimal compared to Satori-Qwen-7B and fails to surpass Qwen-2.5-Math-7B-Instruct, suggesting the advantage of COAT reasoning over CoT reasoning. Satori-Qwen demonstrates a stronger self-correction capability than Satori-Qwen-FT, which lacks the RL training stage, and this self-correction capability extends to out-of-domain tasks. Through RL training, Satori improves policy accuracy and increases the average length of generated responses as RL training-time compute grows, and it allocates more test-time compute to more challenging problems. Satori-Qwen outperforms the same base model Qwen-2.5-Math-7B trained with 300K format-tuning samples (without RL) across all math and out-of-domain benchmarks. Finally, distilling from a stronger model (Satori-Qwen-7B) into weaker base models (Llama-8B and Granite-8B) is more effective than directly applying format tuning to those weaker base models.