
Satori-Qwen-7B: Novel Reasoning LLM

Updated 25 November 2025
  • Satori-Qwen-7B is a 7-billion-parameter LLM that augments a Qwen-2.5-Math-7B backbone with meta-action tokens, enabling discrete continuation, reflection, and exploration actions during autoregressive generation.
  • It employs the innovative Chain-of-Action-Thought (COAT) framework, interweaving stepwise reasoning with localized verification to overcome limitations of conventional chain-of-thought methods.
  • A two-stage training process—first format tuning with expert COAT demonstrations, then reinforcement learning with PPO—yields state-of-the-art performance on both math benchmarks and diverse out-of-domain tasks.

Satori-Qwen-7B is a 7-billion parameter LLM employing a specialized architecture and post-training regime that emphasizes internalized autoregressive search and self-reflection. Developed as part of the Satori initiative, this model combines the Qwen-2.5-Math-7B Transformer backbone with the Chain-of-Action-Thought (COAT) reasoning framework, leveraging both imitation and reinforcement learning to surpass prior methods in mathematical reasoning and generalize to out-of-domain (OOD) benchmarks. The approach distinctly integrates reasoning, verification, and exploration meta-actions within a single LLM without recourse to external verifiers or multi-agent ensembles, achieving state-of-the-art (SOTA) results across diverse tasks (Shen et al., 4 Feb 2025).

1. Model Architecture and Vocabulary Extensions

Satori-Qwen-7B is based on Qwen-2.5-Math-7B, a decoder-only Transformer architecture with 28 layers, 28 attention heads, and a context window of 4,096 tokens. The core model parameters and attention mechanisms remain unchanged relative to the base model. The distinguishing aspect of Satori-Qwen-7B is the augmentation of its vocabulary with three meta-action tokens: <|continue|>, <|reflect|>, and <|explore|>. These tokens enable discrete operational modes during autoregressive generation, corresponding respectively to continuation of the reasoning trajectory, initiation of stepwise reflection, and exploration of alternative solutions. No further modifications to the transformer blocks or attention patterns are present; all task-specific behavior is mediated by policy fine-tuning and these action tokens (Shen et al., 4 Feb 2025).
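
A minimal sketch of this vocabulary extension, assuming the Hugging Face transformers API with the public Qwen/Qwen2.5-Math-7B checkpoint as the backbone; this is illustrative, not the authors' training code.

```python
# Illustrative sketch: extending the backbone tokenizer/model with the three
# COAT meta-action tokens. This mirrors the described vocabulary augmentation,
# not the authors' exact implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Math-7B"  # assumed backbone checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

meta_actions = ["<|continue|>", "<|reflect|>", "<|explore|>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": meta_actions})
model.resize_token_embeddings(len(tokenizer))  # allocate embeddings for the new tokens

print(f"Added {num_added} meta-action tokens; vocabulary size is now {len(tokenizer)}")
```

Resizing the embedding matrix gives the three new tokens trainable embeddings, which the subsequent format-tuning stage then shapes.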

2. Chain-of-Action-Thought (COAT) Reasoning

The COAT reasoning format constitutes an extension of traditional chain-of-thought (CoT) prompting strategies. Unlike linear CoT, which only supports progressive stepwise reasoning, COAT introduces the explicit ability to interweave continuation (<|continue|>), localized verification via reflection (<|reflect|>), and dynamic branching (<|explore|>) into the reasoning process. This permits the model to internalize both the search and validation that would otherwise require multi-agent scaffolding or external verifiers.

A typical COAT-annotated reasoning sequence can be illustrated as follows:

  • Prompt: "Solve 2+3+8 step by step. Insert <|reflect|> when checking, <|explore|> for new attempts."
  • Model Output:
    • Step 1 (<|continue|>): First, 2 + 3 = 5.
    • Step 2 (<|reflect|>): Let me verify that 5 is correct.
    • Step 3 (<|continue|>): Next, 5 + 8 = 13.
    • Final: Therefore, the answer is 13.

Fundamentally, COAT allows the model to autonomously decide when to inspect intermediate steps or to backtrack and retry alternate approaches, features that canonical CoT models lack (Shen et al., 4 Feb 2025).
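
To make the format concrete, the sketch below splits a COAT-style generation into (meta-action, text) segments; the `parse_coat` helper is hypothetical and intended only to illustrate how the meta-action tokens structure a trajectory.

```python
# Hypothetical helper: split a COAT-formatted generation into (meta-action, text) pairs.
import re

META_ACTIONS = ("<|continue|>", "<|reflect|>", "<|explore|>")
_PATTERN = re.compile(r"(<\|continue\|>|<\|reflect\|>|<\|explore\|>)")

def parse_coat(trajectory: str) -> list[tuple[str, str]]:
    parts = _PATTERN.split(trajectory)
    segments, action = [], "<|continue|>"  # text before the first token defaults to continuation
    for part in parts:
        if part in META_ACTIONS:
            action = part
        elif part.strip():
            segments.append((action, part.strip()))
    return segments

example = ("<|continue|>First, 2 + 3 = 5."
           "<|reflect|>Let me verify that 5 is correct."
           "<|continue|>Next, 5 + 8 = 13.")
print(parse_coat(example))
# [('<|continue|>', 'First, 2 + 3 = 5.'),
#  ('<|reflect|>', 'Let me verify that 5 is correct.'),
#  ('<|continue|>', 'Next, 5 + 8 = 13.')]
```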

3. Two-Stage Training Paradigm

3.1 Format Tuning Stage

Format tuning exposes the model to COAT-annotated demonstrations constructed through a multi-agent pipeline:

  • Generator: Qwen-2.5-Math-Instruct generates initial CoT solutions.
  • Critic: Llama-3.1-70B-Instruct labels errors and verifies solution correctness.
  • Reward Model: Skywork-Reward-Llama-3.1-8B scores and ranks trajectories.

This stage leverages 10,000 expert demonstration trajectories synthesized from a deduplicated, consistency-filtered math corpus (OpenMathInstruct-2 and NuminaMath-CoT, totaling 550,000 raw examples). The model is fine-tuned using cross-entropy loss with a batch size of 128, learning rate (LR) of 2e-5 (cosine schedule), two epochs, and maximum sequence length of 4,096 tokens.
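
A minimal sketch of these format-tuning hyperparameters, expressed as Hugging Face `TrainingArguments`; the per-device batch size, gradient-accumulation split, bf16 setting, and output path are assumptions, while the learning rate, schedule, epochs, and sequence length follow the values above.

```python
# Sketch of the format-tuning (FT) configuration. The 8 x 16 batching split and
# mixed-precision setting are assumptions; LR, schedule, and epochs are as reported.
from transformers import TrainingArguments

MAX_SEQ_LEN = 4096  # passed to the tokenizer / SFT trainer when packing COAT demonstrations

ft_args = TrainingArguments(
    output_dir="satori-format-tuning",   # placeholder output directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,      # 8 * 16 = effective batch size of 128
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=2,
    bf16=True,                           # assumed mixed-precision setting
    logging_steps=10,
)
```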

3.2 Self-Improvement via Reinforcement Learning

The primary driver of Satori-Qwen-7B's capabilities is the large-scale reinforcement learning (RL) phase employing Proximal Policy Optimization (PPO):

  • Restart-and-Explore (RAE): RL operates on both full problems and "restart buffers" (partial trajectory prefixes), each appended with <|reflect|> to incentivize self-reflection.
  • Reward Function:

$$ r(z, y) = r_{\mathrm{rule}}(y_L, y^{*}) + o(y \mid z, y) + r_{\mathrm{bonus}}(z, y) $$

    where:
    • $r_{\mathrm{rule}}(y_L, y^{*})$: +1 for a correct final answer, 0 otherwise.
    • $o(\cdot)$: preference reward from a Bradley–Terry outcome reward model trained on 300,000 trajectory pairs (margin $T = 2$).
    • $r_{\mathrm{bonus}}(z, y)$: reflection bonus of $+\beta$ ($\beta = 0.5$) for successful self-corrections, $-\beta$ for unnecessary revisions.

    An illustrative sketch of this reward decomposition follows the hyperparameter list below.

  • PPO Hyperparameters:
    • Actor LR: 2e-7, Critic LR: 5e-6, KL coefficient: 0.0
    • Batch size: 128, Rollout steps: 1,024, Sequence length: 2,048
    • Sampling temperature: 0.6, Reflection bonus: 0.5
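
The following illustrative sketch combines the three reward terms exactly as in the formula above; the correctness flag, ORM score, and reflection flags are hypothetical inputs standing in for the rule-based check, the Bradley–Terry reward model, and the RAE bookkeeping.

```python
# Illustrative sketch of the reward r(z, y) = r_rule + o + r_bonus described above.
# Only the way the terms are combined follows the reported formula (beta = 0.5);
# the inputs are hypothetical placeholders for the paper's verifier, ORM, and RAE flags.
BETA = 0.5  # reflection bonus weight

def coat_reward(answer_is_correct: bool,
                orm_preference_score: float,
                successful_correction: bool,
                unnecessary_revision: bool) -> float:
    r_rule = 1.0 if answer_is_correct else 0.0   # +1 for a correct final answer, else 0
    r_bonus = 0.0
    if successful_correction:
        r_bonus += BETA                          # reward reflections that fix an earlier error
    if unnecessary_revision:
        r_bonus -= BETA                          # penalize revising an already-correct step
    return r_rule + orm_preference_score + r_bonus

# Example: correct answer reached after a successful self-correction.
print(coat_reward(True, 0.3, successful_correction=True, unnecessary_revision=False))  # 1.8
```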

After each RL round, the improved policy is distilled into the base model via supervised fine-tuning (SFT) on 180,000 top trajectories, followed by a second PPO iteration (temperature = 1.2, 8 samples per prompt) to maximize gains (Shen et al., 4 Feb 2025).
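
As a rough illustration of the round-2 sampling setup, the sketch below draws 8 trajectories per prompt at temperature 1.2 and keeps those with correct final answers for the distillation SFT set; `is_correct` is a hypothetical answer checker, and the model and tokenizer are assumed to be loaded as in the earlier sketch.

```python
# Sketch of round-2 trajectory collection for distillation: sample 8 COAT trajectories
# per prompt at temperature 1.2 and keep the correct ones. `is_correct` is hypothetical.
from typing import Callable, List
from transformers import PreTrainedModel, PreTrainedTokenizerBase

def collect_trajectories(model: PreTrainedModel,
                         tokenizer: PreTrainedTokenizerBase,
                         prompt: str,
                         gold_answer: str,
                         is_correct: Callable[[str, str], bool],
                         n_samples: int = 8) -> List[str]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs,
                             do_sample=True,
                             temperature=1.2,
                             num_return_sequences=n_samples,
                             max_new_tokens=2048)
    completions = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:],
                                         skip_special_tokens=False)
    return [c for c in completions if is_correct(c, gold_answer)]  # keep for SFT distillation
```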

4. Training Data and Optimization Details

The Satori-Qwen-7B training regime utilizes both scripted and synthetic data sources:

  • Format Tuning (FT) Data: 10,000 COAT demonstration trajectories from synthetic dataset D_syn.
  • RL Data: 300,000 problems with associated restart buffers sampled from D_syn.
  • Data Processing: Raw mathematical problems (550,000) were deduplicated and subject to consistency filtering. Sources include OpenMathInstruct-2 and NuminaMath-CoT.
  • Optimization Hyperparameters: Detailed in the following table.
Stage         Batch Size   LR     Epochs   Max Seq Len   Notes
FT            128          2e-5   2        4096          Cosine schedule
ORM           128          2e-6   2        -             Margin T = 2
PPO (Actor)   128          2e-7   -        2048          Temperature 0.6
PPO (Critic)  128          5e-6   -        2048          -

The curriculum incorporates iterative refinement: model checkpoints are periodically distilled and subjected to further PPO optimization.

5. Evaluation on Mathematical and General Benchmarks

Satori-Qwen-7B achieves SOTA or competitive performance on a comprehensive suite of in-domain and OOD reasoning tasks, evaluated using zero-shot pass@1 greedy decoding.
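
A minimal sketch of this evaluation protocol, assuming a Hugging Face checkpoint; the model path and the benchmark-specific answer extraction are placeholders.

```python
# Sketch of zero-shot pass@1 evaluation with greedy decoding. The checkpoint path is a
# placeholder for the released Satori-Qwen-7B weights; final-answer extraction and
# matching are benchmark-specific and not shown.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/Satori-Qwen-7B"  # placeholder checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto")

@torch.no_grad()
def greedy_solution(problem: str, max_new_tokens: int = 2048) -> str:
    inputs = tokenizer(problem, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)  # greedy
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=False)

# pass@1: one greedy trajectory per problem, scored by exact match on the extracted final answer.
```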

In-domain Math Benchmarks

Benchmark       Satori (%)   Qwen-2.5-Math-Instruct (%)
GSM8K           93.2         95.2
MATH500         85.6         83.6
OlympiadBench   46.6         41.6
AMC2023         67.5         62.5
AIME2024        20.0         16.7

After round-2 RL, average accuracy increases from 62.6% to 64.4%, with AMC2023 reaching 72.5% and AIME2024 achieving 23.3%.

Out-of-Domain (OOD) Generalization

Benchmark       Satori (%)   Qwen-2.5-Math-Instruct (%)
FOLIO           71.4         68.9
BGQA            61.8         51.3
CRUXEval        42.5         28.0
StrategyQA      86.3         85.3
TableBench      43.4         36.2
STEM-MMLUPro    56.7         45.2

In round-2, the OOD average increases to 60.8%.

Quantitative Summary:

  • Math average: +2.7% over same-scale SFT baseline
  • OOD average: +8.2% across logical, code, tabular, and STEM tasks

The principal insight is that post-training with large-scale RL on the COAT format enables the model to internalize search and reflection, which supports both in-domain math SOTA and generalizable zero-shot transfer (Shen et al., 4 Feb 2025).

6. Insights and Significance

The Satori-Qwen-7B approach demonstrates that explicit training in multi-path reasoning, via meta-action tokens and reinforcement learning, enables an LLM to perform internalized autoregressive search, reducing reliance on external components such as verifiers or multi-agent controllers. This design yields measurable gains on both math-specific and broader reasoning tasks. The reflection and exploration action primitives mark a methodological advance over plain CoT strategies, addressing the limitations of linear reasoning pipelines. A plausible implication is that similar training protocols and action-based vocabulary augmentation could benefit other domains that require complex reasoning and error correction (Shen et al., 4 Feb 2025).
