SAGE-32B: Agentic 32B Language Model

Updated 11 May 2026

SAGE-32B is a 32 billion parameter language model focused on agentic reasoning, task decomposition, and iterative error recovery.
It employs a multi-stage iterative distillation process, reflective training, and a meta-cognition head for inverse reasoning and failure forecasting.
Empirical benchmarks show that SAGE-32B achieves superior multi-tool use and error recovery performance compared to similar models.

SAGE-32B is a 32 billion parameter LLM specifically optimized for agentic reasoning, long-range planning, and robust tool-use scenarios. Unlike general conversational LLMs, SAGE-32B is designed to operate within an agentic loop, emphasizing explicit task decomposition, tool invocation, and iterative error recovery. Developed by extending the Qwen2.5-32B decoder-only transformer, SAGE-32B incorporates novel modules, is trained with a multi-stage iterative distillation process, and adds an auxiliary meta-cognition head for inverse reasoning and online failure forecasting. Publicly released under the SAGE-AI initiative, it is evaluated on major agentic reasoning benchmarks, where it is reported to outperform structurally similar and larger models in multi-tool use and error recovery (Jha et al., 4 Jan 2026).

1. Architecture and Initialization

SAGE-32B is constructed by augmenting the Qwen2.5-32B backbone with specialized modules. The total parameter footprint approximates 32 × 10⁹, with the core transformer layers accounting for ≈31 billion, split-embeddings and gating heads for ≈300 million, and the Meta-Cognition Head (MCH) for ≈200 million parameters.
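
As a quick arithmetic check on the figures above, the following sketch sums the three quoted component sizes and reports each share. The dictionary keys are illustrative names, not identifiers from the release, and the counts are the rounded values quoted in the text.

```python
# Approximate component sizes quoted in the text (parameter counts).
# Key names are illustrative, not taken from the released checkpoint.
components = {
    "core_transformer": 31.0e9,
    "split_embeddings_and_gating": 0.3e9,
    "meta_cognition_head": 0.2e9,
}

total = sum(components.values())
shares = {name: count / total for name, count in components.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
print(f"total: {total / 1e9:.1f}B parameters")
```

The quoted components sum to 31.5 billion, which is consistent with the "approximately 32 × 10⁹" figure only after rounding; the added modules account for under 2% of the total.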

Weight initialization is performed via normal sampling: $W \sim \mathcal{N}\left(0, \sigma^2 \right),\quad \sigma = \frac{0.02}{\sqrt{2L}},\;L=64$

Layer normalization employs RMSNorm with $\epsilon=10^{-6}$: $\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma$

Feed-forward networks use SwiGLU gating: $\mathrm{FFN}(x) = \left(\mathrm{Swish}_\beta(xW_G) \odot xW_1\right) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)$

A split-embedding strategy provides context-dependent embeddings: $E_{input}^{(t)} = \alpha_t E_{NL}(x_t) + (1-\alpha_t)E_{Code}(x_t),\quad \alpha_t \in [0,1]$ where $\alpha_t$ is selected by a learned classifier.

For long-context processing, a landmark attention mechanism mixes dense local attention over the last 4,096 tokens with global "landmark" tokens (stride $k=64$), enabling efficient operation up to 128k context length.
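
The formulas above can be illustrated with a minimal pure-Python sketch. It is not the model's implementation: all function names are hypothetical, vectors are plain lists, and the landmark layout (one landmark at every 64th position before the local window) is an assumption made only to show how the attended set stays small at long context.

```python
import math

NUM_LAYERS = 64  # L in the initialization formula


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_sigma(num_layers: int = NUM_LAYERS) -> float:
    """Standard deviation sigma = 0.02 / sqrt(2L)."""
    return 0.02 / math.sqrt(2 * num_layers)


def rms_norm(x, gamma, eps: float = 1e-6):
    """RMSNorm(x) = x / sqrt(mean(x_i^2) + eps), scaled elementwise by gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gamma)]


def swish(z: float, beta: float = 1.0) -> float:
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)


def split_embedding(e_nl, e_code, alpha: float):
    """Blend alpha * E_NL + (1 - alpha) * E_Code for one token."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(e_nl, e_code)]


def visible_positions(query_pos: int, window: int = 4096, stride: int = 64):
    """Positions one query attends to: landmarks plus a dense local window.

    Assumption: landmarks sit at every `stride`-th position that lies
    before the local window.
    """
    local_start = max(0, query_pos - window + 1)
    landmarks = list(range(0, local_start, stride))
    local = list(range(local_start, query_pos + 1))
    return landmarks + local


print(init_sigma())
print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print(len(visible_positions(131071)))
```

Under the assumed layout, the last token of a 131,072-token sequence attends to 4,096 local positions and 1,984 landmarks rather than the full prefix.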

The MCH is an auxiliary attention layer grafted onto the terminal transformer block. It accepts $h_{last}$ and outputs a confidence vector: $c_t = \sigma(W_{MCH} h_{last} + b)$ providing probabilities for stepwise plan failures.
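
A minimal sketch of the confidence computation $c_t = \sigma(W_{MCH} h_{last} + b)$, using plain lists in place of tensors. The function name, the toy dimensions, and the weights are illustrative assumptions; the real head operates on the model's hidden size.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mch_confidence(h_last, w_mch, b):
    """Return c_t = sigmoid(W_MCH h_last + b), one probability per row of W_MCH."""
    confidences = []
    for row, bias in zip(w_mch, b):
        z = sum(w * h for w, h in zip(row, h_last)) + bias
        confidences.append(sigmoid(z))
    return confidences


# Toy example: a 2-dimensional hidden state and two output probabilities.
h_last = [1.0, 2.0]
w_mch = [[0.0, 0.0], [1.0, 1.0]]
b = [0.0, 0.0]
print(mch_confidence(h_last, w_mch, b))
```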

2. Iterative Distillation and Training Paradigm

SAGE-32B adopts a multi-stage Iterative Distillation & Amplification (IDA) regime, integrating both teacher-student distillation and offline self-correction:

2.1 Distillation & Amplification (IDA)

Initial fine-tuning is teacher-driven: a hybrid GPT-4o/DeepSeek model generates 5 million synthetic agentic rollouts in environments such as OSWorld and WebArena. Negative-constraint sampling provides three types of hard negatives (type error, hallucinated key, logic error) for every correct action. The primary loss is cross-entropy on correct/negative discrimination ($L_{KD}$).
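
To make the three negative types concrete, here is a hedged sketch that corrupts one correct tool call in three ways. The call format, the function name `hard_negatives`, and the specific corruption rules are assumptions for illustration; the source does not describe how the negatives are actually generated.

```python
import copy


def hard_negatives(correct_call, arg, wrong_value):
    """Build three corrupted variants of a correct tool call.

    correct_call: {"name": str, "arguments": dict}
    arg:          name of the argument to corrupt
    wrong_value:  a well-typed but incorrect value for `arg`
    """
    value = correct_call["arguments"][arg]

    # Type error: same content, wrong type.
    type_error = copy.deepcopy(correct_call)
    type_error["arguments"][arg] = [value] if isinstance(value, str) else str(value)

    # Hallucinated key: an argument the tool does not define.
    hallucinated_key = copy.deepcopy(correct_call)
    hallucinated_key["arguments"]["nonexistent_option"] = True

    # Logic error: valid schema, wrong value for the task.
    logic_error = copy.deepcopy(correct_call)
    logic_error["arguments"][arg] = wrong_value

    return {
        "type_error": type_error,
        "hallucinated_key": hallucinated_key,
        "logic_error": logic_error,
    }


call = {"name": "search_flights", "arguments": {"max_price": 300}}
negatives = hard_negatives(call, "max_price", 3000)

# Labeled examples for correct/negative discrimination: 1 = correct, 0 = negative.
examples = [(call, 1)] + [(neg, 0) for neg in negatives.values()]
print(len(examples))
```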

2.2 Reflective Distillation

Building on Reflexion and Self-Refine, but applied fully offline, the student collects rollouts, and the teacher critiques and revises the failures. Corrections and critiques are added to the training buffer. The reflective loss is

[equation missing in source]

with learning implemented by DPO gradient steps: [equation missing in source]
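
For orientation, the sketch below computes the standard DPO loss for a single preference pair, $-\log\sigma\big(\beta[(\log\pi_\theta(y_w) - \log\pi_{ref}(y_w)) - (\log\pi_\theta(y_l) - \log\pi_{ref}(y_l))]\big)$. This is the generic DPO objective, not necessarily the exact form used for SAGE-32B. Treating the teacher's revision as the chosen response and the student's failed step as the rejected one is an assumption, as is the value of `beta`.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```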

2.3 DPO Preference Learning for Safety

A DPO-based loss enforces the refusal of unsafe tool calls (e.g., delete_database()), with a penalty coefficient [value missing in source].

2.4 Reinforcement Learning (CodePPO)

Function-calling is cast as program synthesis. Rewards are provided for successful code execution and semantic intent matches; penalties are applied for hallucinated arguments and syntax errors.
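
The reward structure described above can be sketched as a small scoring function. Everything specific here is hypothetical: the function name, the reward magnitudes, and the use of Python's `ast` module to detect syntax errors and inspect keyword arguments. The sketch parses the call rather than executing it, so the execution-success reward is not modeled.

```python
import ast


def tool_call_reward(code, allowed_args, expected_function):
    """Hypothetical shaped reward for one generated tool call.

    -1.0 for code that does not parse as a single function call,
    +1.0 when the called function matches the intended one,
    -0.5 for every keyword argument the tool does not define.
    """
    try:
        tree = ast.parse(code, mode="eval")
    except (SyntaxError, ValueError):
        return -1.0  # syntax error penalty

    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return -1.0  # not a plain tool call

    reward = 0.0
    if call.func.id == expected_function:
        reward += 1.0  # intent match

    used = {kw.arg for kw in call.keywords if kw.arg is not None}
    hallucinated = used - set(allowed_args)
    reward -= 0.5 * len(hallucinated)  # hallucinated-argument penalty
    return reward


schema = {"city", "units"}
print(tool_call_reward("get_weather(city='Paris')", schema, "get_weather"))
print(tool_call_reward("get_weather(city='Paris', mood='sunny')", schema, "get_weather"))
print(tool_call_reward("get_weather(city=", schema, "get_weather"))
```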

2.5 Composite Objective

Across phases, the optimization objective is a blend of loss terms, one of which is the standard next-token cross-entropy [equation missing in source].

3. Meta-Cognition and Inverse Reasoning

The meta-cognition head (MCH) implements inverse reasoning and online failure detection.

3.1 Architecture

The MCH leverages the final transformer hidden state $h_{last}$, with a distinct projection $W_{MCH}$, to produce per-step confidences.

3.2 Failure Forecasting

For each candidate agentic step, the MCH computes

$c_t = \sigma(W_{MCH} h_{last} + b)$

If the forecast failure probability exceeds a threshold, SAGE-32B enters a "Reasoning Mode", invoking further look-ahead and candidate evaluation.

3.3 Inverse Consistency Score (ICS)

ICS quantifies how well reasoning traces can reconstruct the original context [equation missing in source]. A dual-head architecture shares backbone features to estimate this reconstruction likelihood, typically by KL divergence or log-likelihood metrics.

3.4 Hybrid Energy Re-Ranking

Given a set of candidate continuations, SAGE-32B re-ranks them by

[equation missing in source]

4. Agentic Loop: Task Decomposition and Error Recovery

Agentic operation in SAGE-32B proceeds as follows:

Decomposition of complex requests into a DAG of atomic, dependency-annotated subtasks.
Iterative agentic reasoning loop:
  a) Generate a candidate step.
  b) Calculate its failure probability via the MCH.
  c) If the failure probability exceeds the threshold, perform Look-Ahead Simulation (LAS), sampling multiple continuations and ranking them by score.
  d) Execute tool calls and observe results.
On tool errors (e.g., syntax), invoke the reflective policy for in-loop critique and correction ("critique loop").
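
The steps above can be sketched as a short control loop. This is a schematic under stated assumptions, not the released agent: `run_agent` and every callable passed to it (`propose`, `failure_prob`, `lookahead`, `execute`, `repair`) are hypothetical stand-ins for the model, the MCH, LAS, the tool runtime, and the reflective policy, and the threshold and retry limit are arbitrary. Subtask ordering uses the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter


def run_agent(subtasks, propose, failure_prob, lookahead, execute, repair,
              threshold=0.5, max_repairs=2):
    """Run subtasks in dependency order with online checks and repair.

    subtasks: {name: set of prerequisite names} describing the DAG.
    """
    results = {}
    for name in TopologicalSorter(subtasks).static_order():
        step = propose(name, results)            # a) candidate step
        if failure_prob(step) > threshold:       # b) failure forecast
            step = lookahead(name, results)      # c) look-ahead and re-ranking
        outcome = execute(step)                  # d) tool call and observation

        attempts = 0
        while not outcome["ok"] and attempts < max_repairs:
            step = repair(step, outcome["error"])  # critique loop
            outcome = execute(step)
            attempts += 1
        results[name] = outcome
    return results


# Toy stand-ins: "parse" is forecast as risky, "report" fails once.
demo = run_agent(
    subtasks={"fetch": set(), "parse": {"fetch"}, "report": {"parse"}},
    propose=lambda name, done: name,
    failure_prob=lambda step: 0.9 if step == "parse" else 0.1,
    lookahead=lambda name, done: name + "_safe",
    execute=lambda step: {"ok": step != "report", "error": "SyntaxError", "step": step},
    repair=lambda step, error: step + "_fixed",
)
print([outcome["step"] for outcome in demo.values()])
```

Keeping the forecast check before execution and the repair loop after it mirrors the division of labor described in the text: one mechanism tries to avoid errors, the other recovers from them.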

Reflective Distillation instills robust offline failure recognition and recovery while the MCH/LAS act as online verifiers during inference, minimizing cumulative error propagation.

5. Benchmarking and Empirical Performance

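SAGE-32B improves on its backbone on most of the benchmarks below and is competitive with several industry baselines, especially in multi-tool and long-horizon agentic tasks. GPQA is the exception, where it trails the backbone (48.0 vs. 50.5), and GPT-4-Turbo remains ahead on AgentBench and IFEval.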

5.1 Reasoning and Tool Use Benchmarks

Benchmark	Qwen2.5-32B	SAGE-32B (Std)	SAGE-32B (Think, k=32)	Llama-3-70B	GPT-4-Turbo
MMLU-Pro	71.5	75.6	79.3	68.9	63.7
MATH-500	78.9	78.9	91.8*	68.0	72.6
AgentBench	58.4	58.4	73.1	62.1	85.0
GPQA	50.5	48.0	48.0	51.0	53.6
IFEval	81.2	84.5	84.5	78.5	86.0

*Note: “Think” mode utilizes Majority-Vote@32 with ICS filtering, yielding MATH-500 improvements (±0.4% stdev).
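
A minimal sketch of majority voting with a score filter, in the spirit of the "Think" mode described in the note. The function name, the cutoff `min_ics`, the fallback when every sample is filtered out, and the toy scores are all assumptions; the source does not specify how ICS filtering is applied.

```python
from collections import Counter


def majority_vote(samples, min_ics=0.5):
    """Pick the most common answer among samples whose ICS passes the cutoff.

    samples: non-empty list of (answer, ics_score) pairs.
    Falls back to voting over all samples if the filter removes everything.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    kept = [answer for answer, score in samples if score >= min_ics]
    if not kept:
        kept = [answer for answer, _ in samples]
    return Counter(kept).most_common(1)[0][0]


# Toy data: the most frequent answer has low consistency scores.
samples = [("41", 0.2), ("41", 0.3), ("41", 0.1), ("42", 0.9), ("42", 0.8)]
print(majority_vote(samples))               # with filtering
print(majority_vote(samples, min_ics=0.0))  # plain majority vote
```

In this toy case the filter changes the outcome: plain voting selects the more frequent answer, while filtering keeps only the two high-scoring samples.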

5.2 Error Recovery and Efficiency

Internal Recovery Rate (IRR): SAGE-32B reaches 76%, more than double the Qwen2.5-32B base (35%).
AgentBench agentic modes:
- Fast: 58.4% @1.2s (1.0× cost)
- Slow: 73.1% @4.5s (3.8×)
- Hybrid: 71.8% @1.8s (1.4×)
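
The trade-off between the three modes can be worked out directly from the figures above. The sketch uses only the listed numbers; the derived quantities (share of the accuracy gain retained, points gained per unit of extra cost) are illustrative metrics chosen here, not ones reported in the source.

```python
# AgentBench figures as listed above; cost is relative to Fast mode.
modes = {
    "fast":   {"accuracy": 58.4, "latency_s": 1.2, "cost": 1.0},
    "slow":   {"accuracy": 73.1, "latency_s": 4.5, "cost": 3.8},
    "hybrid": {"accuracy": 71.8, "latency_s": 1.8, "cost": 1.4},
}

base = modes["fast"]
full_gain = modes["slow"]["accuracy"] - base["accuracy"]

# Fraction of the Slow-mode accuracy gain that Hybrid retains.
hybrid_gain_share = (modes["hybrid"]["accuracy"] - base["accuracy"]) / full_gain

# Accuracy points gained per unit of extra relative cost.
points_per_cost = {
    name: (m["accuracy"] - base["accuracy"]) / (m["cost"] - base["cost"])
    for name, m in modes.items()
    if name != "fast"
}

# Ratio behind the recovery-rate comparison (76% vs. 35%).
irr_ratio = 76 / 35

print(f"hybrid keeps {hybrid_gain_share:.1%} of the slow-mode gain")
print({name: round(v, 2) for name, v in points_per_cost.items()})
print(f"IRR ratio: {irr_ratio:.2f}x")
```

Hybrid mode retains roughly 91% of the accuracy gain of Slow mode at 1.4× rather than 3.8× the cost of Fast mode.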

5.3 Tool-Calling (Enterprise-500) Suite

Model	Success Rate	Unforced Errors	Cost/1k eps
GPT-4-Turbo	94.2%	1.5%	$32.00
Claude 3.5	92.8%	2.1%	$15.00
SAGE-32B Hybrid	91.5%	2.4%	$4.50
Llama-3-70B	85.0%	8.2%	$6.00
Qwen2.5-32B	76.4%	14.5%	$2.80
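
One way to read the table is cost per successful episode. The sketch below assumes that "Cost/1k eps" means US dollars per 1,000 episodes and that success rate applies uniformly across those episodes; both readings are assumptions, and the derived metric does not appear in the source.

```python
# (success rate, cost in USD per 1,000 episodes) from the table above.
suite = {
    "GPT-4-Turbo":     (0.942, 32.00),
    "Claude 3.5":      (0.928, 15.00),
    "SAGE-32B Hybrid": (0.915, 4.50),
    "Llama-3-70B":     (0.850, 6.00),
    "Qwen2.5-32B":     (0.764, 2.80),
}

# Expected successes per 1,000 episodes is 1000 * success_rate.
cost_per_success = {
    model: cost / (1000 * rate) for model, (rate, cost) in suite.items()
}

ranking = sorted(cost_per_success, key=cost_per_success.get)
for model in ranking:
    print(f"{model}: ${cost_per_success[model]:.4f} per successful episode")
```

Under these assumptions the base Qwen2.5-32B remains cheapest per success, with SAGE-32B Hybrid second and well below the larger models, so the cost argument for SAGE-32B rests on combining low cost with a success rate above 90%.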

Hallucination rate ablation: SAGE-32B reduces hallucinations from 14.5% (base) to 5.2% after reflective distillation, and down to 2.4% post-RL.

6. Implementation and Availability

SAGE-32B is released under a research preview license, available at https://huggingface.co/sagea-ai/sage-reasoning-32b. Key implementation hyperparameters include: 64 layers, maximum context of 128k tokens, local attention window of 4,096, landmark stride $k=64$, and distillation batch size ≈256. Training utilizes 5 million synthetic multi-step trajectories with negative sampling, covering both synthetic and real-world (OSWorld/WebArena) environments. The CodePPO RL reward structure empirically balances task success with penalties for argument hallucination and syntax violations.

The release is intended for use with the HuggingFace Transformers library.

A plausible implication is that SAGE-32B's design, which combines inverse reasoning, hybrid training, and explicit meta-cognition, marks an advancement for agentic LLMs operating in long-horizon, multi-step planning and tool-use settings, while maintaining competitive cost efficiency (Jha et al., 4 Jan 2026).
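
As a hedged illustration of loading the checkpoint with Transformers, the sketch below uses the library's generic `AutoTokenizer` and `AutoModelForCausalLM` entry points with the repository path from the URL above. Whether the checkpoint loads through these generic classes, or needs extra options because of its custom modules, is an assumption that should be checked against the model card. The helper names `load_sage` and `generate` are hypothetical, and `device_map="auto"` additionally requires the `accelerate` package.

```python
MODEL_ID = "sagea-ai/sage-reasoning-32b"  # repository path from the release URL


def load_sage(model_id: str = MODEL_ID):
    """Load tokenizer and model (assumes a standard causal-LM checkpoint).

    Imports are local so this module can be imported without the heavy
    dependencies installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # spread layers across available devices
    )
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 256) -> str:
    """Run one generation call and decode the full output sequence."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (downloads the full 32B checkpoint, so it is not run here):
# tokenizer, model = load_sage()
# print(generate(tokenizer, model, "List the files in the current directory."))
```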


References (1)

SAGE-32B: Agentic Reasoning via Iterative Distillation (2026)
