SAGE-32B: Agentic 32B Language Model
- SAGE-32B is a 32 billion parameter language model focused on agentic reasoning, task decomposition, and iterative error recovery.
- It employs a multi-stage iterative distillation process, reflective training, and a meta-cognition head for inverse reasoning and failure forecasting.
- Empirical benchmarks show that SAGE-32B achieves superior multi-tool use and error recovery performance compared to similar models.
SAGE-32B is a 32 billion parameter LLM specifically optimized for agentic reasoning, long-range planning, and robust tool-use scenarios. Unlike general conversational LLMs, SAGE-32B is architected for operation within an agentic loop, emphasizing explicit task decomposition, tool invocation, and iterative error recovery. Developed by extending the Qwen2.5-32B decoder-only transformer, SAGE-32B incorporates novel modules, a multi-stage iterative distillation process, and introduces an auxiliary meta-cognition head for inverse reasoning and online failure forecasting. Publicly released under the SAGE-AI initiative, it is empirically evaluated on major agentic reasoning benchmarks, demonstrating superior performance in multi-tool use and error recovery compared to structurally similar and larger models (Jha et al., 4 Jan 2026).
1. Architecture and Initialization
SAGE-32B is constructed by augmenting the Qwen2.5-32B backbone with specialized modules. The total parameter footprint approximates 32 × 10⁹, with the core transformer layers accounting for ≈31 billion, split-embeddings and gating heads for ≈300 million, and the Meta-Cognition Head (MCH) for ≈200 million parameters.
Weight initialization is performed via normal sampling: Layer normalization employs RMSNorm with : Feed-forward networks use SwiGLU gating: A split-embedding strategy provides context-dependent embeddings: where is selected by a learned classifier. For long-context processing, a landmark attention mechanism mixes dense local attention over the last 4,096 tokens and global "landmark" tokens (stride ), enabling efficient operation up to 128k context length.
The MCH is an auxiliary attention layer grafted onto the terminal transformer block. It accepts and outputs a confidence vector: providing probabilities for stepwise plan failures.
2. Iterative Distillation and Training Paradigm
SAGE-32B adopts a multi-stage Iterative Distillation & Amplification (IDA) regime, integrating both teacher-student distillation and offline self-correction:
2.1 Distillation & Amplification (IDA)
Initial fine-tuning is teacher-driven: a hybrid GPT-4o/DeepSeek model generates 5 million synthetic agentic rollouts in environments such as OSWorld and WebArena. Negative-constraint sampling provides three types of hard negatives (type error, hallucinated key, logic error) for every correct action. The primary loss is cross-entropy on correct/negative discrimination ().
2.2 Reflective Distillation
Building on Reflexion and Self-Refine, but applied fully offline, the student collects rollouts; failures are critiqued and revised by the teacher. Corrections and critiques are added to the training buffer. The reflective loss is
0
with learning implemented by DPO gradient steps: 1
2.3 DPO Preference Learning for Safety
A DPO-based loss enforces the refusal of unsafe tool calls (e.g., delete_database()), with penalty coefficient 1.
2.4 Reinforcement Learning (CodePPO)
Function-calling is cast as program synthesis. Rewards are provided for successful code execution and semantic intent matches; penalties are applied for hallucinated arguments and syntax errors.
2.5 Composite Objective
Across phases, the optimization objective is blended: 2 where 3 is standard next-token cross-entropy.
3. Meta-Cognition and Inverse Reasoning
The meta-cognition head (MCH) implements inverse reasoning and online failure detection.
3.1 Architecture
The MCH leverages the final transformer hidden state 4, with a distinct projection 5, to produce per-step confidences.
3.2 Failure Forecasting
For each candidate agentic step 6, the MCH computes
7
If 8 exceeds a threshold 9, SAGE-32B enters a "Reasoning Mode"—invoking further look-ahead and candidate evaluation.
3.3 Inverse Consistency Score (ICS)
ICS quantifies how reasoning traces (0) can reconstruct the original context (1): 2 A dual-head architecture shares backbone features to estimate this reconstruction likelihood, typically by KL divergence or log-likelihood metrics.
3.4 Hybrid Energy Re-Ranking
Given 3 candidate continuations, SAGE-32B re-ranks by
4
4. Agentic Loop: Task Decomposition and Error Recovery
Agentic operation in SAGE-32B proceeds as follows:
- Decomposition of complex requests into a DAG of atomic, dependency-annotated subtasks.
- Iterative agentic reasoning loop: a) Generate candidate step 5. b) Calculate 6 via MCH. c) If 7, perform Look-Ahead Simulation (LAS), sampling 8 continuations and ranking by 9. d) Execute tool calls and observe results.
- On tool errors (e.g., syntax), invoke the reflective policy for in-loop critique and correction ("critique loop").
Reflective Distillation instills robust offline failure recognition and recovery while the MCH/LAS act as online verifiers during inference, minimizing cumulative error propagation.
5. Benchmarking and Empirical Performance
SAGE-32B achieves significant empirical gains over its backbone and several industry baselines, especially in multi-tool and long-horizon agentic tasks.
5.1 Reasoning and Tool Use Benchmarks
| Benchmark | Qwen2.5-32B | SAGE-32B (Std) | SAGE-32B (Think, k=32) | Llama-3-70B | GPT-4-Turbo |
|---|---|---|---|---|---|
| MMLU-Pro | 71.5 | 75.6 | 79.3 | 68.9 | 63.7 |
| MATH-500 | 78.9 | 78.9 | 91.8* | 68.0 | 72.6 |
| AgentBench | 58.4 | 58.4 | 73.1 | 62.1 | 85.0 |
| GPQA | 50.5 | 48.0 | 48.0 | 51.0 | 53.6 |
| IFEval | 81.2 | 84.5 | 84.5 | 78.5 | 86.0 |
*Note: “Think” mode utilizes Majority-Vote@32 with ICS filtering, yielding MATH-500 improvements (±0.4% stdev).
5.2 Error Recovery and Efficiency
- Internal Recovery Rate (IRR): SAGE-32B reaches 76%, doubling the Qwen2.5-32B base (35%).
- AgentBench agentic modes:
- Fast: 58.4% @1.2s (1.0× cost)
- Slow: 73.1% @4.5s (3.8×)
- Hybrid: 71.8% @1.8s (1.4×)
5.3 Tool-Calling (Enterprise-500) Suite
| Model | Success Rate | Unforced Errors | Cost/1k eps |
|---|---|---|---|
| GPT-4-Turbo | 94.2% | 1.5% | $32.00 |
| Claude 3.5 | 92.8% | 2.1% | $15.00 |
| SAGE-32B Hybrid | 91.5% | 2.4% | $4.50 |
| Llama-3-70B | 85.0% | 8.2% | $6.00 |
| Qwen2.5-32B | 76.4% | 14.5% | $2.80 |
Hallucination rate ablation: SAGE-32B reduces hallucinations from 14.5% (base) to 5.2% after reflective distillation, and down to 2.4% post-RL.
6. Implementation and Availability
SAGE-32B is released under a research preview license, available at https://huggingface.co/sagea-ai/sage-reasoning-32b. Key implementation hyperparameters include: 64 layers, maximum context of 128k tokens, local attention window of 4,096, landmark stride $\mathrm{FFN}(x) = \mathrm{Swish}_\beta(xW_G) \odot (xW_1) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)$0, and distillation batch size ≈256. Training utilizes 5 million synthetic multi-step trajectories with negative sampling, covering both synthetic and real-world (OSWorld/WebArena) environments. The CodePPO RL reward structure empirically balances task success with penalties for argument hallucination and syntax violations.
For use with the HuggingFace Transformers library: $\mathrm{FFN}(x) = \mathrm{Swish}_\beta(xW_G) \odot (xW_1) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)$2 A plausible implication is that SAGE-32B's design—combining inverse reasoning, hybrid training, and explicit meta-cognition—marks an advancement for agentic LLMs operating under long-horizon, multi-step planning, and tool-use settings, while maintaining competitive cost efficiency (Jha et al., 4 Jan 2026).