Papers
Topics
Authors
Recent
Search
2000 character limit reached

SAGE-32B: Agentic 32B Language Model

Updated 11 May 2026
  • SAGE-32B is a 32 billion parameter language model focused on agentic reasoning, task decomposition, and iterative error recovery.
  • It employs a multi-stage iterative distillation process, reflective training, and a meta-cognition head for inverse reasoning and failure forecasting.
  • Empirical benchmarks show that SAGE-32B achieves superior multi-tool use and error recovery performance compared to similar models.

SAGE-32B is a 32 billion parameter LLM specifically optimized for agentic reasoning, long-range planning, and robust tool-use scenarios. Unlike general conversational LLMs, SAGE-32B is architected for operation within an agentic loop, emphasizing explicit task decomposition, tool invocation, and iterative error recovery. Developed by extending the Qwen2.5-32B decoder-only transformer, SAGE-32B incorporates novel modules, a multi-stage iterative distillation process, and introduces an auxiliary meta-cognition head for inverse reasoning and online failure forecasting. Publicly released under the SAGE-AI initiative, it is empirically evaluated on major agentic reasoning benchmarks, demonstrating superior performance in multi-tool use and error recovery compared to structurally similar and larger models (Jha et al., 4 Jan 2026).

1. Architecture and Initialization

SAGE-32B is constructed by augmenting the Qwen2.5-32B backbone with specialized modules. The total parameter footprint approximates 32 × 10⁹, with the core transformer layers accounting for ≈31 billion, split-embeddings and gating heads for ≈300 million, and the Meta-Cognition Head (MCH) for ≈200 million parameters.

Weight initialization is performed via normal sampling: WN(0,σ2),σ=0.022L,  L=64W \sim \mathcal{N}\left(0, \sigma^2 \right),\quad \sigma = \frac{0.02}{\sqrt{2L}},\;L=64 Layer normalization employs RMSNorm with ϵ=106\epsilon=10^{-6}: RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma Feed-forward networks use SwiGLU gating: FFN(x)=Swishβ(xWG)(xW1)W2,Swishβ(z)=zσ(βz)\mathrm{FFN}(x) = \mathrm{Swish}_\beta(xW_G) \odot (xW_1) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z) A split-embedding strategy provides context-dependent embeddings: Einput(t)=αtENL(xt)+(1αt)ECode(xt),αt[0,1]E_{input}^{(t)} = \alpha_t E_{NL}(x_t) + (1-\alpha_t)E_{Code}(x_t),\quad \alpha_t \in [0,1] where αt\alpha_t is selected by a learned classifier. For long-context processing, a landmark attention mechanism mixes dense local attention over the last 4,096 tokens and global "landmark" tokens (stride k=64k=64), enabling efficient operation up to 128k context length.

The MCH is an auxiliary attention layer grafted onto the terminal transformer block. It accepts hlasth_{last} and outputs a confidence vector: ct=σ(WMCHhlast+b)c_t = \sigma(W_{MCH} h_{last} + b) providing probabilities for stepwise plan failures.

2. Iterative Distillation and Training Paradigm

SAGE-32B adopts a multi-stage Iterative Distillation & Amplification (IDA) regime, integrating both teacher-student distillation and offline self-correction:

2.1 Distillation & Amplification (IDA)

Initial fine-tuning is teacher-driven: a hybrid GPT-4o/DeepSeek model generates 5 million synthetic agentic rollouts in environments such as OSWorld and WebArena. Negative-constraint sampling provides three types of hard negatives (type error, hallucinated key, logic error) for every correct action. The primary loss is cross-entropy on correct/negative discrimination (LKDL_{KD}).

2.2 Reflective Distillation

Building on Reflexion and Self-Refine, but applied fully offline, the student collects rollouts; failures are critiqued and revised by the teacher. Corrections and critiques are added to the training buffer. The reflective loss is

ϵ=106\epsilon=10^{-6}0

with learning implemented by DPO gradient steps: FFN(x)=Swishβ(xWG)(xW1)W2,Swishβ(z)=zσ(βz)\mathrm{FFN}(x) = \mathrm{Swish}_\beta(xW_G) \odot (xW_1) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)1

2.3 DPO Preference Learning for Safety

A DPO-based loss enforces the refusal of unsafe tool calls (e.g., delete_database()), with penalty coefficient ϵ=106\epsilon=10^{-6}1.

2.4 Reinforcement Learning (CodePPO)

Function-calling is cast as program synthesis. Rewards are provided for successful code execution and semantic intent matches; penalties are applied for hallucinated arguments and syntax errors.

2.5 Composite Objective

Across phases, the optimization objective is blended: ϵ=106\epsilon=10^{-6}2 where ϵ=106\epsilon=10^{-6}3 is standard next-token cross-entropy.

3. Meta-Cognition and Inverse Reasoning

The meta-cognition head (MCH) implements inverse reasoning and online failure detection.

3.1 Architecture

The MCH leverages the final transformer hidden state ϵ=106\epsilon=10^{-6}4, with a distinct projection ϵ=106\epsilon=10^{-6}5, to produce per-step confidences.

3.2 Failure Forecasting

For each candidate agentic step ϵ=106\epsilon=10^{-6}6, the MCH computes

ϵ=106\epsilon=10^{-6}7

If ϵ=106\epsilon=10^{-6}8 exceeds a threshold ϵ=106\epsilon=10^{-6}9, SAGE-32B enters a "Reasoning Mode"—invoking further look-ahead and candidate evaluation.

3.3 Inverse Consistency Score (ICS)

ICS quantifies how reasoning traces (RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma0) can reconstruct the original context (RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma1): RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma2 A dual-head architecture shares backbone features to estimate this reconstruction likelihood, typically by KL divergence or log-likelihood metrics.

3.4 Hybrid Energy Re-Ranking

Given RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma3 candidate continuations, SAGE-32B re-ranks by

RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma4

4. Agentic Loop: Task Decomposition and Error Recovery

Agentic operation in SAGE-32B proceeds as follows:

  1. Decomposition of complex requests into a DAG of atomic, dependency-annotated subtasks.
  2. Iterative agentic reasoning loop: a) Generate candidate step RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma5. b) Calculate RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma6 via MCH. c) If RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma7, perform Look-Ahead Simulation (LAS), sampling RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma8 continuations and ranking by RMSNorm(x)=x1dixi2+ϵγ\mathrm{RMSNorm}(x)=\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}} \odot \gamma9. d) Execute tool calls and observe results.
  3. On tool errors (e.g., syntax), invoke the reflective policy for in-loop critique and correction ("critique loop").

Reflective Distillation instills robust offline failure recognition and recovery while the MCH/LAS act as online verifiers during inference, minimizing cumulative error propagation.

5. Benchmarking and Empirical Performance

SAGE-32B achieves significant empirical gains over its backbone and several industry baselines, especially in multi-tool and long-horizon agentic tasks.

5.1 Reasoning and Tool Use Benchmarks

Benchmark Qwen2.5-32B SAGE-32B (Std) SAGE-32B (Think, k=32) Llama-3-70B GPT-4-Turbo
MMLU-Pro 71.5 75.6 79.3 68.9 63.7
MATH-500 78.9 78.9 91.8* 68.0 72.6
AgentBench 58.4 58.4 73.1 62.1 85.0
GPQA 50.5 48.0 48.0 51.0 53.6
IFEval 81.2 84.5 84.5 78.5 86.0

*Note: “Think” mode utilizes Majority-Vote@32 with ICS filtering, yielding MATH-500 improvements (±0.4% stdev).

5.2 Error Recovery and Efficiency

  • Internal Recovery Rate (IRR): SAGE-32B reaches 76%, doubling the Qwen2.5-32B base (35%).
  • AgentBench agentic modes:
    • Fast: 58.4% @1.2s (1.0× cost)
    • Slow: 73.1% @4.5s (3.8×)
    • Hybrid: 71.8% @1.8s (1.4×)

5.3 Tool-Calling (Enterprise-500) Suite

Model Success Rate Unforced Errors Cost/1k eps
GPT-4-Turbo 94.2% 1.5% $32.00
Claude 3.5 92.8% 2.1% $15.00
SAGE-32B Hybrid 91.5% 2.4% $4.50
Llama-3-70B 85.0% 8.2% $6.00
Qwen2.5-32B 76.4% 14.5% $2.80

Hallucination rate ablation: SAGE-32B reduces hallucinations from 14.5% (base) to 5.2% after reflective distillation, and down to 2.4% post-RL.

6. Implementation and Availability

SAGE-32B is released under a research preview license, available at https://huggingface.co/sagea-ai/sage-reasoning-32b. Key implementation hyperparameters include: 64 layers, maximum context of 128k tokens, local attention window of 4,096, landmark stride $\mathrm{FFN}(x) = \mathrm{Swish}_\beta(xW_G) \odot (xW_1) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)$0, and distillation batch size ≈256. Training utilizes 5 million synthetic multi-step trajectories with negative sampling, covering both synthetic and real-world (OSWorld/WebArena) environments. The CodePPO RL reward structure empirically balances task success with penalties for argument hallucination and syntax violations.

For use with the HuggingFace Transformers library: $\mathrm{FFN}(x) = \mathrm{Swish}_\beta(xW_G) \odot (xW_1) W_2, \quad \mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)$2 A plausible implication is that SAGE-32B's design—combining inverse reasoning, hybrid training, and explicit meta-cognition—marks an advancement for agentic LLMs operating under long-horizon, multi-step planning, and tool-use settings, while maintaining competitive cost efficiency (Jha et al., 4 Jan 2026).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SAGE-32B.