Agentic-Proposer-4B: Agentic LLM for Reasoning
- Agentic-Proposer-4B is a 4-billion parameter LLM that acts as an agentic, skill-synthesizing proposer for complex reasoning tasks.
- It employs structured cognitive-action loops, trajectory-based fine-tuning, and RL techniques to dynamically select and compose reasoning skills.
- Empirical results demonstrate its competitive performance in math, coding, and science, with improved tool-call accuracy and efficiency.
Agentic-Proposer-4B is a 4-billion parameter LLM designed as an agentic, skill-synthesizing proposer for complex reasoning and tool-augmented workflows. Originating from recent advances in agentic LLM research, it underpins several frameworks for automated problem synthesis, evidence-grounded question answering, and agentic trajectory generation across mathematics, coding, and scientific domains (Jiao et al., 3 Feb 2026, Yu et al., 13 Oct 2025, Bhatia et al., 12 Jan 2026). The model achieves strong empirical performance—often rivaling or surpassing larger models—through a combination of real trajectory-based supervised fine-tuning, exploration-optimized reinforcement learning, and structured cognitive-action loops for deliberative tool use.
1. Model Architecture and System Components
Agentic-Proposer-4B is built atop the Qwen3-4B-Instruct-2507 decoder-only Transformer architecture, comprising 32 layers, 4,096 hidden units per layer, and 32 attention heads, with a context window of 20,480 tokens (Jiao et al., 3 Feb 2026). The model's system-level role is that of a sequencer and planner: at each step, it conditions generation not only on dialogue history but also on an "active skill set" derived from a precompiled skill library. This enables the model to dynamically select, prune, and compose atomic reasoning skills through templated cognitive actions and tool use. When deployed in retrieval-augmented or tool-assisted settings, Agentic-Proposer-4B interoperates with deterministic tool executors (e.g., search engines, code interpreters, or information retrievers) via structured, schema-constrained outputs, often JSON-formatted.
The following table summarizes core architectural features:
| Feature | Specification | Note |
|---|---|---|
| Base model | Qwen3-4B-Instruct-2507 | Decoder-only Transformer |
| Layers | 32 | |
| Hidden size | 4,096 | |
| Attention heads | 32 | |
| Context length | 20,480 tokens | |
| Skill interface | Active skill set | Dynamic, behavioral cloning and RL updated |
| Value head | 2-layer MLP | Optional, for RL |
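The schema-constrained tool-call interface described above can be sketched as follows. This is a minimal illustration, not the model's actual output format: the field names (`tool`, `arguments`) and the example tool name are assumptions for demonstration.

```python
import json

# Hypothetical tool-call schema: field names are illustrative, not the
# model's actual output format.
TOOL_CALL_REQUIRED_FIELDS = {"tool": str, "arguments": dict}

def parse_tool_call(raw: str) -> dict:
    """Parse and validate a schema-constrained tool call emitted by the model."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    for field, expected_type in TOOL_CALL_REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), expected_type):
            raise ValueError(f"tool call missing or mistyped field: {field}")
    return obj

call = parse_tool_call('{"tool": "code_interpreter", "arguments": {"source": "print(2+2)"}}')
```

Validating against a fixed schema before dispatching to a deterministic executor is what keeps the tool interface reliable even when the model's free-form generation varies.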
2. Agentic Proposing and Cognitive Workflow
The Agentic Proposing framework models structured problem or trajectory synthesis as a goal-driven sequential decision-making process, formally instantiated as a partially observable Markov decision process (POMDP) (Jiao et al., 3 Feb 2026). Observations at each timestep consist of the active skill subset, the evolving dialogue or draft history, and a discrete cognitive-stage indicator (Draft, Check, Refine, Finalize). The action space unifies natural-language reflection, tool invocation, and final problem submission.
Internal reasoning unfolds as an iterative Draft–Check–Refine loop, in which the agent deliberates via chain-of-thought, selectively invokes external tools (e.g., code execution, symbolic validation), and adaptively composes or prunes skills as needed. Skill-conditioned sampling is guided by a mapping operator that transforms selected skills into high-dimensional constraint text for problem drafting. Deliberative policies, which reason at length before any tool call, yield greater tool efficiency compared to reactive, high-frequency tool use (Yu et al., 13 Oct 2025).
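The Draft–Check–Refine–Finalize loop above can be sketched as a small state machine. The `draft`, `check`, and `refine` functions here are toy stand-ins; in the real system these steps are performed by the model via chain-of-thought generation and tool calls.

```python
# Toy stand-ins for the model's actual generation and validation steps.
def draft(history):
    return history + ["draft"]

def check(history):
    # Stand-in validator: accept once the draft has been refined at least once.
    return history.count("refine") >= 1

def refine(history):
    return history + ["refine"]

def cognitive_loop(max_steps=8):
    """Run the Draft -> Check -> Refine -> Finalize stage machine."""
    history, stage = [], "Draft"
    for _ in range(max_steps):
        if stage == "Draft":
            history, stage = draft(history), "Check"
        elif stage == "Check":
            stage = "Finalize" if check(history) else "Refine"
        elif stage == "Refine":
            history, stage = refine(history), "Check"
        else:  # Finalize
            break
    return history, stage

trace, final_stage = cognitive_loop()
```

The bounded step budget mirrors how deliberative agents cap the number of refinement rounds before submitting a final answer.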
3. Multi-Granularity Policy Optimization (MGPO) and RL Recipes
After supervised behavioral cloning from expert trajectories, Agentic-Proposer-4B is aligned through specialized reinforcement learning objectives. The Multi-Granularity Policy Optimization (MGPO) algorithm augments standard KL-constrained reward maximization with dual-granularity advantages: trajectory-level ($A^{\text{traj}}$, e.g., final solution correctness, difficulty bonus) and stage-level ($A^{\text{stage}}$, e.g., intermediate tool-call success, reflection coherence) (Jiao et al., 3 Feb 2026). The RL objective takes the standard KL-constrained form

$$\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_{\theta}}\big[R(\tau)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right).$$

The policy is optimized online using importance-weighted sample updates, an asymmetric gate on the importance ratio, and the composed advantage $A = A^{\text{traj}} + \lambda\, A^{\text{stage}}$.
Empirically, ablations reveal that removal of stage-level signals diminishes performance by several percentage points, confirming the importance of intra-trajectory credit assignment (Jiao et al., 3 Feb 2026). Complementary RL recipes (e.g., GRPO-TCR with clip-higher and entropy regularization) further stabilize training and promote exploration efficiency (Yu et al., 13 Oct 2025).
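The composed dual-granularity advantage and asymmetric gating can be illustrated with a PPO/GRPO-style surrogate loss. This is an illustrative sketch, not the paper's exact implementation; the clip bounds and the mixing weight `lam` are assumed placeholder values.

```python
# Illustrative sketch (not the paper's exact implementation) of a composed
# dual-granularity advantage with an asymmetric "clip-higher" gate on the
# importance ratio, as used in PPO/GRPO-style updates.
def policy_loss(ratio, adv_traj, adv_stage, lam=0.5, eps_low=0.2, eps_high=0.28):
    adv = adv_traj + lam * adv_stage                       # composed advantage
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)   # asymmetric gate
    # Pessimistic (min) surrogate, negated so lower loss = higher reward.
    return -min(ratio * adv, clipped * adv)

loss = policy_loss(ratio=1.5, adv_traj=1.0, adv_stage=0.4, lam=0.5)
```

Setting `eps_high > eps_low` (the "clip-higher" trick) lets the update push probability mass upward more aggressively than downward, which is commonly used to encourage exploration.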
4. Skill Library, Acquisition, and Compositional Synthesis
The “skill library” is a curated set of atomic reasoning modules, each annotated as a tuple encoding semantic intent, construction method, difficulty scale, and executable tool-hint (Jiao et al., 3 Feb 2026). Skills are extracted from expert-written or teacher-LM-generated corpora via policy-guided rejection sampling. High-quality skills (those passing a threshold under a quality scorer) are subject to behavioral cloning to instantiate a robust, diverse initial policy.
During synthesis, Agentic-Proposer-4B composes skills into multi-faceted reasoning trajectories (e.g., combinatorial proofs blended with PDE analysis), conditionally sampling from underperforming categories as monitored by online mastery estimates. This compositional synthesis is central to generating complex, realistic training data for downstream solvers.
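A minimal sketch of the skill record and mastery-weighted sampling described above, assuming hypothetical field names that mirror the tuple (intent, construction method, difficulty, tool hint) but are not the paper's exact schema:

```python
import random
from dataclasses import dataclass

# Hypothetical skill record; field names are illustrative, not the paper's
# exact annotation schema.
@dataclass
class Skill:
    intent: str
    construction: str
    difficulty: int
    tool_hint: str

def sample_skills(skills, mastery, k=2, rng=None):
    """Sample k skills, weighting under-mastered categories more heavily."""
    rng = rng or random.Random(0)
    weights = [1.0 - mastery.get(s.intent, 0.0) for s in skills]
    return rng.choices(skills, weights=weights, k=k)

library = [
    Skill("combinatorics", "template", 3, "sympy"),
    Skill("pde_analysis", "teacher_lm", 5, "code_interpreter"),
]
# Online mastery estimates: high mastery lowers a category's sampling weight.
mastery = {"combinatorics": 0.9, "pde_analysis": 0.2}
picked = sample_skills(library, mastery, k=2)
```

Biasing sampling toward low-mastery categories is what steers synthesis toward the solver's current weaknesses.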
5. Data Regimen, Training, and RL Implementation
Agentic-Proposer-4B’s supervised and RL pipelines hinge on high-diversity, real end-to-end agentic trajectories. Supervised fine-tuning typically leverages tens of thousands of JSON-formatted instances, each capturing full “think–tool–reflect–tool–answer” chains collected using teacher models such as Qwen3-Coder-30B-A3B (Yu et al., 13 Oct 2025). Quality filtering using specialized reward models (e.g., ReasonFlux-PRM) yields substantial performance gains over synthetic, manually stitched CoT data, raising cold-start accuracy by up to 51 points (maj@32) (Yu et al., 13 Oct 2025).
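Reward-model-based quality filtering of this kind can be sketched as below. The `score` function is a toy proxy standing in for a process reward model such as ReasonFlux-PRM, and the threshold value is illustrative.

```python
# Toy proxy for a process reward model: rewards complete
# think-tool-answer chains. A real PRM scores each intermediate step.
def score(trajectory):
    needed = {"think", "tool", "answer"}
    seen = {step["type"] for step in trajectory}
    return len(needed & seen) / len(needed)

def filter_trajectories(trajectories, threshold=0.9):
    """Keep only trajectories whose reward-model score clears the threshold."""
    return [t for t in trajectories if score(t) >= threshold]

good = [{"type": "think"}, {"type": "tool"}, {"type": "answer"}]
bad = [{"type": "answer"}]
kept = filter_trajectories([good, bad])
```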
RL datasets are sampled from mixed math, code, and science problems, often rebalanced by empirical difficulty. Training uses distributed AdamW with linear warmup, cosine decay, batch sizes up to 128, flash-attention and mixed-precision optimization.
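The linear-warmup-then-cosine-decay schedule mentioned above can be written as a closed-form function of the step index. Step counts and peak learning rate here are illustrative placeholders, not the paper's training configuration.

```python
import math

# Linear warmup followed by cosine decay; hyperparameter values are
# illustrative placeholders.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4):
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```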
A comparison of RL approaches is summarized below:
| Technique | Sample Efficiency | Final Accuracy (AIME2025, avg@32) |
|---|---|---|
| GRPO | Moderate | 55% |
| GRPO-TCR | 1.7× faster | 70% |
6. Empirical Evaluation and Impact
Empirical results across standard benchmarks demonstrate the capability of Agentic-Proposer-4B and agent-trained downstream solvers to achieve, and in some cases exceed, the performance of much larger baseline models:
- Mathematics (AIME24/25, HMMT): 4B-scale solvers trained on agentic data outperform synthetic and human-annotated data baselines, with up to +4.5 points on AIME 25 (Jiao et al., 3 Feb 2026).
- Code (LiveCodeBench v6): Pass@5 improves to 26.8%, surpassing larger models (Yu et al., 13 Oct 2025).
- Science (GPQA, SuperGPQA): Robust +5–7 point gains via cross-domain generalization (Jiao et al., 3 Feb 2026).
- Deliberative tool-use: Agentic-Proposer-4B’s policies yield greater than 70% tool-call accuracy with fewer total invocations (Yu et al., 13 Oct 2025).
When deployed in retrieval-augmented frameworks—such as the agentic Quran-grounding setting for faith-verified Islamic QA—Agentic-Proposer-4B orchestrates propose–execute–revise loops, planning JSON tool calls, integrating dense retrieval evidence, and minimizing hallucinations. Hallucination rates drop from 45% to 23% and abstention from 16% to 4% compared to single-shot RAG, while correctness rises to near 49% and bilingual robustness is significantly improved (Bhatia et al., 12 Jan 2026).
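The propose–execute–revise loop with evidence grounding and abstention can be sketched as follows. `retrieve` and `grounded` are toy stand-ins for the dense retriever and the faithfulness check; the corpus and answer strings are illustrative.

```python
# Toy corpus standing in for the dense retrieval index.
CORPUS = {"q1": "evidence for q1"}

def retrieve(query):
    return CORPUS.get(query)

def grounded(answer, evidence):
    # Stand-in faithfulness check: the answer must cite the evidence.
    return evidence is not None and evidence in answer

def answer_with_grounding(query, max_revisions=2):
    """Propose, check grounding, revise with evidence, or abstain."""
    evidence = retrieve(query)
    answer = "draft answer"
    for _ in range(max_revisions):
        if grounded(answer, evidence):
            return answer
        if evidence is None:
            return "ABSTAIN"  # abstain rather than hallucinate
        answer = f"draft answer citing {evidence}"  # revise with evidence
    return "ABSTAIN"

ok = answer_with_grounding("q1")
miss = answer_with_grounding("unknown")
```

Refusing to answer when retrieval returns nothing is the mechanism behind the reduced hallucination rate: ungrounded proposals are revised or abandoned rather than emitted.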
7. Practical Implementation and Open Resources
Agentic-Proposer-4B supports efficient distributed training (DeepSpeed ZeRO-1), mixed-precision inference, and integration with agentic toolchains via structured prompts. Training can be performed on 8×A100-80 GB GPUs, with RL wall-clock times of roughly 48 hours (Yu et al., 13 Oct 2025). Open-source implementations and checkpoints are publicly available.
The framework and associated data recipes serve as a reference for building robust, resource-efficient agentic LLMs and have established new baselines for agentic reasoning benchmarks at small model scales (Yu et al., 13 Oct 2025, Jiao et al., 3 Feb 2026, Bhatia et al., 12 Jan 2026).