
Agent0-VL: Self-Evolving Vision-Language Agent

Updated 26 November 2025
  • Agent0-VL is a self-evolving vision-language agent that integrates tool-based reasoning, verification, and self-repair to handle complex multimodal tasks.
  • It alternates between Solver and Verifier roles, using structured tool calls and evidence-grounded self-evaluation to iteratively enhance performance.
  • Experimental results demonstrate significant gains over baseline models, with metrics improving up to +18.1% on challenging visual and mathematical benchmarks.

Agent0-VL is a self-evolving vision-language agent for complex multimodal reasoning tasks that unifies tool-integrated reasoning, verification, and repair within a single large vision-language model (LVLM). Rather than relying on human-annotated supervision or external reward models, Agent0-VL employs structured, evidence-grounded self-evaluation and reinforcement learning under zero external reward, enabling continual self-improvement in domains such as geometric problem solving and scientific visual analysis. This architecture yields substantial gains over baseline models on challenging benchmarks, notably a 12.5% improvement over its Qwen2.5-VL-7B base across seven standard datasets (Liu et al., 25 Nov 2025).

1. Conceptual Foundation and Key Contributions

Agent0-VL introduces a unified agent architecture in which a single LVLM alternates between two internally synergistic roles: Solver and Verifier. The Solver executes multi-turn tool-integrated reasoning, issuing structured tool calls (e.g., Python scripts, image operations, vision API queries) and incorporating their results. The Verifier subsequently performs dense self-evaluation through tool-grounded feedback tuples $(\mathrm{score}_t, \mathrm{conf}_t, \mathrm{critique}_t)$. Critically, Agent0-VL implements a confidence-gated self-repair module that detects low-confidence reasoning segments and actively resamples, patching erroneous steps based on evidence.

Through the Self-Evolving Reasoning Cycle (SERC), Agent0-VL synchronizes its reasoning and verification distributions by jointly optimizing both roles with Group Relative Policy Optimization (GRPO) and internal process rewards. This design enables zero-external-reward evolution and self-alignment without dependence on human annotation or external reward models. Agent0-VL’s Verifier can also function as a stand-alone process reward model, enhancing best-of-$n$ selection strategies on sampled outputs.
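
Because the Verifier emits dense, step-level feedback, it can score sampled trajectories directly. The following Python sketch illustrates best-of-$n$ selection driven by such scores; it is a minimal illustration, and the sample_trajectory and verify_step interfaces are hypothetical stand-ins rather than the paper's actual API.

# Hedged sketch: using the Verifier as a stand-alone process reward model
# to rank n sampled Solver trajectories (best-of-n selection).
from typing import Callable, List, Tuple

def best_of_n(
    sample_trajectory: Callable[[], Tuple[str, List[dict]]],  # hypothetical: returns (answer, steps)
    verify_step: Callable[[dict], Tuple[float, float]],       # hypothetical: returns (score_t, conf_t)
    n: int = 8,
) -> str:
    """Keep the sampled answer whose confidence-weighted step scores are highest."""
    best_answer, best_value = None, float("-inf")
    for _ in range(n):
        answer, steps = sample_trajectory()
        if not steps:
            continue
        # Aggregate the dense feedback as the mean of score_t * conf_t over steps.
        value = sum(s * c for s, c in map(verify_step, steps)) / len(steps)
        if value > best_value:
            best_answer, best_value = answer, value
    return best_answer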

2. Model Architecture and Workflow

Agent0-VL is parameterized by $\theta$ and alternates between Solver ($m=\mathrm{S}$) and Verifier ($m=\mathrm{V}$) operational modes. The unified policy is defined as

$$\pi_\theta(a_t \mid s_t, m) = \begin{cases} \pi_\theta^{\mathrm{S}}(a_t \mid s_t), & m = \mathrm{S} \\ \pi_\theta^{\mathrm{V}}(V_t \mid s_t, a_t, o_t), & m = \mathrm{V} \end{cases}$$

where, in Verifier mode, the policy conditions on the Solver's action $a_t$ and the tool observation $o_t$ to emit the feedback $V_t$.

Each reasoning episode comprises a trajectory of multi-turn reasoning, tool call execution, observation ingestion, verification, confidence assessment, and, when warranted, self-repair. The following pseudocode formalizes the dual-policy flow:

Algorithm UnifiedSolverVerifier(θ):
  input: instance x = (I, q)
  s₁ ← Encode(I, q)
  for t = 1 … T do
    # Solver turn
    aₜ ~ π_θ^S(·|sₜ)
    if aₜ is a tool call then
      oₜ ← ExecuteTool(aₜ)
    else
      oₜ ← null
    end
    # Verifier turn: yields Vₜ = (scoreₜ, confₜ, critiqueₜ)
    Vₜ ← π_θ^V(·|sₜ, aₜ, oₜ)
    sₜ₊₁ ← UpdateState(sₜ, aₜ, oₜ, Vₜ)
  end
  return trajectory τ = {(sₜ, aₜ, oₜ, Vₜ)} for t = 1…T

This architecture supports introspection and reject/resample workflows, converting initial errors into correct final answers through iterative refinement; a minimal runnable counterpart of the loop is sketched below.
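
The following Python sketch mirrors the dual-policy flow above. It is a minimal illustration, assuming stand-in interfaces (solver_step, verifier_step, execute_tool, encode, update_state) that the paper does not specify, and hypothetical predicates for detecting tool calls and final answers.

from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Step:
    state: Any
    action: Any                        # Solver output: free-form reasoning or a structured tool call
    observation: Any                   # tool output, or None for pure reasoning steps
    feedback: Optional[tuple] = None   # (score_t, conf_t, critique_t) from the Verifier

def is_tool_call(action) -> bool:
    # Placeholder predicate: treat dicts with a "tool" key as structured tool calls.
    return isinstance(action, dict) and "tool" in action

def is_final_answer(action) -> bool:
    # Placeholder predicate: treat strings containing an answer tag as terminal.
    return isinstance(action, str) and "<answer>" in action

def run_episode(solver_step, verifier_step, execute_tool, encode, update_state,
                image, question, max_turns: int = 8) -> List[Step]:
    """Alternate Solver and Verifier roles over a single reasoning episode."""
    state = encode(image, question)
    trajectory: List[Step] = []
    for _ in range(max_turns):
        # Solver turn: propose the next action (possibly a tool call).
        action = solver_step(state)
        observation = execute_tool(action) if is_tool_call(action) else None
        # Verifier turn: grade the step with evidence-grounded feedback.
        score, conf, critique = verifier_step(state, action, observation)
        trajectory.append(Step(state, action, observation, (score, conf, critique)))
        state = update_state(state, action, observation, (score, conf, critique))
        if is_final_answer(action):
            break
    return trajectory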

3. Self-Evolving Reasoning Cycle (SERC) and Reinforcement Learning

SERC consists of two loops:

  • Inner loop: Alternates generation (Solver), verification (Verifier), and, based on feedback confidence, selective self-repair.
  • Outer loop: Updates policies via GRPO, guided strictly by internal process rewards.

The RL objective optimizes returns based only on evidence-grounded feedback:

$$J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[g(\tau)\big], \qquad g(\tau)=\alpha_{\rm out}\,r_{\rm out}+\sum_{t=1}^{T}\gamma^{t-1} r_t,$$

where the step reward $r_t$ is decomposed as

$$r_t = r^{(t)}_{\rm proc} - g_t \cdot C_{\rm repair}^{(t)}$$

with

$$r^{(t)}_{\rm proc} = \lambda_{\rm tool}\, r(\mathrm{tool}_t) + \mathrm{score}_t \cdot \mathrm{conf}_t - \beta_{\rm div}\, D_{\rm KL}\!\bigl(\pi_\theta^{\rm V} \,\Vert\, \pi_\theta^{\rm S}\bigr).$$

A confidence-gating mechanism triggers self-repair when $\mathrm{conf}_t < \tau_c$, enforcing step-level consistency and process reliability.
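
As a worked illustration of the reward decomposition above, the sketch below computes $r_t$ and $g(\tau)$ from their components. The hyperparameter values, the repair-cost constant, and the treatment of the KL term as a precomputed scalar are assumptions made for illustration only.

def step_reward(tool_success: float, score_t: float, conf_t: float,
                kl_verifier_solver: float, repaired: bool,
                lambda_tool: float = 0.5,      # assumed value for lambda_tool
                beta_div: float = 0.01,        # assumed value for beta_div
                repair_cost: float = 0.2) -> float:
    """r_t = r_proc^(t) - g_t * C_repair^(t), with
    r_proc^(t) = lambda_tool * r(tool_t) + score_t * conf_t - beta_div * KL(pi_V || pi_S)."""
    r_proc = lambda_tool * tool_success + score_t * conf_t - beta_div * kl_verifier_solver
    gate = 1.0 if repaired else 0.0    # g_t: 1 when the repair module fired at step t
    return r_proc - gate * repair_cost

def trajectory_return(step_rewards, r_out: float,
                      alpha_out: float = 1.0, gamma: float = 0.99) -> float:
    """g(tau) = alpha_out * r_out + sum_{t=1..T} gamma^(t-1) * r_t."""
    return alpha_out * r_out + sum(gamma ** t * r for t, r in enumerate(step_rewards))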

GRPO is formulated as

$$\mathcal{L}_{\rm EDLP} = -\mathbb{E}_i\Big[\min\bigl(\rho_i \hat A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat A_i\bigr)\Big] + \beta_{\rm KL}\, D_{\rm KL}\bigl(\pi_\theta \,\Vert\, \pi_{\theta_{\rm old}}\bigr),$$

with group-normalized advantages $\hat A_i$ and policy-ratio weighting $\rho_i$ ensuring stable improvement.
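
A minimal PyTorch-style sketch of this clipped, group-relative objective follows. The grouping convention (one group of G rollouts per prompt), the hyperparameter values, and the simple sample-based KL estimate are assumptions, not the exact implementation used in the paper.

import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, eps: float = 0.2,
              beta_kl: float = 0.01) -> torch.Tensor:
    """Clipped surrogate with group-normalized advantages.

    logp_new, logp_old: [G] sequence log-probs for one group of G rollouts
    rewards:            [G] scalar returns g(tau) for the same group
    """
    # Group-relative advantage: normalize returns within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Crude sample-based KL penalty toward the old policy.
    kl = (logp_old.detach() - logp_new).mean()
    return -surrogate.mean() + beta_kl * kl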

4. Tool-Integrated Multimodal Reasoning

At each Solver step, Agent0-VL interacts with external tools, integrating their outputs directly into the reasoning process. Supported tool modalities include:

  • Python execution: Arithmetic, algebraic, and geometric computation.
  • Image operations: Cropping, zooming, and OCR.
  • Vision APIs: Object detection and scene text recognition.

Sample pseudocode demonstrates explicit tool invocation within the reasoning trajectory:

<think> 1. Need the distance between points A and B.
2. Call calculator: {"tool":"python","code":"import math; print(math.dist(A,B))"} </think>
→ tool_output: 5.2
<think> 3. Use 5.2 in subsequent angle calculation. </think>
This suggests that the evidence grounding provided by tool calls is central to reducing hallucinated evaluations and increasing the reliability of final answers.
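
One plausible way to realize the ExecuteTool routine from the earlier pseudocode is a dispatcher that routes parsed tool calls to sandboxed handlers, as sketched below. The tool names (image_op, vision_api) and handler functions are illustrative placeholders, not the paper's actual interface; only the python route is fully implemented here.

import json
import subprocess
import sys

def execute_tool(call_json: str, timeout_s: float = 5.0) -> str:
    """Route a structured tool call (as emitted inside <think> blocks) to a handler."""
    call = json.loads(call_json)
    tool = call.get("tool")
    if tool == "python":
        # Run the snippet in a separate interpreter process as a crude sandbox.
        proc = subprocess.run([sys.executable, "-c", call["code"]],
                              capture_output=True, text=True, timeout=timeout_s)
        return proc.stdout.strip() or proc.stderr.strip()
    if tool == "image_op":
        return run_image_op(call)      # hypothetical helper: crop / zoom / OCR
    if tool == "vision_api":
        return run_vision_api(call)    # hypothetical helper: detection / scene text
    return f"unknown tool: {tool}"

def run_image_op(call: dict) -> str:
    raise NotImplementedError("placeholder for crop/zoom/OCR handlers")

def run_vision_api(call: dict) -> str:
    raise NotImplementedError("placeholder for detection/scene-text handlers")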

5. Self-Evaluation and Confidence-Gated Repair

Agent0-VL orchestrates self-evaluation through the Verifier, which, at each step $t$, emits structured feedback

$$V_t = (\mathrm{score}_t, \mathrm{conf}_t, \mathrm{critique}_t).$$

If $\mathrm{conf}_t < \tau_c$, the repair module computes a patch $\Delta_t$ and instructs the Solver to re-sample the step, operationalized as

$$a'_t \sim \pi_\theta\bigl(\cdot \mid s_t, \Delta_t, m=\mathrm{S}\bigr).$$

The repair cost $C_{\rm repair}^{(t)}$ is subtracted from the process reward, implicitly encouraging high-confidence, evidence-backed reasoning sequences.
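
A minimal sketch of this confidence gate follows: when conf_t falls below the threshold, a patch hint derived from the critique is fed back to the Solver for resampling. The threshold value, retry budget, and patch-hint format are assumptions for illustration.

def verify_and_repair(solver_step, verifier_step, execute_tool, state,
                      action, observation, tau_c: float = 0.7, max_retries: int = 2):
    """Confidence-gated repair: resample low-confidence steps with a patch hint."""
    score, conf, critique = verifier_step(state, action, observation)
    retries = 0
    while conf < tau_c and retries < max_retries:
        # Delta_t: condition the Solver on the critique so it can patch the step.
        patch_hint = {"previous_action": action, "critique": critique}
        action = solver_step(state, patch_hint)
        observation = execute_tool(action) if isinstance(action, dict) and "tool" in action else None
        score, conf, critique = verifier_step(state, action, observation)
        retries += 1
    repaired = retries > 0              # feeds the g_t gate in the process reward
    return action, observation, (score, conf, critique), repaired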

6. Training Paradigm and Data Regimen

Agent0-VL employs a two-stage curriculum:

  • Stage 1 (SFT warm-up): Supervised fine-tuning on approximately 200,000 tool-augmented instances from datasets including Geometry3K, GeoQA, Mulberry, MM-Eureka, and Retool.
  • Stage 2 (RL self-evolution): One epoch of GRPO with internal process rewards, using 40,000 additional multi-turn problems from MathVerse, MathVista, WeMath, and ChartQA.

Optimization is performed strictly via self-evolved trajectories, with no external signals during RL.
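
For orientation, the two-stage regimen could be expressed as a configuration like the one below. The corpus sizes and dataset names come from the description above; the field names and overall structure are illustrative assumptions rather than the paper's actual configuration format.

# Hypothetical curriculum configuration mirroring the two-stage recipe above.
CURRICULUM = {
    "stage1_sft_warmup": {
        "objective": "supervised fine-tuning on tool-augmented traces",
        "num_instances": 200_000,
        "datasets": ["Geometry3K", "GeoQA", "Mulberry", "MM-Eureka", "Retool"],
    },
    "stage2_rl_self_evolution": {
        "objective": "GRPO with internal process rewards only (zero external reward)",
        "epochs": 1,
        "num_instances": 40_000,
        "datasets": ["MathVerse", "MathVista", "WeMath", "ChartQA"],
    },
}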

7. Experimental Results and Component Analysis

Empirically, Agent0-VL-7B outperforms Qwen2.5-VL-7B across mathematical and visual reasoning benchmarks:

Model           Math    HallBench   ChartQA   MMMU    Avg.
Qwen2.5-VL-7B   58.3    65.0        83.5      50.6    57.3
Agent0-VL-7B    65.6    72.9        87.3      61.1    65.5
Δ               +7.3    +7.9        +3.8      +10.5   +8.2

On MathVista and WeMath, gains reach +18.1%. Iterative self-evolution adds a further, monotonic +8.2% improvement across evolution cycles. Ablation studies confirm that removing SERC degrades overall performance by 6.6%, omitting tool use incurs a 5.1% drop, and removing self-repair forfeits a 2.6% gain.

8. Limitations and Prospective Directions

Current reliance on sandboxed tool execution may introduce latency constraints in real-world deployments. Proposed future directions include integration with richer tool suites (such as 3D simulators), hierarchically stacked verification modules, and curriculum learning that blends external and internal rewards.

A plausible implication is that the modular self-evolving cycle and dense self-evaluation could generalize to broader multimodal agent systems, provided robust tool APIs and reliable verifier parameterization. The methodology points toward scalable, annotation-free agent improvement paradigms in vision-language reasoning (Liu et al., 25 Nov 2025).

References

Liu et al. (25 Nov 2025). Agent0-VL: Self-Evolving Vision-Language Agent.