
Agent0-VL: Self-Evolving Vision-Language Agent

Updated 26 November 2025
  • Agent0-VL is a self-evolving vision-language agent that integrates tool-based reasoning, verification, and self-repair to handle complex multimodal tasks.
  • It alternates between Solver and Verifier roles, using structured tool calls and evidence-grounded self-evaluation to iteratively enhance performance.
  • Experimental results demonstrate significant gains over baseline models, with metrics improving up to +18.1% on challenging visual and mathematical benchmarks.

Agent0-VL is a self-evolving vision-language agent for complex multimodal reasoning tasks that unifies tool-integrated reasoning, verification, and repair within a single large vision-language model (LVLM). Rather than relying on human-annotated supervision or external reward models, Agent0-VL employs structured, evidence-grounded self-evaluation and reinforcement learning under zero external reward, enabling continual self-improvement in domains such as geometric problem solving and scientific visual analysis. This architecture yields substantial gains over baseline models on challenging benchmarks, notably a 12.5% improvement over its Qwen2.5-VL-7B base across seven standard datasets (Liu et al., 25 Nov 2025).

1. Conceptual Foundation and Key Contributions

Agent0-VL introduces a unified agent architecture in which a single LVLM alternates between two internally synergistic roles: Solver and Verifier. The Solver executes multi-turn tool-integrated reasoning, issuing structured tool calls (e.g., Python scripts, image operations, vision API queries) and incorporating their results. The Verifier subsequently performs dense self-evaluation through tool-grounded feedback tuples $(\mathrm{score}_t, \mathrm{conf}_t, \mathrm{critique}_t)$. Critically, Agent0-VL implements a confidence-gated self-repair module that detects low-confidence reasoning segments and actively resamples, patching erroneous steps based on evidence.

Through the Self-Evolving Reasoning Cycle (SERC), Agent0-VL synchronizes its reasoning and verification distributions by jointly optimizing both roles with Group Relative Policy Optimization (GRPO) and internal process rewards. This design enables zero-external-reward evolution and self-alignment without dependence on human annotation or external reward models. Agent0-VL’s Verifier can also function as a stand-alone process reward model, enhancing best-of-$n$ selection strategies on sampled outputs.
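
Because the Verifier emits dense, step-level feedback, it can score sampled trajectories directly. The following Python sketch illustrates best-of-$n$ selection driven by such scores; it is a minimal illustration, and the sample_trajectory and verify_step interfaces are hypothetical stand-ins rather than the paper's actual API.

# Hedged sketch: using the Verifier as a stand-alone process reward model
# to rank n sampled Solver trajectories (best-of-n selection).
from typing import Callable, List, Tuple

def best_of_n(
    sample_trajectory: Callable[[], Tuple[str, List[dict]]],  # hypothetical: returns (answer, steps)
    verify_step: Callable[[dict], Tuple[float, float]],       # hypothetical: returns (score_t, conf_t)
    n: int = 8,
) -> str:
    """Keep the sampled answer whose confidence-weighted step scores are highest."""
    best_answer, best_value = None, float("-inf")
    for _ in range(n):
        answer, steps = sample_trajectory()
        if not steps:
            continue
        # Aggregate the dense feedback as the mean of score_t * conf_t over steps.
        value = sum(s * c for s, c in map(verify_step, steps)) / len(steps)
        if value > best_value:
            best_answer, best_value = answer, value
    return best_answer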

2. Model Architecture and Workflow

Agent0-VL is parameterized by $\theta$ and alternates between Solver ($m=\mathrm{S}$) and Verifier ($m=\mathrm{V}$) operational modes. The unified policy is defined as

$$\pi_\theta(a_t \mid s_t, m) = \begin{cases} \pi_\theta^{\mathrm{S}}(a_t \mid s_t), & m = \mathrm{S} \\ \pi_\theta^{\mathrm{V}}(V_t \mid s_t, a_t, o_t), & m = \mathrm{V} \end{cases}$$

where, in Verifier mode, the policy conditions on the Solver's action $a_t$ and the tool observation $o_t$ to emit the feedback $V_t$.

Each reasoning episode comprises a trajectory of multi-turn reasoning, tool call execution, observation ingestion, verification, confidence assessment, and, when warranted, self-repair. The following pseudocode formalizes the dual-policy flow:

Algorithm UnifiedSolverVerifier(θ):
  input: instance x = (I, q)
  s₁ ← Encode(I, q)
  for t = 1 … T do
    # Solver turn
    aₜ ~ π_θ^S(·|sₜ)
    if aₜ is a tool call then
      oₜ ← ExecuteTool(aₜ)
    else
      oₜ ← null
    end
    # Verifier turn: yields Vₜ = (scoreₜ, confₜ, critiqueₜ)
    Vₜ ← π_θ^V(·|sₜ, aₜ, oₜ)
    sₜ₊₁ ← UpdateState(sₜ, aₜ, oₜ, Vₜ)
  end
  return trajectory τ = {(sₜ, aₜ, oₜ, Vₜ)} for t = 1…T

This architecture supports introspection and reject/resample workflows, converting initial errors into correct final answers through iterative refinement; a minimal runnable counterpart of the loop is sketched below.
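
The following Python sketch mirrors the dual-policy flow above. It is a minimal illustration, assuming stand-in interfaces (solver_step, verifier_step, execute_tool, encode, update_state) that the paper does not specify, and hypothetical predicates for detecting tool calls and final answers.

from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Step:
    state: Any
    action: Any                        # Solver output: free-form reasoning or a structured tool call
    observation: Any                   # tool output, or None for pure reasoning steps
    feedback: Optional[tuple] = None   # (score_t, conf_t, critique_t) from the Verifier

def is_tool_call(action) -> bool:
    # Placeholder predicate: treat dicts with a "tool" key as structured tool calls.
    return isinstance(action, dict) and "tool" in action

def is_final_answer(action) -> bool:
    # Placeholder predicate: treat strings containing an answer tag as terminal.
    return isinstance(action, str) and "<answer>" in action

def run_episode(solver_step, verifier_step, execute_tool, encode, update_state,
                image, question, max_turns: int = 8) -> List[Step]:
    """Alternate Solver and Verifier roles over a single reasoning episode."""
    state = encode(image, question)
    trajectory: List[Step] = []
    for _ in range(max_turns):
        # Solver turn: propose the next action (possibly a tool call).
        action = solver_step(state)
        observation = execute_tool(action) if is_tool_call(action) else None
        # Verifier turn: grade the step with evidence-grounded feedback.
        score, conf, critique = verifier_step(state, action, observation)
        trajectory.append(Step(state, action, observation, (score, conf, critique)))
        state = update_state(state, action, observation, (score, conf, critique))
        if is_final_answer(action):
            break
    return trajectory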

3. Self-Evolving Reasoning Cycle (SERC) and Reinforcement Learning

SERC consists of two loops:

  • Inner loop: Alternates generation (Solver), verification (Verifier), and, based on feedback confidence, selective self-repair.
  • Outer loop: Updates policies via GRPO, guided strictly by internal process rewards.

The RL objective optimizes returns based only on evidence-grounded feedback:

$$J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[g(\tau)\big], \qquad g(\tau)=\alpha_{\rm out}\,r_{\rm out}+\sum_{t=1}^{T}\gamma^{t-1} r_t,$$

where the step reward $r_t$ is decomposed as

$$r_t = r^{(t)}_{\rm proc} - g_t \cdot C_{\rm repair}^{(t)}$$

with

$$r^{(t)}_{\rm proc} = \lambda_{\rm tool}\, r(\mathrm{tool}_t) + \mathrm{score}_t \cdot \mathrm{conf}_t - \beta_{\rm div}\, D_{\rm KL}\!\bigl(\pi_\theta^{\rm V} \,\Vert\, \pi_\theta^{\rm S}\bigr).$$

A confidence-gating mechanism triggers self-repair when $\mathrm{conf}_t < \tau_c$, enforcing step-level consistency and process reliability.
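
As a worked illustration of the reward decomposition above, the sketch below computes $r_t$ and $g(\tau)$ from their components. The hyperparameter values, the repair-cost constant, and the treatment of the KL term as a precomputed scalar are assumptions made for illustration only.

def step_reward(tool_success: float, score_t: float, conf_t: float,
                kl_verifier_solver: float, repaired: bool,
                lambda_tool: float = 0.5,      # assumed value for lambda_tool
                beta_div: float = 0.01,        # assumed value for beta_div
                repair_cost: float = 0.2) -> float:
    """r_t = r_proc^(t) - g_t * C_repair^(t), with
    r_proc^(t) = lambda_tool * r(tool_t) + score_t * conf_t - beta_div * KL(pi_V || pi_S)."""
    r_proc = lambda_tool * tool_success + score_t * conf_t - beta_div * kl_verifier_solver
    gate = 1.0 if repaired else 0.0    # g_t: 1 when the repair module fired at step t
    return r_proc - gate * repair_cost

def trajectory_return(step_rewards, r_out: float,
                      alpha_out: float = 1.0, gamma: float = 0.99) -> float:
    """g(tau) = alpha_out * r_out + sum_{t=1..T} gamma^(t-1) * r_t."""
    return alpha_out * r_out + sum(gamma ** t * r for t, r in enumerate(step_rewards))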

GRPO is formulated as

$$\mathcal{L}_{\rm EDLP} = -\mathbb{E}_i\Big[\min\bigl(\rho_i \hat A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat A_i\bigr)\Big] + \beta_{\rm KL}\, D_{\rm KL}\bigl(\pi_\theta \,\Vert\, \pi_{\theta_{\rm old}}\bigr),$$

with group-normalized advantages $\hat A_i$ and policy-ratio weighting $\rho_i$ ensuring stable improvement.
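
A minimal PyTorch-style sketch of this clipped, group-relative objective follows. The grouping convention (one group of G rollouts per prompt), the hyperparameter values, and the simple sample-based KL estimate are assumptions, not the exact implementation used in the paper.

import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, eps: float = 0.2,
              beta_kl: float = 0.01) -> torch.Tensor:
    """Clipped surrogate with group-normalized advantages.

    logp_new, logp_old: [G] sequence log-probs for one group of G rollouts
    rewards:            [G] scalar returns g(tau) for the same group
    """
    # Group-relative advantage: normalize returns within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Crude sample-based KL penalty toward the old policy.
    kl = (logp_old.detach() - logp_new).mean()
    return -surrogate.mean() + beta_kl * kl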

4. Tool-Integrated Multimodal Reasoning

At each Solver step, Agent0-VL interacts with external tools, integrating their outputs directly into the reasoning process. Supported tool modalities include:

  • Python execution: Arithmetic, algebraic, and geometric computation.
  • Image operations: Cropping, zooming, and OCR.
  • Vision APIs: Object detection and scene text recognition.

Sample pseudocode demonstrates explicit tool invocation within the reasoning trajectory:

<think> 1. Need the distance between points A and B.
2. Call calculator: {"tool":"python","code":"import math; print(math.dist(A,B))"} </think>
→ tool_output: 5.2
<think> 3. Use 5.2 in subsequent angle calculation. </think>
This suggests that the evidence grounding provided by tool calls is central to reducing hallucinated evaluations and increasing the reliability of final answers.
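
One plausible way to realize the ExecuteTool routine from the earlier pseudocode is a dispatcher that routes parsed tool calls to sandboxed handlers, as sketched below. The tool names (image_op, vision_api) and handler functions are illustrative placeholders, not the paper's actual interface; only the python route is fully implemented here.

import json
import subprocess
import sys

def execute_tool(call_json: str, timeout_s: float = 5.0) -> str:
    """Route a structured tool call (as emitted inside <think> blocks) to a handler."""
    call = json.loads(call_json)
    tool = call.get("tool")
    if tool == "python":
        # Run the snippet in a separate interpreter process as a crude sandbox.
        proc = subprocess.run([sys.executable, "-c", call["code"]],
                              capture_output=True, text=True, timeout=timeout_s)
        return proc.stdout.strip() or proc.stderr.strip()
    if tool == "image_op":
        return run_image_op(call)      # hypothetical helper: crop / zoom / OCR
    if tool == "vision_api":
        return run_vision_api(call)    # hypothetical helper: detection / scene text
    return f"unknown tool: {tool}"

def run_image_op(call: dict) -> str:
    raise NotImplementedError("placeholder for crop/zoom/OCR handlers")

def run_vision_api(call: dict) -> str:
    raise NotImplementedError("placeholder for detection/scene-text handlers")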

5. Self-Evaluation and Confidence-Gated Repair

Agent0-VL orchestrates self-evaluation through the Verifier, which, at each step $t$, emits structured feedback

$$V_t = (\mathrm{score}_t, \mathrm{conf}_t, \mathrm{critique}_t).$$

If $\mathrm{conf}_t < \tau_c$, the repair module computes a patch $\Delta_t$ and instructs the Solver to re-sample the step, operationalized as

$$a'_t \sim \pi_\theta\bigl(\cdot \mid s_t, \Delta_t, m=\mathrm{S}\bigr).$$

The repair cost $C_{\rm repair}^{(t)}$ is subtracted from the process reward, implicitly encouraging high-confidence, evidence-backed reasoning sequences.
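
A minimal sketch of this confidence gate follows: when conf_t falls below the threshold, a patch hint derived from the critique is fed back to the Solver for resampling. The threshold value, retry budget, and patch-hint format are assumptions for illustration.

def verify_and_repair(solver_step, verifier_step, execute_tool, state,
                      action, observation, tau_c: float = 0.7, max_retries: int = 2):
    """Confidence-gated repair: resample low-confidence steps with a patch hint."""
    score, conf, critique = verifier_step(state, action, observation)
    retries = 0
    while conf < tau_c and retries < max_retries:
        # Delta_t: condition the Solver on the critique so it can patch the step.
        patch_hint = {"previous_action": action, "critique": critique}
        action = solver_step(state, patch_hint)
        observation = execute_tool(action) if isinstance(action, dict) and "tool" in action else None
        score, conf, critique = verifier_step(state, action, observation)
        retries += 1
    repaired = retries > 0              # feeds the g_t gate in the process reward
    return action, observation, (score, conf, critique), repaired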

6. Training Paradigm and Data Regimen

Agent0-VL employs a two-stage curriculum:

  • Stage 1 (SFT warm-up): Supervised fine-tuning on approximately 200,000 tool-augmented instances from datasets including Geometry3K, GeoQA, Mulberry, MM-Eureka, and Retool.
  • Stage 2 (RL self-evolution): One epoch of GRPO with internal process rewards, using 40,000 additional multi-turn problems from MathVerse, MathVista, WeMath, and ChartQA.

Optimization is performed strictly via self-evolved trajectories, with no external signals during RL.
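
For orientation, the two-stage regimen could be expressed as a configuration like the one below. The corpus sizes and dataset names come from the description above; the field names and overall structure are illustrative assumptions rather than the paper's actual configuration format.

# Hypothetical curriculum configuration mirroring the two-stage recipe above.
CURRICULUM = {
    "stage1_sft_warmup": {
        "objective": "supervised fine-tuning on tool-augmented traces",
        "num_instances": 200_000,
        "datasets": ["Geometry3K", "GeoQA", "Mulberry", "MM-Eureka", "Retool"],
    },
    "stage2_rl_self_evolution": {
        "objective": "GRPO with internal process rewards only (zero external reward)",
        "epochs": 1,
        "num_instances": 40_000,
        "datasets": ["MathVerse", "MathVista", "WeMath", "ChartQA"],
    },
}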

7. Experimental Results and Component Analysis

Empirically, Agent0-VL-7B outperforms Qwen2.5-VL-7B across mathematical and visual reasoning benchmarks:

Model           Math    HallBench   ChartQA   MMMU    Avg.
Qwen2.5-VL-7B   58.3    65.0        83.5      50.6    57.3
Agent0-VL-7B    65.6    72.9        87.3      61.1    65.5
Δ               +7.3    +7.9        +3.8      +10.5   +8.2

On MathVista and WeMath, gains reach +18.1%. Iterative self-evolution adds a further, monotonic +8.2% improvement across evolution cycles. Ablation studies confirm that removing SERC degrades overall performance by 6.6%, omitting tool use incurs a 5.1% drop, and removing self-repair forfeits a 2.6% gain.

8. Limitations and Prospective Directions

Current reliance on sandboxed tool execution may introduce latency constraints in real-world deployments. Proposed future directions include integration with richer tool suites (such as 3D simulators), hierarchically stacked verification modules, and curriculum learning that blends external and internal rewards.

A plausible implication is that the modular self-evolving cycle and dense self-evaluation could generalize to broader multimodal agent systems, provided robust tool APIs and reliable verifier parameterization. The methodology points toward scalable, annotation-free agent improvement paradigms in vision-language reasoning (Liu et al., 25 Nov 2025).

References

Liu et al. (25 Nov 2025). Agent0-VL: Self-Evolving Vision-Language Agent.