Agent0-VL: Self-Evolving Vision-Language Agent
- Agent0-VL is a self-evolving vision-language agent that integrates tool-based reasoning, verification, and self-repair to handle complex multimodal tasks.
- It alternates between Solver and Verifier roles, using structured tool calls and evidence-grounded self-evaluation to iteratively enhance performance.
- Experimental results demonstrate substantial gains over the base model, with improvements of up to +18.1% on challenging visual and mathematical reasoning benchmarks.
Agent0-VL is a self-evolving vision-language agent for complex multimodal reasoning that unifies tool-integrated reasoning, verification, and repair within a single large vision-language model (LVLM). Rather than relying on human-annotated supervision or external reward models, Agent0-VL employs structured, evidence-grounded self-evaluation and reinforcement learning under zero external reward, enabling continual self-improvement in domains such as geometric problem solving and scientific visual analysis. This architecture yields substantial performance gains over baseline models on challenging benchmarks, notably a 12.5% improvement over its Qwen2.5-VL-7B base across seven standard datasets (Liu et al., 25 Nov 2025).
1. Conceptual Foundation and Key Contributions
Agent0-VL introduces a unified agent architecture in which a single LVLM alternates between two internally synergistic roles: Solver and Verifier. The Solver executes multi-turn tool-integrated reasoning, issuing structured tool calls (e.g., Python scripts, image operations, vision API queries) and incorporating their results. The Verifier subsequently performs dense self-evaluation through tool-grounded feedback tuples, composed as $(\text{score}_t, \text{conf}_t, \text{critique}_t)$. Critically, Agent0-VL implements a confidence-gated self-repair module that detects low-confidence reasoning segments and actively resamples, patching erroneous steps based on evidence.
Through the Self-Evolving Reasoning Cycle (SERC), Agent0-VL synchronizes its reasoning and verification distributions by jointly optimizing both roles with Group Relative Policy Optimization (GRPO) and internal process rewards. This design enables zero-external-reward evolution and self-alignment without dependence on human annotation or external reward models. Agent0-VL’s Verifier can also function as a stand-alone process reward model, enhancing best-of-$N$ selection strategies on sampled outputs.
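As a concrete illustration of this stand-alone process-reward use, the sketch below ranks $N$ sampled trajectories by confidence-weighted Verifier scores. The `StepFeedback` container and the `verify_step` hook are illustrative names under assumed interfaces, not the paper's released API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepFeedback:
    """Tool-grounded feedback tuple emitted by the Verifier."""
    score: float     # step-level correctness estimate in [0, 1]
    conf: float      # Verifier's confidence in its own judgment
    critique: str    # evidence-grounded natural-language critique

def best_of_n(candidates: List[list], verify_step: Callable[[dict], StepFeedback]) -> list:
    """Rank N sampled trajectories by aggregate Verifier score and return the best one."""
    def trajectory_score(traj: list) -> float:
        feedback = [verify_step(step) for step in traj]
        # Confidence-weighted mean of step scores serves as the trajectory's process score.
        return sum(f.score * f.conf for f in feedback) / max(len(feedback), 1)
    return max(candidates, key=trajectory_score)
```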
2. Model Architecture and Workflow
Agent0-VL is parameterized by a single set of weights $\theta$ and alternates between Solver ($\pi_\theta^S$) and Verifier ($\pi_\theta^V$) operational modes. The unified policy is defined as:

$$\pi_\theta = \big\{\, \pi_\theta^S(a_t \mid s_t),\ \pi_\theta^V(V_t \mid s_t, a_t, o_t) \,\big\},$$

where $s_t$ is the reasoning state, $a_t$ the Solver action (free-form reasoning or a structured tool call), $o_t$ the tool observation, and $V_t$ the Verifier's feedback tuple.
Each reasoning episode comprises a trajectory of multi-turn reasoning, tool call execution, observation ingestion, verification, confidence assessment, and, when warranted, self-repair. The following pseudocode formalizes the dual-policy flow:
```
Algorithm UnifiedSolverVerifier(θ):
    input: instance x = (I, q)
    s₁ ← Encode(I, q)
    mode ← Solver
    for t = 1…T do
        if mode == Solver then
            aₜ ~ π_θ^S(·|sₜ)                  # propose reasoning step or structured tool call
            if aₜ is tool-call then
                oₜ ← ExecuteTool(aₜ)
            else
                oₜ ← null
            end
            Vₜ ← null
            mode ← Verifier
        else  # Verifier
            Vₜ ← π_θ^V(·|sₜ, aₜ, oₜ)          # yields (scoreₜ, confₜ, critiqueₜ)
            mode ← Solver
        end
        sₜ₊₁ ← UpdateState(sₜ, aₜ, oₜ, Vₜ)
    end
    return trajectory τ = {(sₜ, aₜ, oₜ, Vₜ)}ₜ
```
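For concreteness, the same control flow can be written as a runnable Python sketch with the policy and tool calls stubbed out; `solver_policy`, `verifier_policy`, and `execute_tool` below are placeholder hooks rather than the released implementation.

```python
def run_episode(image, question, solver_policy, verifier_policy, execute_tool, max_turns=8):
    """Alternate Solver and Verifier roles over a shared state, mirroring the pseudocode above."""
    state = {"image": image, "question": question, "history": []}
    trajectory = []
    for _ in range(max_turns):
        # Solver turn: propose an action, optionally invoking a tool.
        action = solver_policy(state)
        observation = execute_tool(action) if action.get("tool") else None
        # Verifier turn: score the (state, action, observation) triple.
        feedback = verifier_policy(state, action, observation)  # (score, conf, critique)
        trajectory.append((dict(state), action, observation, feedback))
        state["history"].append((action, observation, feedback))
        if action.get("final_answer") is not None:
            break
    return trajectory
```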
3. Self-Evolving Reasoning Cycle (SERC) and Reinforcement Learning
SERC consists of two loops:
- Inner loop: Alternates generation (Solver), verification (Verifier), and, based on feedback confidence, selective self-repair.
- Outer loop: Updates policies via GRPO, guided strictly by internal process rewards.
The RL objective optimizes returns based only on evidence-grounded internal feedback:

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_{t} r_t\Big],$$

where the step reward $r_t$ is decomposed as

$$r_t = r_t^{\text{process}} - c_t^{\text{repair}},$$

with $r_t^{\text{process}}$ derived from the Verifier's tool-grounded feedback $(\text{score}_t, \text{conf}_t, \text{critique}_t)$ and $c_t^{\text{repair}}$ the cost charged for confidence-gated repairs.
A confidence-gating mechanism triggers self-repair when $\text{conf}_t < \tau$, enforcing step-level consistency and process reliability.
GRPO is formulated as:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\Big)\right], \qquad \rho_i = \frac{\pi_\theta(\tau_i)}{\pi_{\theta_{\text{old}}}(\tau_i)},$$

with group-normalized advantage $\hat{A}_i = \big(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$ and policy-ratio weighting ensuring stable improvement.
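The group-normalized advantage and clipped policy-ratio weighting can be sketched as follows; this mirrors the standard GRPO recipe with illustrative function names, not the paper's training code.

```python
import numpy as np

def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantages: normalize each trajectory's return against its group."""
    r = np.asarray(group_returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def grpo_step_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped policy-ratio loss for a single step, matching the objective above."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -min(unclipped, clipped)  # negate for gradient-descent minimization
```

For example, `grpo_advantages([0.8, 0.2, 0.5, 0.9])` rewards the two best trajectories in the group with positive advantages and penalizes the others, without requiring any external value model.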
4. Tool-Integrated Multimodal Reasoning
At each Solver step, Agent0-VL interacts with external tools, integrating their outputs directly into the reasoning process. Supported tool modalities include:
- Python execution: Arithmetic, algebraic, and geometric computation.
- Image operations: Cropping, zooming, and OCR.
- Vision APIs: Object detection and scene text recognition.
Sample pseudocode demonstrates explicit tool invocation within the reasoning trajectory:
```
<think> 1. Need the distance between points A and B.
        2. Call calculator: {"tool": "python", "code": "import math; print(math.dist(A, B))"} </think>
→ tool_output: 5.2
<think> 3. Use 5.2 in the subsequent angle calculation. </think>
```
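The `ExecuteTool` step referenced above can be sketched as a small dispatcher over the three tool modalities. The JSON schema and the handler hooks (`image_op`, `vision_api`) are assumptions for illustration, and a real deployment would sandbox the Python branch rather than calling `exec` directly.

```python
import contextlib
import io
import json

def execute_python(code: str) -> str:
    """Run a Python snippet and capture its stdout (no real sandboxing in this sketch)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def execute_tool(call_json: str, image_op=None, vision_api=None) -> str:
    """Dispatch a structured call like {"tool": "python", "code": "..."} and
    return its textual output for the next reasoning turn."""
    call = json.loads(call_json)
    if call["tool"] == "python":
        return execute_python(call["code"])
    if call["tool"] == "image_op" and image_op is not None:
        return image_op(call["args"])        # crop / zoom / OCR backend
    if call["tool"] == "vision_api" and vision_api is not None:
        return vision_api(call["args"])      # detection / scene-text service
    raise ValueError(f"unsupported tool call: {call['tool']}")
```

For instance, `execute_tool('{"tool": "python", "code": "import math; print(math.dist((0, 0), (3, 4)))"}')` returns `5.0`, which would be fed back into the trajectory as `tool_output`.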
5. Self-Evaluation and Confidence-Gated Repair
Agent0-VL orchestrates self-evaluation through the Verifier, which, at each step $t$, emits the structured feedback $V_t = (\text{score}_t, \text{conf}_t, \text{critique}_t)$. If $\text{conf}_t < \tau$, the repair module computes a patch and instructs the Solver to re-sample the step, operationalized as $a_t' \sim \pi_\theta^S(\cdot \mid s_t, \text{critique}_t)$. The repair cost is subtracted from the process reward, implicitly encouraging high-confidence, evidence-backed reasoning sequences.
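A minimal sketch of this confidence-gated repair loop is shown below, assuming an illustrative threshold `TAU` and a fixed per-repair cost (hypothetical values, not the paper's hyperparameters); `solver_policy` and `verifier_policy` are the same placeholder hooks as in the earlier sketch.

```python
TAU = 0.7            # confidence gate (illustrative value)
LAMBDA_REPAIR = 0.1  # per-repair cost subtracted from the process reward (illustrative)

def verify_and_repair(state, action, observation, solver_policy, verifier_policy,
                      max_repairs=2):
    """Re-sample a step whenever Verifier confidence falls below the gate,
    conditioning the Solver on the critique and charging a repair cost."""
    repair_cost = 0.0
    score, conf, critique = verifier_policy(state, action, observation)
    while conf < TAU and max_repairs > 0:
        action = solver_policy(state, critique=critique)  # patched re-sample
        observation = None  # a new tool call would be re-executed here
        score, conf, critique = verifier_policy(state, action, observation)
        repair_cost += LAMBDA_REPAIR
        max_repairs -= 1
    process_reward = score - repair_cost
    return action, observation, (score, conf, critique), process_reward
```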
6. Training Paradigm and Data Regimen
Agent0-VL employs a two-stage curriculum:
- Stage 1 (SFT warm-up): Supervised fine-tuning on 200,000 tool-augmented instances from datasets including Geometry3K, GeoQA, Mulberry, MM-Eureka, and Retool.
- Stage 2 (RL self-evolution): One epoch of GRPO with internal process rewards, using 40,000 additional multi-turn problems from MathVerse, MathVista, WeMath, and ChartQA.
Optimization is performed strictly via self-evolved trajectories, with no external signals during RL.
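The two-stage regimen can be summarized as a compact configuration sketch; the keys and structure below are illustrative rather than the released training configuration.

```python
TRAINING_CURRICULUM = {
    "stage1_sft": {
        "objective": "supervised fine-tuning on tool-augmented traces",
        "num_instances": 200_000,
        "datasets": ["Geometry3K", "GeoQA", "Mulberry", "MM-Eureka", "Retool"],
    },
    "stage2_rl": {
        "objective": "GRPO with internal process rewards (zero external reward)",
        "epochs": 1,
        "num_problems": 40_000,
        "datasets": ["MathVerse", "MathVista", "WeMath", "ChartQA"],
    },
}
```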
7. Experimental Results and Component Analysis
Empirically, Agent0-VL-7B outperforms Qwen2.5-VL-7B across mathematical and visual reasoning benchmarks:
| Model | Math | HallBench | ChartQA | MMMU | Avg. |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 58.3 | 65.0 | 83.5 | 50.6 | 57.3 |
| Agent0-VL-7B | 65.6 | 72.9 | 87.3 | 61.1 | 65.5 |
| Δ | +7.3 | +7.9 | +3.8 | +10.5 | +8.2 |
On MathVista and WeMath, gains reach +18.1%. Iterative self-evolution contributes a further monotonic +8.2% improvement across cycles. Ablation studies confirm that removing SERC degrades overall performance by 6.6%, omitting tool use costs 5.1%, and removing self-repair forfeits a 2.6% uplift.
8. Limitations and Prospective Directions
Current reliance on sandboxed tool execution may introduce latency constraints in real-world deployments. Proposed future directions include integration with richer tool suites (such as 3D simulators), hierarchically stacked verification modules, and curriculum learning that blends external and internal rewards.
A plausible implication is that the modular self-evolving cycle and dense self-evaluation could generalize to broader multimodal agent systems, provided robust tool APIs and reliable verifier parameterization. The methodology points toward scalable, annotation-free agent improvement paradigms in vision-language reasoning (Liu et al., 25 Nov 2025).