
DeepEyesV2: Agentic & Medical Imaging AI

Updated 10 November 2025
  • DeepEyesV2 denotes two related AI systems: an agentic multimodal reasoning model with dynamic tool use, and a mobile-friendly eyelid measurement pipeline built on a vision transformer backbone.
  • The agentic model uses a two-stage training pipeline (supervised bootstrapping followed by reinforcement learning) to learn when to invoke tools and how to use them effectively.
  • The eyelid measurement variant processes images in under 50 ms on smartphones, achieving high accuracy through binary encoding and specialized regularization techniques.

DeepEyesV2 refers to two distinct yet related AI systems: (1) an agentic multimodal reasoning model that integrates visual and linguistic understanding with dynamic tool-use within an iterative reasoning framework, and (2) a robust, mobile-friendly system for automated eyelid measurement leveraging a frozen DINOv2 Vision Transformer backbone with specialized encoding and regularization strategies. Each system addresses challenges in integrating perceptual, analytical, and operational capabilities, either for general real-world AI reasoning or for domain-specific medical image analysis.

1. Agentic Multimodal Model Architecture

DeepEyesV2 introduces an agentic multimodal LLM (MLLM) architecture built on Qwen2.5-VL-7B, which couples a vision encoder with an LLM backbone. Key architectural components include:

  • Tool-Invocation Module: Embedded within the decoder, this module lets the model emit Python code snippets (for image processing, analysis, and marking) or search API calls (text/image queries via SerpAPI), enabling it to “act” on observations and augment its reasoning through interaction with an external environment.
  • Sandboxed Execution Environment: Generated code or API calls are executed in a controlled runtime, with structured outputs (such as image crops, numeric results, plots) fed back into the model context as new observations.
  • Iterative Reasoning Loop: As detailed in the model schematic, the operational cycle follows: input (image + query) → planning → optional tool invocation → execution → observation → context update → (potential further iterations) → final answer. This loop unifies perception, tool use, and language-based reasoning.

The agentic design allows DeepEyesV2 to select and compose tool-use actions dynamically as demanded by task context, setting it apart from static inference models.
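
The loop can be sketched in a few lines of Python. This is a minimal illustration of the cycle described above, not the paper's implementation; `generate_step`, `run_in_sandbox`, and `call_search_api` are hypothetical helper names.

```python
# Minimal sketch of the iterative reasoning loop: plan, optionally call a tool,
# execute it in a sandbox, fold the observation back into context, and repeat
# until a final answer is produced. Helper names are hypothetical placeholders.

MAX_ITERATIONS = 8

def answer_query(model, image, query):
    context = [{"role": "user", "image": image, "text": query}]
    for _ in range(MAX_ITERATIONS):
        step = model.generate_step(context)            # plan + optional tool call
        if step.kind == "code":                        # Python snippet (crop, mark, analyze)
            observation = run_in_sandbox(step.payload)
        elif step.kind == "search":                    # text/image query via a search API
            observation = call_search_api(step.payload)
        else:                                          # plain text: treat as the final answer
            return step.payload
        # Structured tool output re-enters the context as a new observation.
        context.append({"role": "tool", "content": observation})
    return model.finalize(context)                     # answer directly if iterations run out
```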

2. Two-Stage Training Pipeline

2.1. Cold-Start Supervised Stage

  • Supervised Bootstrapping: To induce robust, syntactically correct tool-use behavior, training begins with a curated dataset comprising explicit tool calls and error-free code, collected across perception, reasoning, and search tasks.
  • Selection Criteria: Only examples where the base model fails in ≥75% of attempts but tool use enables success are retained. Data are sourced and synthesized from Gemini 2.5 Pro, GPT-4o, and Claude Sonnet 4 outputs, filtered for successful tool-augmented resolution (a filtering sketch follows this list).
  • Learning Objective: Maximum likelihood training on full reasoning traces, including both the natural language and interleaved code/tool markers.
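
As a concrete illustration of the selection criterion above, the following sketch filters candidate examples under the assumption that each record carries a list of base-model attempt outcomes and a tool-augmented trace; these field names are hypothetical.

```python
# Sketch of the cold-start selection rule: keep an example only if the base
# model fails on at least 75% of its attempts while a tool-augmented trace
# solves it. The record fields used here are assumptions for illustration.

def keep_for_cold_start(example, failure_threshold=0.75):
    attempts = example["base_attempts"]               # list of bools: was the base model correct?
    failure_rate = 1.0 - sum(attempts) / len(attempts)
    tool_success = example["tool_trace"]["correct"]   # did the tool-augmented trace succeed?
    return failure_rate >= failure_threshold and tool_success

def build_cold_start_pool(candidates):
    return [ex for ex in candidates if keep_for_cold_start(ex)]
```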

2.2. Reinforcement Learning for Agentic Behavior

  • Interactive Environment: At each step the model decides whether to issue a tool call or output plain text; rewards are given only for final-answer correctness, with an additional penalty for format violations:

R(\tau) = R_{\rm acc}(\tau) + R_{\rm format}(\tau)

where R_{\rm acc}(\tau) = 1 if the final answer is correct and 0 otherwise, and R_{\rm format}(\tau) penalizes format violations.

  • Optimization: PPO-style updates are executed via DAPO with KL coefficient set to 0.0, eschewing auxiliary or reward-shaping losses.
  • Exploration: Achieved implicitly via stochastic decoding; no explicit annealing or tuning is required. (A minimal sketch of the reward computation follows this list.)
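
The following is a minimal sketch of the trajectory reward described above. The `is_correct` and `has_format_violation` predicates are hypothetical helpers, and the magnitude of the format penalty is an illustrative assumption (the source does not specify it).

```python
# Sketch of the outcome-only trajectory reward R(tau) = R_acc(tau) + R_format(tau).
# `is_correct` and `has_format_violation` are hypothetical helpers, and the
# penalty magnitude is an illustrative assumption.

FORMAT_PENALTY = -0.5  # illustrative value, not specified in the source

def trajectory_reward(trajectory, reference_answer):
    r_acc = 1.0 if is_correct(trajectory.final_answer, reference_answer) else 0.0
    r_format = FORMAT_PENALTY if has_format_violation(trajectory) else 0.0
    # No intermediate shaping: only the final outcome and formatting are scored.
    return r_acc + r_format
```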

The two-stage approach remedies the observed failure of direct RL alone to induce reliable tool-use, first scaffolding tool invocation and then refining it with outcome-driven optimization.

3. Task-Driven Data Curation and Diversity

Data construction for DeepEyesV2 emphasizes diversity, verifiability, and moderate task complexity. Datasets are separated into cold-start and RL pools:

Subset     | Source Tasks                                  | Episodes/Trajectories | Proportion by Task
Cold-Start | V*, ReVisual, MathCoder, PixMo, TallyQA, CoT  | ≈50k                  | perception, reasoning, search, long-CoT
RL         | MMSearch-R1, VGR, Chain-of-Focus, VLM-R³      | ≈30k                  | perception 30%, reasoning 40%, search 30%
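
One way the stated RL task mix could be realized is weighted sampling over per-task pools; the sketch below is an assumption for illustration, with only the 30/40/30 proportions taken from the table above.

```python
# Sketch of sampling RL episodes according to the stated task mix
# (perception 30%, reasoning 40%, search 30%). The sampling mechanism is an
# assumption for illustration; only the proportions come from the text above.
import random

TASK_MIX = {"perception": 0.30, "reasoning": 0.40, "search": 0.30}

def sample_batch(pools, batch_size=32):
    """pools maps task name -> list of episodes for that task."""
    tasks = list(TASK_MIX)
    weights = [TASK_MIX[t] for t in tasks]
    return [random.choice(pools[random.choices(tasks, weights=weights, k=1)[0]])
            for _ in range(batch_size)]
```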

The RealX-Bench benchmark, introduced for evaluation, contains 300 QA pairs spanning five domains with multi-step, real-world, and multi-hop requirements.

4. Evaluation: RealX-Bench and Beyond

4.1. RealX-Bench

  • Design: Evaluates perception, search, reasoning, and their integration on challenging, objectively verifiable, real-world scenarios (e.g., consumer images, scientific charts).
  • Results: DeepEyesV2 achieves 28.3% overall accuracy, +6.0% over the Qwen2.5-VL-7B backbone. On the hardest “integration” subset, the gain is +8.4% (18.1% vs. 9.7%).

4.2. Benchmarks Across Modalities

DeepEyesV2 exhibits robust improvements across OCR & chart understanding, mathematical reasoning, and information-seeking tasks:

Task/Benchmark                     | DeepEyesV2 (%) | Qwen2.5-VL-7B (%) | Gain (%)
RealX-Bench (overall)              | 28.3           | 22.3              | +6.0
MathVerse                          | 52.7           | 45.6              | +7.1
MMSearch                           | 63.7           | 52.2              | +11.5
Various (V*-Bench, HRBench, etc.)  |                |                   | +3 to +7

Ablation studies confirm that mixing cold-start data across perception, reasoning, and long chain-of-thought is optimal for initial training. RL data covering all task types ensures maximal, balanced performance.

This suggests that diversity and explicit tool-benefit curation are crucial drivers for agentic competence in multimodal reasoning.

5. Adaptive and Selective Tool Use

Analysis of tool-use behavior finds:

  • Task-Adaptivity: The model invokes cropping/marking for perception, numerical analysis code for reasoning, and search API calls for search tasks.
  • Reinforcement Effects: RL training decreases unconditional tool invocation (from ~90% to ~60%), focusing use on contexts providing demonstrable benefit. RL further induces complex, multi-tool trajectories never seen in supervised traces.
  • Examples: Complex code-plus-search solutions emerge post-RL, evidencing spontaneous composition and recurrence of agentic workflows.

This demonstrates selective, context-driven agentic behavior, as opposed to rigid or template-based tool-use.

6. Limitations and Prospective Enhancements

Key limitations include:

  • Execution/Selection Errors: Mis-cropped regions, buggy code, or improper tool selection (e.g., using text search where image search is needed) persist.
  • System Dependencies: Stability relies on sandbox execution integrity and search API latency—potential bottlenecks in real-world deployments.
  • Expansion Opportunities: Future directions include richer toolsets (e.g., database queries, 3D operations), hierarchical planning, and intermediate step reward shaping.

A plausible implication is that expanding the range and granularity of tool choices, coupled with more sophisticated planning/reward criteria, could further enable complex agentic reasoning.

7. Domain-Specific Application: Eyelid Measurement System

A related instantiation of DeepEyesV2 (Chen, 1 Apr 2025) addresses biometric periocular measurement:

  • Architecture: Input images, captured via smartphone in multiple gaze positions and calibrated using a reference scale, are processed via a frozen DINOv2 (ViT-base or large) backbone. Features are fused using a feature pyramid network (FPN), then passed to lightweight regressors (MLP or Deep Ensemble) and a binary encoding classification head.
  • Encoding and Loss: Continuous measurement targets (e.g., MRD1, MRD2, LF) are quantized and encoded as K-bit binary vectors, enabling regression to be recast as multi-bit classification (a binary-encoding sketch follows this list). Orthogonal regularization is employed to decorrelate weight matrices and encourage robust generalization; focal loss addresses class imbalance.
  • Performance: On held-out test data, DeepEyesV2 achieves lower MSE and higher R² on MRD2 and LF than CNN or standard ViT baselines, with the deep ensemble head giving optimal stability. The pipeline processes images in <50 ms on a midrange smartphone, making real-time, mobile deployment feasible.
  • Robustness and Scalability: The frozen backbone minimizes device-specific adaptation requirements and memory use, while self-supervised DINOv2 features offer consistent performance across tasks. The FPN provides spatial bias without substantial parameter inflation, and the system structure allows for trade-offs in head precision (by adjusting binary encoding length).
  • Limitations: Domain-specific self-supervised adaptation was limited by data scale; focal loss sometimes showed instability; and extending to 3D contours or continuous gaze tracking is anticipated for future work.
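
The binary-encoding idea can be illustrated with a short sketch. Uniform quantization, the bit width, and the measurement range used here are illustrative assumptions; the source states only that continuous targets are quantized into K-bit binary codes.

```python
# Sketch of recasting a continuous measurement (e.g., MRD1 in mm) as K-bit
# binary classification. Uniform quantization and the value range are
# illustrative assumptions.
import numpy as np

K = 8                       # number of bits; controls the head's precision
VALUE_RANGE = (0.0, 12.0)   # assumed measurement range in mm (illustrative)

def encode(value, k=K, lo=VALUE_RANGE[0], hi=VALUE_RANGE[1]):
    """Quantize a continuous target into 2^k levels and return its k-bit code."""
    levels = 2 ** k
    idx = int(np.clip((value - lo) / (hi - lo) * (levels - 1), 0, levels - 1))
    return np.array([(idx >> b) & 1 for b in reversed(range(k))], dtype=np.float32)

def decode(bits, k=K, lo=VALUE_RANGE[0], hi=VALUE_RANGE[1]):
    """Threshold predicted bits at 0.5 and map the code back to a continuous value."""
    idx = int("".join(str(int(b >= 0.5)) for b in bits), 2)
    return lo + idx / (2 ** k - 1) * (hi - lo)

# Example: a 4.3 mm target becomes an 8-bit code and decodes back to ~4.28 mm,
# i.e., within one quantization step (12 mm / 255 ≈ 0.05 mm) of the original.
code = encode(4.3)
approx = decode(code)
```

Increasing K yields finer measurement resolution at the cost of a larger classification head, which is the precision trade-off noted above.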

Conclusion

DeepEyesV2 represents a versatile paradigm for both agentic multimodal reasoning and robust, resource-efficient medical image analysis. The agentic model’s two-stage training (supervised bootstrapping, agentic RL) and adaptive tool-use yield substantial gains across diverse benchmarks, establishing a reference for integrated, tool-augmented multimodal intelligence. The application to eyelid measurement underscores the broader principle: leveraging strong foundational models, lightweight adaptation, and coded representations to achieve deployment-ready, accurate, and generalizable AI solutions.
