AdaTooler-V: Adaptive Visual Reasoning Model
- AdaTooler-V is a multimodal large language model that learns, through reinforcement learning, when invoking external visual tools actually improves its reasoning, rather than calling tools indiscriminately.
- Its adaptive invocation policy is trained with AT-GRPO reinforcement learning, whose rewards are scaled by a per-sample Tool Benefit Score (TBS) measuring whether visual tools genuinely improve answer accuracy.
- AdaTooler-V-7B delivers state-of-the-art results, including 89.8% accuracy on the high-resolution V* benchmark, outperforming proprietary models such as GPT-4o and Gemini 1.5 Pro.
AdaTooler-V is a multimodal LLM (MLLM) specifically engineered for adaptive tool-use in visual reasoning contexts spanning single images, multi-image composite problems, and temporally indexed video data. The principal innovation of AdaTooler-V lies in its dynamic policy for invoking external vision tools—where the model first determines whether tool-use confers genuine improvement before incurring the computational overhead of such interactions. This adaptivity is achieved through a reinforcement learning protocol with reward modulation based on the Tool Benefit Score (TBS), ensuring that actions yield verifiable gains across a range of visual reasoning tasks. AdaTooler-V leverages two major datasets for staged training: AdaTooler-V-CoT-100k for supervised fine-tuning (SFT) initialization, and AdaTooler-V-300k for reinforcement learning (RL) with real-time rewards. Empirical evaluation demonstrates state-of-the-art performance, with AdaTooler-V-7B attaining 89.8% accuracy on the high-resolution V* benchmark, outperforming contemporary proprietary commercial models including GPT-4o and Gemini 1.5 Pro (Wang et al., 18 Dec 2025).
1. Motivation and Model Architecture
AdaTooler-V addresses the inefficiency observed with blind tool-use reasoning patterns in existing open-source MLLMs, which invoke vision tools irrespective of actual benefit. The model is constructed to "think in two modes": pure text-based reasoning and interleaved reasoning leveraging vision tools (e.g., cropping images, extracting video frames). The adaptive invocation policy is reinforced via AT-GRPO (Adaptive Tool-Use Gradient Reward Policy Optimization), a reinforcement learning algorithm that modulates reward scales in proportion to the sample-specific Tool Benefit Score. The backbone is based on Qwen2.5-VL-7B (for cold start), extended to interleave chain-of-thought (CoT) reasoning with tool-based visual actions, each producing explicit observations for further reasoning.
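The interleaved reasoning pattern can be sketched as a simple control loop. The helpers `generate_step` and `run_tool`, the `Step` structure, and all argument names below are illustrative assumptions rather than the paper's API; only the tool names and the Thought→Action→Observation pattern come from the source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

TOOLS = {"CropImg", "FrameAt", "VideoClip", "PathTracer"}  # tool names from the dataset schema

@dataclass
class Step:
    thought: str
    action: Optional[dict] = None   # e.g. {"tool": "CropImg", "args": {"bbox": [...]}}
    answer: Optional[str] = None

def answer_question(generate_step: Callable[..., Step],
                    run_tool: Callable[[dict, Any], Any],
                    media: Any,
                    question: str,
                    max_turns: int = 7) -> str:
    """Alternate Thought -> Action -> Observation until the model commits to an answer.

    `generate_step` wraps one decoding round of the MLLM; `run_tool` executes a
    vision tool and returns an observation (e.g. a cropped image or an extracted
    frame). Both are caller-supplied because the paper's interfaces are not given.
    """
    context = [("user", question)]
    for _ in range(max_turns):
        step = generate_step(media, context)
        context.append(("thought", step.thought))
        if step.answer is not None:                           # pure text-based mode
            return step.answer
        if step.action and step.action.get("tool") in TOOLS:  # tool-augmented mode
            context.append(("observation", run_tool(step.action, media)))
    # Turn budget exhausted: request a final answer without further tool calls.
    return generate_step(media, context, force_answer=True).answer
```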
2. Dataset Construction and Annotation Schema
AdaTooler-V-CoT-100k is a high-fidelity dataset containing 100,000 multi-round CoT trajectories sourced from three modalities:
- Single-image questions (61,000 samples)
- Multi-image, cross-view questions (11,000 samples)
- Video questions with frame-resolved tool interactions (28,000 samples)
Samples originate from AdaTooler-V-300k, itself pooled from public VQA/video-QA/image-reasoning benchmarks covering charts, OCR, mathematical diagrams, commonsense, spatial/logical reasoning, object counting, and high-resolution perception. Each sample follows a standardized annotation protocol:
- Original user question
- Ground-truth answer
- CoT trace: alternating “Thought” tokens, explicit “Action” tokens (CropImg, FrameAt, VideoClip, PathTracer), corresponding Observation images or frames, and the final answer
Annotation is fully synthetic, generated by Qwen2.5-VL-72B-Instruct, with multi-phase rule-based filtering for semantic consistency, completeness (final answer presence), and format compliance. No human relabeling is involved; quality assurance ensures at least 95% fluency and factual consistency in the dataset.
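As a concrete illustration of this schema, a single annotated sample might be serialized as below. The field names, crop coordinates, and file path are assumptions for illustration; the Thought/Action/Observation structure, the tool name, and the chart question (reused from Section 6) come from the protocol above.

```python
# Illustrative single-image record following the annotation protocol (field names assumed).
sample = {
    "question": "In which year did sales first exceed 2 million units?",
    "ground_truth": "2020",
    "trajectory": [
        {
            "thought": "The axis labels are too small to read; crop the y-axis region.",
            "action": {"tool": "CropImg", "args": {"bbox": [0, 0, 320, 480]}},
            "observation": "crops/step0.png",   # image returned by the tool
        },
        {
            "thought": "The 2020 bar is the first one to cross the 2M gridline.",
            "action": None,                     # final turn: no tool call needed
            "observation": None,
        },
    ],
    "final_answer": "2020",
}
```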
| Modality | Sample Count | Typical Resolution |
|---|---|---|
| Single-image | 61,000 | 512×512, 640×480, 1024×1024 |
| Multi-image | 11,000 | Varies per benchmark |
| Video | 28,000 | Avg. 480×640; up to 720×1280 |
3. Dataset Composition and Quantitative Profile
Task-type distribution is as follows (percentages are non-exclusive):
- Math reasoning (static diagrams, formulas): 42%
- Chart & data read-off: 24%
- OCR/Text reading: 15%
- Commonsense & knowledge VQA: 30%
- Spatial & logical puzzles: 12%
- Object counting/density estimation: 6%
- High-res fine-grained perception: 6%
Key metrics:
- Videos average ≈12 s, sampled at 2.5 fps into 32 frames (32 / 2.5 ≈ 12.8 s of coverage; see the sketch after this list)
- Single-image resolution split: 512×512 (40%), 640×480 (30%), up to 1024×1024 (30%)
- Mean CoT trajectory length: 7 turns (each turn = Thought→Action→Observation)
- CoT text chunk: ≈150 tokens per sample
- Train/Validation/Test splits: 90,000/5,000/5,000 samples
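The frame-sampling arithmetic behind the video statistic above is simply coverage = frames / fps. A minimal check, with the constants taken from the list and the variable names assumed:

```python
# 32 frames sampled at 2.5 fps span 32 / 2.5 = 12.8 s of footage.
N_FRAMES = 32
FPS = 2.5

coverage_seconds = N_FRAMES / FPS                  # 12.8
timestamps = [i / FPS for i in range(N_FRAMES)]    # 0.0, 0.4, ..., 12.4 s
assert abs(coverage_seconds - 12.8) < 1e-9
```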
4. Supervised Fine-Tuning (SFT) for Cold Start
SFT is used to initialize Qwen2.5-VL-7B-Instruct with consistent tool-interaction behavior preceding RL. The SFT procedure:
- 1 epoch over the 90,000-sample training split
- Batch size: 16
- Learning rate:
- Loss function: standard next-token cross-entropy across the complete thought/action/answer sequence
The objective is formalized as:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_{\theta}\!\left(y_{t} \mid x,\, y_{<t}\right)\right],$$

where $x$ denotes the original multimodal input plus prompt and $y$ is the flattened CoT trajectory concatenated with the final answer.
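A minimal PyTorch-style sketch of this objective, assuming a HuggingFace-style model whose forward pass returns `.logits`, and assuming prompt and image positions have already been masked to -100 in `labels` so that only the flattened trajectory and answer are scored; the helper is illustrative, not the paper's training code.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, attention_mask, labels):
    """Next-token cross-entropy over the flattened CoT trajectory plus final answer.

    `labels` mirrors `input_ids`, with prompt and image positions set to -100 so
    that only the trajectory/answer tokens contribute to the loss.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits  # (B, T, V)
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from prefix up to t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```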
5. Tool Benefit Score (TBS) and Adaptive Usage Policy
TBS quantifies the expected incremental value of invoking a vision tool for a given sample:

$$\mathrm{TBS}(q) = \mathrm{Acc}_{\text{tool}}(q) - \mathrm{Acc}_{\text{no-tool}}(q),$$

where $\mathrm{Acc}_{\text{tool}}(q)$ is the average accuracy of Qwen2.5-72B-Instruct on question $q$ with tool-use allowed, and $\mathrm{Acc}_{\text{no-tool}}(q)$ is the same model's accuracy without tools (text-only reasoning).
TBS is estimated from 8 independent greedy decoding runs in each mode, taking the difference between the two mean accuracies. Optionally, SFT sampling weights are scaled with TBS: samples with a large positive benefit can receive up to a 1.5× increase in sampling probability, though the empirical reshuffling remains mild (all training samples are still seen once).
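A sketch of the TBS estimate and the optional sampling weight, assuming a caller-supplied `rollout_correct` function that runs one decoding pass and returns 1.0 for a correct final answer; the exact weighting formula below is an assumption, with only the 1.5× cap taken from the text.

```python
N_RUNS = 8  # independent decoding runs per mode, as stated above

def tool_benefit_score(rollout_correct, sample):
    """TBS(q) = Acc_tool(q) - Acc_no-tool(q), each averaged over N_RUNS rollouts."""
    acc_tool = sum(rollout_correct(sample, use_tools=True) for _ in range(N_RUNS)) / N_RUNS
    acc_text = sum(rollout_correct(sample, use_tools=False) for _ in range(N_RUNS)) / N_RUNS
    return acc_tool - acc_text          # lies in [-1, 1]

def sampling_weight(tbs, max_boost=1.5):
    """Assumed weighting: boost samples with positive tool benefit, capped at 1.5x."""
    return min(1.0 + max(tbs, 0.0), max_boost)
```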
6. Sample Trajectories and Reasoning Patterns
AdaTooler-V-CoT-100k establishes multi-turn tool-interaction patterns for visual CoT reasoning. Example (condensed from paper figures):
- Image-based (chart reading):
- Prompt: “Below is a bar-chart showing annual sales of widgets from 2018 to 2022. Question: In which year did sales first exceed 2 million units?”
- CoT: Thought to crop around labels → Action: CropImg → Observation: legend detail → Inference → Answer: “2020.”
- Video-based (event counting):
- Prompt: “A red ball is dropped onto a table and bounces several times before stopping. How many bounces are visible?”
- CoT: Thought identifying candidate bounce moments → Action: FrameAt at selected timestamps → Observation: extracted frames → bounce counting → Answer: “3 bounces.”
7. Impact and Design Rationale
AdaTooler-V’s staged training—SFT on high-quality, filtered multi-modal CoT trajectories, followed by RL with adaptive reward scaling informed by sample-specific TBS—results in robust, context-sensitive sub-policies for vision tool invocation. The model empirically demonstrates strong reasoning capability, with automatic mode-switching between pure textual inference and efficient tool-based visual reasoning. A plausible implication is enhanced cost-effective deployment in real-world multimodal applications, as the model prioritizes the cheapest reasoning path that does not compromise accuracy (Wang et al., 18 Dec 2025).