AdaTooler-V: Adaptive Visual Reasoning Model
- AdaTooler-V is a multimodal large language model that learns, through reinforcement learning, when invoking external visual tools actually improves its reasoning, rather than calling tools indiscriminately.
- Its adaptive invocation policy is trained with AT-GRPO reinforcement learning, whose rewards are scaled by a per-sample Tool Benefit Score (TBS) measuring whether visual tools genuinely improve answer accuracy.
- AdaTooler-V-7B delivers state-of-the-art results, including 89.8% accuracy on the high-resolution V* benchmark, outperforming proprietary models such as GPT-4o and Gemini 1.5 Pro.
AdaTooler-V is a multimodal LLM (MLLM) specifically engineered for adaptive tool-use in visual reasoning contexts spanning single images, multi-image composite problems, and temporally indexed video data. The principal innovation of AdaTooler-V lies in its dynamic policy for invoking external vision tools—where the model first determines whether tool-use confers genuine improvement before incurring the computational overhead of such interactions. This adaptivity is achieved through a reinforcement learning protocol with reward modulation based on the Tool Benefit Score (TBS), ensuring that actions yield verifiable gains across a range of visual reasoning tasks. AdaTooler-V leverages two major datasets for staged training: AdaTooler-V-CoT-100k for supervised fine-tuning (SFT) initialization, and AdaTooler-V-300k for reinforcement learning (RL) with real-time rewards. Empirical evaluation demonstrates state-of-the-art performance, with AdaTooler-V-7B attaining 89.8% accuracy on the high-resolution V* benchmark, outperforming contemporary proprietary commercial models including GPT-4o and Gemini 1.5 Pro (Wang et al., 18 Dec 2025).
1. Motivation and Model Architecture
AdaTooler-V addresses the inefficiency observed with blind tool-use reasoning patterns in existing open-source MLLMs, which invoke vision tools irrespective of actual benefit. The model is constructed to "think in two modes": pure text-based reasoning and interleaved reasoning leveraging vision tools (e.g., cropping images, extracting video frames). The adaptive invocation policy is reinforced via AT-GRPO (Adaptive Tool-Use Gradient Reward Policy Optimization), a reinforcement learning algorithm that modulates reward scales in proportion to the sample-specific Tool Benefit Score. The backbone is based on Qwen2.5-VL-7B (for cold start), extended to interleave chain-of-thought (CoT) reasoning with tool-based visual actions, each producing explicit observations for further reasoning.
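The interleaved reasoning pattern can be sketched as a simple control loop. The helpers `generate_step` and `run_tool`, the `Step` structure, and all argument names below are illustrative assumptions rather than the paper's API; only the tool names and the Thought→Action→Observation pattern come from the source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

TOOLS = {"CropImg", "FrameAt", "VideoClip", "PathTracer"}  # tool names from the dataset schema

@dataclass
class Step:
    thought: str
    action: Optional[dict] = None   # e.g. {"tool": "CropImg", "args": {"bbox": [...]}}
    answer: Optional[str] = None

def answer_question(generate_step: Callable[..., Step],
                    run_tool: Callable[[dict, Any], Any],
                    media: Any,
                    question: str,
                    max_turns: int = 7) -> str:
    """Alternate Thought -> Action -> Observation until the model commits to an answer.

    `generate_step` wraps one decoding round of the MLLM; `run_tool` executes a
    vision tool and returns an observation (e.g. a cropped image or an extracted
    frame). Both are caller-supplied because the paper's interfaces are not given.
    """
    context = [("user", question)]
    for _ in range(max_turns):
        step = generate_step(media, context)
        context.append(("thought", step.thought))
        if step.answer is not None:                           # pure text-based mode
            return step.answer
        if step.action and step.action.get("tool") in TOOLS:  # tool-augmented mode
            context.append(("observation", run_tool(step.action, media)))
    # Turn budget exhausted: request a final answer without further tool calls.
    return generate_step(media, context, force_answer=True).answer
```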
2. Dataset Construction and Annotation Schema
AdaTooler-V-CoT-100k is a high-fidelity dataset containing 100,000 multi-round CoT trajectories sourced from three modalities:
- Single-image questions (61,000 samples)
- Multi-image, cross-view questions (11,000 samples)
- Video questions with frame-resolved tool interactions (28,000 samples)
Samples originate from AdaTooler-V-300k, itself pooled from public VQA/video-QA/image-reasoning benchmarks covering charts, OCR, mathematical diagrams, commonsense, spatial/logical reasoning, object counting, and high-resolution perception. Each sample follows a standardized annotation protocol:
- Original user question
- Ground-truth answer
- CoT trace: alternating “Thought” tokens, explicit “Action” tokens (CropImg, FrameAt, VideoClip, PathTracer), corresponding Observation images or frames, and the final answer
Annotation is fully synthetic, generated by Qwen2.5-VL-72B-Instruct, with multi-phase rule-based filtering for semantic consistency, completeness (final answer presence), and format compliance. No human relabeling is involved; quality assurance ensures at least 95% fluency and factual consistency in the dataset.
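As a concrete illustration of this schema, a single annotated sample might be serialized as below. The field names, crop coordinates, and file path are assumptions for illustration; the Thought/Action/Observation structure, the tool name, and the chart question (reused from Section 6) come from the protocol above.

```python
# Illustrative single-image record following the annotation protocol (field names assumed).
sample = {
    "question": "In which year did sales first exceed 2 million units?",
    "ground_truth": "2020",
    "trajectory": [
        {
            "thought": "The axis labels are too small to read; crop the y-axis region.",
            "action": {"tool": "CropImg", "args": {"bbox": [0, 0, 320, 480]}},
            "observation": "crops/step0.png",   # image returned by the tool
        },
        {
            "thought": "The 2020 bar is the first one to cross the 2M gridline.",
            "action": None,                     # final turn: no tool call needed
            "observation": None,
        },
    ],
    "final_answer": "2020",
}
```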
| Modality | Sample Count | Typical Resolution |
|---|---|---|
| Single-image | 61,000 | 512×512, 640×480, 1024×1024 |
| Multi-image | 11,000 | Varies per benchmark |
| Video | 28,000 | Avg. 480×640; up to 720×1280 |
3. Dataset Composition and Quantitative Profile
Task-type distribution is as follows (percentages are non-exclusive):
- Math reasoning (static diagrams, formulas): 42%
- Chart & data read-off: 24%
- OCR/Text reading: 15%
- Commonsense & knowledge VQA: 30%
- Spatial & logical puzzles: 12%
- Object counting/density estimation: 6%
- High-res fine-grained perception: 6%
Key metrics:
- Videos average ≈12 s, sampled at 2.5 fps into 32 frames (32 / 2.5 ≈ 12.8 s of coverage; see the sketch after this list)
- Single-image resolution split: 512×512 (40%), 640×480 (30%), up to 1024×1024 (30%)
- Mean CoT trajectory length: 7 turns (each turn = Thought→Action→Observation)
- CoT text chunk: ≈150 tokens per sample
- Train/Validation/Test splits: 90,000/5,000/5,000 samples
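The frame-sampling arithmetic behind the video statistic above is simply coverage = frames / fps. A minimal check, with the constants taken from the list and the variable names assumed:

```python
# 32 frames sampled at 2.5 fps span 32 / 2.5 = 12.8 s of footage.
N_FRAMES = 32
FPS = 2.5

coverage_seconds = N_FRAMES / FPS                  # 12.8
timestamps = [i / FPS for i in range(N_FRAMES)]    # 0.0, 0.4, ..., 12.4 s
assert abs(coverage_seconds - 12.8) < 1e-9
```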
4. Supervised Fine-Tuning (SFT) for Cold Start
SFT is used to initialize Qwen2.5-VL-7B-Instruct with consistent tool-interaction behavior preceding RL. The SFT procedure:
- 1 epoch over the 90,000-sample training split
- Batch size: 16
- Learning rate:
- Loss function: standard next-token cross-entropy across the complete thought/action/answer sequence
The objective is formalized as:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_{\theta}\!\left(y_{t} \mid x,\, y_{<t}\right)\right],$$

where $x$ denotes the original multimodal input plus prompt and $y$ is the flattened CoT trajectory concatenated with the final answer.
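A minimal PyTorch-style sketch of this objective, assuming a HuggingFace-style model whose forward pass returns `.logits`, and assuming prompt and image positions have already been masked to -100 in `labels` so that only the flattened trajectory and answer are scored; the helper is illustrative, not the paper's training code.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, attention_mask, labels):
    """Next-token cross-entropy over the flattened CoT trajectory plus final answer.

    `labels` mirrors `input_ids`, with prompt and image positions set to -100 so
    that only the trajectory/answer tokens contribute to the loss.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits  # (B, T, V)
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from prefix up to t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```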
5. Tool Benefit Score (TBS) and Adaptive Usage Policy
TBS quantifies the expected incremental value of invoking a vision tool for a given sample:

$$\mathrm{TBS}(q) = \mathrm{Acc}_{\text{tool}}(q) - \mathrm{Acc}_{\text{no-tool}}(q),$$

where $\mathrm{Acc}_{\text{tool}}(q)$ is the average accuracy of Qwen2.5-72B-Instruct on question $q$ with tool-use allowed, and $\mathrm{Acc}_{\text{no-tool}}(q)$ is the same model's accuracy without tools (text-only reasoning).
TBS is estimated from 8 independent greedy decoding runs in each mode, taking the difference between the two mean accuracies. Optionally, SFT sampling weights are scaled with TBS: samples with a large positive benefit can receive up to a 1.5× increase in sampling probability, though the empirical reshuffling remains mild (all training samples are still seen once).
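A sketch of the TBS estimate and the optional sampling weight, assuming a caller-supplied `rollout_correct` function that runs one decoding pass and returns 1.0 for a correct final answer; the exact weighting formula below is an assumption, with only the 1.5× cap taken from the text.

```python
N_RUNS = 8  # independent decoding runs per mode, as stated above

def tool_benefit_score(rollout_correct, sample):
    """TBS(q) = Acc_tool(q) - Acc_no-tool(q), each averaged over N_RUNS rollouts."""
    acc_tool = sum(rollout_correct(sample, use_tools=True) for _ in range(N_RUNS)) / N_RUNS
    acc_text = sum(rollout_correct(sample, use_tools=False) for _ in range(N_RUNS)) / N_RUNS
    return acc_tool - acc_text          # lies in [-1, 1]

def sampling_weight(tbs, max_boost=1.5):
    """Assumed weighting: boost samples with positive tool benefit, capped at 1.5x."""
    return min(1.0 + max(tbs, 0.0), max_boost)
```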
6. Sample Trajectories and Reasoning Patterns
AdaTooler-V-CoT-100k establishes multi-turn tool-interaction patterns for visual CoT reasoning. Example (condensed from paper figures):
- Image-based (chart reading):
- Prompt: “Below is a bar-chart showing annual sales of widgets from 2018 to 2022. Question: In which year did sales first exceed 2 million units?”
- CoT: Thought to crop around labels → Action: CropImg → Observation: legend detail → Inference → Answer: “2020.”
- Video-based (event counting):
- Prompt: “A red ball is dropped onto a table and bounces several times before stopping. How many bounces are visible?”
- CoT: Thought identifying candidate bounce moments → Action: FrameAt at selected timestamps → Observation: extracted frames → bounce counting → Answer: “3 bounces.”
7. Impact and Design Rationale
AdaTooler-V’s staged training—SFT on high-quality, filtered multi-modal CoT trajectories, followed by RL with adaptive reward scaling informed by sample-specific TBS—results in robust, context-sensitive sub-policies for vision tool invocation. The model empirically demonstrates strong reasoning capability, with automatic mode-switching between pure textual inference and efficient tool-based visual reasoning. A plausible implication is enhanced cost-effective deployment in real-world multimodal applications, as the model prioritizes the cheapest reasoning path that does not compromise accuracy (Wang et al., 18 Dec 2025).