AdaTooler-V: Adaptive Visual Reasoning Model

Updated 25 December 2025
  • AdaTooler-V is a multimodal large language model that learns, via reinforcement learning, when invoking external visual tools genuinely improves visual reasoning, rather than calling tools indiscriminately.
  • Its adaptive policy is trained with AT-GRPO reinforcement learning, guided by a Tool Benefit Score (TBS), to decide whether visual tools would improve reasoning on a given sample.
  • AdaTooler-V-7B achieves state-of-the-art results, including 89.8% accuracy on the high-resolution V* benchmark, outperforming proprietary models such as GPT-4o and Gemini 1.5 Pro.

AdaTooler-V is a multimodal LLM (MLLM) specifically engineered for adaptive tool-use in visual reasoning contexts spanning single images, multi-image composite problems, and temporally indexed video data. The principal innovation of AdaTooler-V lies in its dynamic policy for invoking external vision tools—where the model first determines whether tool-use confers genuine improvement before incurring the computational overhead of such interactions. This adaptivity is achieved through a reinforcement learning protocol with reward modulation based on the Tool Benefit Score (TBS), ensuring that actions yield verifiable gains across a range of visual reasoning tasks. AdaTooler-V leverages two major datasets for staged training: AdaTooler-V-CoT-100k for supervised fine-tuning (SFT) initialization, and AdaTooler-V-300k for reinforcement learning (RL) with real-time rewards. Empirical evaluation demonstrates state-of-the-art performance, with AdaTooler-V-7B attaining 89.8% accuracy on the high-resolution V* benchmark, outperforming contemporary proprietary commercial models including GPT-4o and Gemini 1.5 Pro (Wang et al., 18 Dec 2025).

1. Motivation and Model Architecture

AdaTooler-V addresses the inefficiency of blind tool-use patterns in existing open-source MLLMs, which invoke vision tools irrespective of actual benefit. The model is designed to "think in two modes": pure text-based reasoning and interleaved reasoning that leverages vision tools (e.g., cropping images, extracting video frames). The adaptive invocation policy is trained via AT-GRPO (Adaptive Tool-use Group Relative Policy Optimization), a reinforcement learning algorithm that modulates reward scales in proportion to the sample-specific Tool Benefit Score. The backbone is Qwen2.5-VL-7B (used for the cold start), extended to interleave chain-of-thought (CoT) reasoning with tool-based visual actions, each of which produces an explicit observation for further reasoning.
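To make the two-mode behavior concrete, here is a minimal sketch of an interleaved Thought/Action/Observation loop. Only the action names come from the paper's vocabulary; the model interface (`generate_step`, `force_answer`), stopping logic, and tool implementations are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict

# Hypothetical tool registry; only the action names come from the paper's vocabulary.
TOOLS: Dict[str, Callable] = {
    "CropImg": lambda image, box: image.crop(box),       # zoom into an image region
    "FrameAt": lambda video, t: video.get_frame(t),      # extract a single video frame
}

def adaptive_reason(model, question, visual_input, max_turns: int = 7):
    """Interleave text reasoning with optional tool calls until an answer is produced."""
    context = [("user", question), ("visual", visual_input)]
    for _ in range(max_turns):
        step = model.generate_step(context)   # hypothetical API: returns a Thought plus an Action or an Answer
        context.append(("thought", step.thought))
        if step.answer is not None:           # pure text-mode reasoning: no tool call needed
            return step.answer
        tool = TOOLS[step.action.name]        # tool mode: execute the chosen vision action
        observation = tool(visual_input, *step.action.args)
        context.append(("observation", observation))
    return model.force_answer(context)        # hypothetical fallback if no answer was emitted
```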

2. Dataset Construction and Annotation Schema

AdaTooler-V-CoT-100k is a high-fidelity dataset containing 100,000 multi-round CoT trajectories sourced from three modalities:

  • Single-image questions (61,000 samples)
  • Multi-image, cross-view questions (11,000 samples)
  • Video questions with frame-resolved tool interactions (28,000 samples)

Samples originate from AdaTooler-V-300k, itself pooled from public VQA/video-QA/image-reasoning benchmarks covering charts, OCR, mathematical diagrams, commonsense, spatial/logical reasoning, object counting, and high-resolution perception. Each sample follows a standardized annotation protocol:

  • Original user question
  • Ground-truth answer
  • CoT trace: alternating “Thought” tokens, explicit “Action” tokens (CropImg, FrameAt, VideoClip, PathTracer), corresponding Observation images or frames, and the final answer

Annotation is fully synthetic, generated by Qwen2.5-VL-72B-Instruct, with multi-phase rule-based filtering for semantic consistency, completeness (final answer presence), and format compliance. No human relabeling is involved; quality assurance ensures at least 95% fluency and factual consistency in the dataset.
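A sketch of what one annotated record and the rule-based filters might look like, assuming a simple Python schema; the field names and concrete checks are illustrative, with only the turn structure (Thought, Action, Observation, final answer) and the action vocabulary taken from the protocol above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ACTIONS = {"CropImg", "FrameAt", "VideoClip", "PathTracer"}  # action vocabulary from the paper

@dataclass
class Turn:
    thought: str                        # free-form "Thought" text
    action: Optional[str] = None        # one of ACTIONS, or None on the concluding turn
    action_args: Optional[dict] = None  # tool arguments, e.g. a crop box or timestamp
    observation: Optional[str] = None   # path/ID of the returned image or frame

@dataclass
class CoTSample:
    question: str                       # original user question
    answer: str                         # ground-truth answer
    turns: List[Turn] = field(default_factory=list)
    final_answer: str = ""              # concluding answer in the trace

def passes_filters(sample: CoTSample) -> bool:
    """Hypothetical rule-based checks mirroring the filtering criteria described above."""
    has_final_answer = bool(sample.final_answer.strip())                            # completeness
    valid_actions = all(t.action in ACTIONS for t in sample.turns if t.action)      # format compliance
    answer_consistent = (sample.final_answer.strip().lower()
                         == sample.answer.strip().lower())                          # consistency proxy
    return has_final_answer and valid_actions and answer_consistent
```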

| Modality | Sample Count | Typical Resolution |
|---|---|---|
| Single-image | 61,000 | 512×512, 640×480, 1024×1024 |
| Multi-image | 11,000 | Varies per benchmark |
| Video | 28,000 | Avg. 480×640; up to 720×1280 |

3. Dataset Composition and Quantitative Profile

Task-type distribution is as follows (categories are non-exclusive, so percentages sum to more than 100%):

  • Math reasoning (static diagrams, formulas): 42%
  • Chart & data read-off: 24%
  • OCR/Text reading: 15%
  • Commonsense & knowledge VQA: 30%
  • Spatial & logical puzzles: 12%
  • Object counting/density estimation: 6%
  • High-res fine-grained perception: 6%

Key metrics:

  • Videos average roughly 12 seconds (32 frames sampled at 2.5 fps, ≈12.8 s of coverage, as checked below)
  • Single-image resolution split: 512×512 (40%), 640×480 (30%), up to 1024×1024 (30%)
  • Mean CoT trajectory length: 7 turns (each turn = Thought→Action→Observation)
  • CoT text chunk: ≈150 tokens per sample
  • Train/Validation/Test splits: 90,000/5,000/5,000 samples
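As a quick consistency check of the video statistics, the reported frame budget and sampling rate imply the ≈12.8 s figure; the uniform timestamp mapping below is an assumption.

```python
FPS = 2.5         # reported sampling rate
NUM_FRAMES = 32   # reported frames per video

clip_seconds = NUM_FRAMES / FPS                      # 32 / 2.5 = 12.8 s, matching the average above
timestamps = [i / FPS for i in range(NUM_FRAMES)]    # assumed uniform sampling: 0.0, 0.4, ..., 12.4 s
print(clip_seconds, timestamps[:3], timestamps[-1])  # 12.8 [0.0, 0.4, 0.8] 12.4
```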

4. Supervised Fine-Tuning (SFT) for Cold Start

SFT is used to initialize Qwen2.5-VL-7B-Instruct with consistent tool-interaction behavior preceding RL. The SFT procedure:

  • 1 epoch over the 90,000-sample training split
  • Batch size: 16
  • Learning rate: $5 \times 10^{-5}$
  • Loss function: standard next-token cross-entropy across the complete thought/action/answer sequence

The objective is formalized as:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(x,y)\in\mathcal{D}_{\mathrm{CoT}}} \sum_{t=1}^{T}\log \pi_{\theta}(y_t \mid y_{<t}, x)$$

where $x$ denotes the original multimodal input plus prompt and $y$ is the flattened CoT trajectory concatenated with the final answer.
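A minimal PyTorch-style sketch of this objective over the flattened thought/action/answer token sequence; the masking convention and optimizer choice are assumptions, while the batch size, learning rate, and cross-entropy loss follow the description above.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the flattened thought/action/answer sequence."""
    logits = model(input_ids).logits                  # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :].contiguous()     # predict token t from tokens < t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                            # mask padding (and optionally prompt) positions
    )

# Reported settings: 1 epoch over 90,000 samples, batch size 16, learning rate 5e-5.
# The optimizer is not specified in the summary; AdamW is a common default, e.g.:
#   optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```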

5. Tool Benefit Score (TBS) and Adaptive Usage Policy

TBS quantifies the expected incremental value of invoking a vision tool for a given sample:

$$\Delta S_i = S^+(q_i) - S^-(q_i)$$

where $S^+(q_i)$ is the average accuracy of Qwen2.5-VL-72B-Instruct on question $q_i$ with tool-use allowed, and $S^-(q_i)$ is the same model's accuracy without tools (text-only reasoning).

TBS is computed by performing 8 independent greedy decoding runs in each mode and taking the difference in average accuracy. Optionally, SFT sampling weights are assigned as:

$$w_i = 1 + \lambda \max(0, \Delta S_i), \qquad \lambda = 0.5$$

Samples with a large positive benefit can receive up to a 1.5× sampling weight, though the resulting reweighting is mild in practice (all training samples are still seen exactly once).
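The TBS and sampling-weight computation can be sketched as follows; `evaluate` is a hypothetical helper that runs the reference model in the given mode and returns a per-run accuracy.

```python
NUM_RUNS = 8   # independent decoding runs per mode, as described above
LAMBDA = 0.5   # weighting coefficient lambda

def tool_benefit_score(evaluate, question) -> float:
    """Delta S_i = S^+(q_i) - S^-(q_i): mean accuracy with tools minus mean accuracy without."""
    s_plus = sum(evaluate(question, tools=True) for _ in range(NUM_RUNS)) / NUM_RUNS
    s_minus = sum(evaluate(question, tools=False) for _ in range(NUM_RUNS)) / NUM_RUNS
    return s_plus - s_minus

def sampling_weight(delta_s: float) -> float:
    """w_i = 1 + lambda * max(0, Delta S_i); at most 1.5 since Delta S_i <= 1."""
    return 1.0 + LAMBDA * max(0.0, delta_s)
```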

6. Sample Trajectories and Reasoning Patterns

AdaTooler-V-CoT-100k establishes multi-turn tool-interaction patterns for visual CoT reasoning. Examples (condensed from the paper's figures):

  • Image-based (chart reading):
    • Prompt: “Below is a bar-chart showing annual sales of widgets from 2018 to 2022. Question: In which year did sales first exceed 2 million units?”
    • CoT: Thought to crop around labels → Action: CropImg → Observation: legend detail → Inference → Answer: “2020.”
  • Video-based (event counting):
    • Prompt: “A red ball is dropped onto a table and bounces several times before stopping. How many bounces are visible?”
    • CoT: FrameAt actions to extract frames at candidate timestamps → frame observations → bounce counting → Answer: “3 bounces” (rendered as a hypothetical record below).
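For illustration, the video example could be stored as a record along the lines of the hypothetical schema sketched in Section 2; the timestamps, crop-free frame observations, and file names below are invented placeholders.

```python
video_example = {
    "question": ("A red ball is dropped onto a table and bounces several times before stopping. "
                 "How many bounces are visible?"),
    "answer": "3",
    "turns": [
        {"thought": "Check frames around likely impact times.",
         "action": "FrameAt", "action_args": {"t": 1.2}, "observation": "frame_1.png"},   # hypothetical timestamp
        {"thought": "Ball contacts the table here; continue scanning later frames.",
         "action": "FrameAt", "action_args": {"t": 2.0}, "observation": "frame_2.png"},
        {"thought": "Three distinct contacts are visible across the sampled frames.",
         "action": None},
    ],
    "final_answer": "3 bounces",
}
```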

7. Impact and Design Rationale

AdaTooler-V’s staged training—SFT on high-quality, filtered multi-modal CoT trajectories, followed by RL with adaptive reward scaling informed by sample-specific TBS—results in robust, context-sensitive sub-policies for vision tool invocation. The model empirically demonstrates strong reasoning capability, with automatic mode-switching between pure textual inference and efficient tool-based visual reasoning. A plausible implication is enhanced cost-effective deployment in real-world multimodal applications, as the model prioritizes the cheapest reasoning path that does not compromise accuracy (Wang et al., 18 Dec 2025).
