AdaTooler-V-CoT-100k Multimodal Reasoning Dataset

Updated 25 December 2025
  • AdaTooler-V-CoT-100k is a high-fidelity multimodal dataset comprising 100k curated samples that enable tool-aware chain-of-thought reasoning in visual tasks.
  • It covers diverse task types, including visual question answering, OCR, math diagrams, and spatial puzzles, with CoT trajectories generated automatically and vetted by rigorous rule-based filtering.
  • The dataset primes the AdaTooler-V model by interleaving natural language reasoning with explicit tool actions, enhancing both efficiency and accuracy in visual reasoning.

AdaTooler-V-CoT-100k is a high-fidelity multimodal dataset comprising 100,000 curated samples for supervised fine-tuning of chain-of-thought (CoT) reasoning with explicit tool-use interaction in visual tasks. It serves as the principal cold-start corpus for AdaTooler-V, a multimodal LLM (MLLM) designed to adaptively invoke vision tools only when beneficial, improving reasoning efficiency and accuracy across static images, image pairs, and dynamic video clips. Constructed via automatic CoT annotation by Qwen2.5-VL-72B-Instruct and stringent rule-based filtering, AdaTooler-V-CoT-100k includes diverse task types that span visual question answering (VQA), chart analysis, OCR, mathematical diagrams, commonsense reasoning, and spatial-temporal puzzles. The dataset operationalizes a standardized interaction schema, interleaving natural language "Thought" steps with precise "Action" calls to vision tools and corresponding "Observation" feedback, culminating in a resolved answer, thereby priming AdaTooler-V for robust tool-aware multimodal reasoning (Wang et al., 18 Dec 2025).

1. Dataset Construction and Annotation Pipeline

AdaTooler-V-CoT-100k is sourced as a filtered subset from the broader AdaTooler-V-300k base pool, which aggregates 300,000 multimodal samples spanning single images, multi-image sets, and video clips. Generation of CoT trajectories utilizes Qwen2.5-VL-72B-Instruct, driven by a template prompt (Appendix C of the source paper) to output multi-step “thinking with images/videos” sequences. The annotation schema per sample consists of:

  • Raw input (single image, image pair, or video plus query)
  • A stepwise reasoning log as an alternating sequence:
    • “Thought_i:” — natural language inference
    • “Action_i: TOOL_NAME(parameters)” — explicit invocation of a vision tool (e.g., CropImg, FrameAt, VideoClip)
    • “Observation_i:” — resulting image patch or video clip from tool application
  • A terminal “Answer” field reflecting the resolved outcome.

Quality assurance is implemented via deterministic rules: each “Action” must produce a corresponding valid “Observation”; the output must end in an answer matching an approved format (multiple-choice, numerical, or free-form); and trajectories exhibiting semantic inconsistency are rejected. This filtering yields the final high-quality 100,000-sample SFT set.
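
To make the record structure and filtering rules concrete, the sketch below shows one plausible Python encoding of a trajectory and the deterministic checks described above. The field names, tool-call string, and answer-format patterns are illustrative assumptions, not the paper's released schema.

```python
import re

# Hypothetical encoding of one trajectory record (field names are assumptions,
# not the paper's released format).
sample = {
    "input": {"modality": "image", "media": "stalls.jpg",
              "query": "Which stall is centrally aligned?"},
    "trajectory": [
        {"thought": "High-resolution image; crop the candidate region."},
        {"action": "CropImg(x=120, y=340, w=512, h=512)",
         "observation": "patch_1.jpg"},
        {"thought": "The cropped patch shows Stall 4 on the center line."},
    ],
    "answer": "Stall 4",
}

# Approved answer formats per the paper: multiple-choice, numerical, free-form.
ANSWER_FORMATS = [
    re.compile(r"^[A-E]$"),          # multiple-choice option
    re.compile(r"^-?\d+(\.\d+)?$"),  # numerical
    re.compile(r"^.{1,200}$"),       # short free-form
]

def passes_rule_filter(s: dict) -> bool:
    """Deterministic checks mirroring the paper's filtering rules.

    The semantic-consistency rejection step is not expressible as a simple
    deterministic rule and is out of scope for this sketch.
    """
    # Rule 1: every Action must produce a corresponding valid Observation.
    for step in s["trajectory"]:
        if "action" in step and not step.get("observation"):
            return False
    # Rule 2: the trajectory must end in an answer matching an approved format.
    answer = s.get("answer", "")
    return any(p.match(answer) for p in ANSWER_FORMATS)

assert passes_rule_filter(sample)
```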

2. Dataset Statistics and Composition

AdaTooler-V-CoT-100k mirrors the modality and task-type proportions of AdaTooler-V-300k. The estimated breakdown is:

Modality        Approximate Count   Task-Type Examples
Video           ≈ 35,000            Action clips, temporal reasoning
Multi-image     ≈ 20,000            Spatial/logical puzzles
Single-image    ≈ 45,000            VQA, charts, OCR, math, counting

Approximate task-type counts in the full 300k pool, whose proportions the 100k subset closely follows:

  • General image/video comprehension (VQA): ~99,000
  • Chart/data analysis: 24,000
  • OCR/transcription: 15,000
  • Math diagrams/problems: 42,000
  • Commonsense/knowledge: 30,000
  • Spatial/logical puzzles: 24,000
  • Visual counting: 6,000
  • High-resolution small-object identification: 6,000

For videos, the average duration is 5–15 seconds (∼150–450 frames at 30 fps), with CoT tool interactions typically targeting 1–5-frame snippets or 1–2-second sub-clips. Image resolutions range from 224×224 to 4K×4K, with tool-interaction crops generally spanning 256×256 to 512×512 pixels.
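
The summary does not specify how the video tools are implemented; the following is a minimal sketch of one natural reading of FrameAt and VideoClip as index arithmetic over a decoded 30 fps frame buffer. The function signatures are assumptions for illustration.

```python
import numpy as np

FPS = 30  # videos in the corpus are described as ~30 fps

def frame_at(frames: np.ndarray, t_seconds: float) -> np.ndarray:
    """Return the frame nearest to timestamp t (assumed FrameAt semantics)."""
    idx = min(round(t_seconds * FPS), len(frames) - 1)
    return frames[idx]

def video_clip(frames: np.ndarray, start_s: float, end_s: float) -> np.ndarray:
    """Return the sub-clip covering [start_s, end_s] (assumed VideoClip semantics)."""
    lo = max(int(start_s * FPS), 0)
    hi = min(int(np.ceil(end_s * FPS)), len(frames))
    return frames[lo:hi]

# A 5 s clip at 30 fps decodes to 150 frames: FrameAt(2.0) selects frame 60,
# and a 1 s window around t = 2.0 s yields a 30-frame sub-clip.
clip = np.zeros((150, 224, 224, 3), dtype=np.uint8)
assert frame_at(clip, 2.0).shape == (224, 224, 3)
assert video_clip(clip, 1.5, 2.5).shape[0] == 30
```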

3. Training Protocol and Objective

AdaTooler-V-CoT-100k is used exclusively for a single-pass supervised fine-tuning (SFT) run, without an explicit train/val/test split. The base MLLM is initialized from Qwen2.5-VL-7B-Instruct. Training runs for one epoch over the 100,000 trajectories with a batch size of 16 and a fixed learning rate of 5e-5. The SFT loss is defined as:

\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log P_\theta(m_t \mid m_{<t},\ x)

where $m_1, \ldots, m_T$ is the flattened token sequence (thoughts, actions, observations, answer) for one example, conditioned on the multimodal input $x$. This stage is designed to “prime” the model with structured tool-use and coherent multimodal CoT sequences prior to reinforcement learning phases.
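
As a concrete illustration, here is a minimal PyTorch sketch of this token-level objective, assuming the trajectory has already been flattened into token IDs; it is standard causal-LM cross-entropy, not an implementation detail from the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """L_SFT = -sum_t log P(m_t | m_<t, x), averaged over the batch.

    logits:     (batch, T, vocab) next-token distributions from the MLLM,
                already conditioned on the multimodal input x.
    target_ids: (batch, T) flattened trajectory tokens m_1..m_T
                (thoughts, actions, observations, answer).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -token_ll.sum(dim=-1).mean()

# Toy shapes: batch of 2 trajectories, 8 tokens each, vocabulary of 100.
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
print(sft_loss(logits, targets))
```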

4. Tool Benefit Score and Impact on Dataset Usage

The Tool Benefit Score $\Delta S$ is computed for each sample as:

\Delta S_i = S^+(q_i) - S^-(q_i)

where $S^+(q_i)$ and $S^-(q_i)$ are the average accuracies of the reference model (Qwen2.5-VL-72B) over eight runs with and without tool invocation, respectively. Despite this measure, AdaTooler-V-CoT-100k is sampled uniformly during SFT, with no weighting or re-sampling based on $\Delta S$. The Tool Benefit Score is applied only for adaptive reward shaping in the later RL (AT-GRPO) stage, not during initial supervised learning.
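
A minimal sketch of the $\Delta S$ computation follows; run_reference_model is a hypothetical stand-in for evaluating Qwen2.5-VL-72B once on a question, simulated here for illustration rather than taken from the paper's code.

```python
import random

def run_reference_model(question: str, use_tools: bool) -> float:
    """Stub for one evaluation run of the reference model (Qwen2.5-VL-72B).

    A real implementation would run the model with or without vision tools
    and score the answer; here per-run accuracy is simulated for illustration.
    """
    return float(random.random() < (0.8 if use_tools else 0.5))

def tool_benefit_score(question: str, n_runs: int = 8) -> float:
    """Delta S = S+(q) - S-(q): mean accuracy over n_runs with vs. without tools."""
    s_plus = sum(run_reference_model(question, True) for _ in range(n_runs)) / n_runs
    s_minus = sum(run_reference_model(question, False) for _ in range(n_runs)) / n_runs
    return s_plus - s_minus

print(tool_benefit_score("Which stall is centrally aligned?"))
```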

5. Task Schema and Annotation Formatting

Each sample in AdaTooler-V-CoT-100k adheres to a rigorous schema dictated by the prompt template, supporting multi-modality and tool-use traceability:

Field           Content                      Annotation Constraint
Input           image/video + question       Exact modality and natural-language query
Thought_i       reasoning step (free-form)   Stepwise, interpretable, contextually linked
Action_i        TOOL_NAME(parameters)        Explicit valid tool, followed by an Observation
Observation_i   resulting patch/clip         Must correspond to the preceding Action
Answer          text/option                  Final answer in an enforced format

Mandatory filtering rules ensure logical and semantic coherence throughout the trajectory, with all actions, observations, and answers fully accounted for.
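
For illustration, the sketch below parses a serialized trajectory into its fields and enforces the Action/Observation pairing rule; the exact line-oriented serialization is an assumption based on the step labels above.

```python
import re

STEP_RE = re.compile(r"^(Thought|Action|Observation)_(\d+):\s*(.+)$|^Answer:\s*(.+)$")

def parse_trajectory(text: str) -> dict:
    """Parse serialized Thought_i/Action_i/Observation_i/Answer lines (format assumed)."""
    steps, answer = [], None
    for line in filter(None, (ln.strip() for ln in text.splitlines())):
        m = STEP_RE.match(line)
        if not m:
            raise ValueError(f"Malformed line: {line!r}")
        if m.group(4) is not None:    # terminal Answer field
            answer = m.group(4)
        else:
            steps.append((m.group(1), int(m.group(2)), m.group(3)))
    # Pairing rule: every Action_i needs a corresponding Observation_i.
    observed = {i for kind, i, _ in steps if kind == "Observation"}
    for kind, i, _ in steps:
        if kind == "Action" and i not in observed:
            raise ValueError(f"Action_{i} has no matching Observation_{i}")
    if answer is None:
        raise ValueError("Trajectory must terminate in an Answer")
    return {"steps": steps, "answer": answer}

demo = """Thought_1: The image is high resolution; a crop will help.
Action_1: CropImg(x=120, y=340, w=512, h=512)
Observation_1: cropped patch received
Thought_2: The patch shows Stall 4 on the center line.
Answer: Stall 4"""
print(parse_trajectory(demo))
```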

6. Representative Examples

Illustrative corpus entries exemplify the adaptive, tool-aware CoT reasoning structure:

Image-Based Example (High-Resolution Benchmark):

Input: High-res image with query focused on central alignment of stall icons.

  • Thought_1: Recognizes high resolution; proposes tool-use.
  • Action_1: Invokes CropImg on specific region.
  • Observation_1: Receives cropped patch.
  • Thought_2: Proceeds to inspect opposing stall.
  • Action_2: Crops alternate region.
  • Observation_2: Receives patch.
  • Thought_3: Judges alignment.
  • Answer: "Stall 4."

Video-Based Example (Temporal Reasoning Task):

Input: 5s basketball clip, query targeting player speed at a precise moment.

  • Thought_1: Asserts need for specific frame.
  • Action_1: FrameAt(2.0s).
  • Observation_1: Extracted frame for analysis.
  • Thought_2: Measures pixel displacement.
  • Action_2: VideoClip around target window.
  • Observation_2: Clip for speed estimation.
  • Thought_3: Computes velocities; draws conclusion.
  • Answer: "Player B."

This exemplifies the tool-guided CoT annotation paradigm central to AdaTooler-V-CoT-100k’s design.

7. Context, Utility, and Implications

AdaTooler-V-CoT-100k provides a critical cold-start resource for adaptive multimodal reasoning models. The tightly structured, tool-use-aware CoT format addresses prior MLLM limitations, specifically blind invocation of vision tools that increase computational overhead without substantive contribution. By “priming” AdaTooler-V for selective, contextually appropriate tool interactions, the corpus supports scalable, efficient, and verifiable visual reasoning across diverse domains. A plausible implication is that similar annotation schemas may be generalized to future multimodal SFT pipelines emphasizing tool-use rationality and interpretability (Wang et al., 18 Dec 2025). The dataset’s release, along with code and models, sets a reproducible benchmark for continued development in adaptive visual-language modeling.
