
Multi-Modal Tool Learning

Updated 19 January 2026
  • Multi-Modal Tool Learning is a paradigm where agents integrate text, vision, audio, and sensorimotor inputs with explicit tool invocation to dynamically enhance reasoning.
  • It leverages architectures that combine large language/vision models with reinforcement learning, memory augmentation, and adaptive reward strategies for effective tool selection.
  • Experimental evaluations reveal notable accuracy improvements in tasks like image and document analysis, highlighting challenges and future directions in multi-turn and compositional reasoning.

Multi-modal tool learning refers to the class of methods and agent architectures that enable automated systems to coordinate reasoning over text, vision, audio, or sensorimotor modalities with explicit selection and use of computational or physical tools. This paradigm encompasses frameworks for teaching models not only to identify when tool use is warranted, but also to execute step-wise, context-aware sequences in which actions may include generating, modifying, or querying multimodal artifacts (e.g., images, code outputs, sensor signals). Foundational work includes reinforcement learning–based finetuning of large vision-language models (VLMs) for interleaved text and visual tool use (Wu et al., 25 May 2025), agent architectures for robust model selection among multimodal tools (Liu et al., 2023), scalable benchmarks for evaluating tool orchestration (Ma et al., 2024), memory-augmented selection (Xiao et al., 8 Oct 2025), and adaptive reasoning for dynamic tool invocation (Wang et al., 18 Dec 2025). The area spans domains from chatbot-driven system APIs and document analysis to embodied robotics and complex multimodal question answering.

1. Core Principles of Multi-Modal Tool Learning

Multi-modal tool learning is distinguished by several foundational concepts:

  • Multimodal Inputs and Outputs: Agents operate over input spaces that may include natural language, images, videos, audio signals, and sensorimotor data. Outputs can involve text, modified visual artifacts, trajectories, or direct tool invocation results (Wang et al., 2024, Saito et al., 2021).
  • Explicit Tool Invocation: Rather than rigidly processing raw data, a tool-learning agent initiates explicit calls to external functions or APIs (e.g., OCR modules, image classifiers, visual editors). Each invocation transforms or augments the context (e.g., by highlighting, masking, or cropping) in ways that advance reasoning toward the objective (Wu et al., 25 May 2025, Zou et al., 15 Dec 2025).
  • Reasoning–Tool Interleaving: Chain-of-thought (CoT) processes interleave classical reasoning steps (e.g., hypotheses, intermediate conclusions) with tool calls, producing multimodal chains (MMCoT) in which each step may be textual or an executable tool action that alters the state or data stream (Wu et al., 25 May 2025, Ashraf et al., 9 Oct 2025); a minimal sketch of this loop follows the list.
  • Decision Strategies and Selection: Systems range from static tool-sets (chosen a priori) to dynamic selectors that generalize over evolving or unseen tool libraries via embedding–anchored softmax or memory–augmented approaches (Zou et al., 15 Dec 2025, Xiao et al., 8 Oct 2025, Liu et al., 2023).
  • Outcome-Driven Optimization: Rather than rewarding intermediate steps, frameworks such as VTool-R1 train only on outcome-based final accuracy, avoiding process-based reward shaping to mitigate reward hacking and encourage adaptive, context-sensitive tool use (Wu et al., 25 May 2025, Wang et al., 18 Dec 2025).
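
As a concrete illustration of the interleaving principle above, the following is a minimal, runnable sketch of an MMCoT-style agent loop. It is a toy under stated assumptions: the `TOOL:name:args` string protocol, the scripted stand-in for the VLM, and the string-based "image edits" are illustrative inventions, not the interface of VTool-R1 or any cited system.

```python
import re

# Toy tool registry: each tool edits the visual context. Real toolkits
# (e.g., VTool-R1's Python-based editor) return actual modified images.
TOOLS = {
    "crop":      lambda image, arg: f"{image}|crop({arg})",
    "highlight": lambda image, arg: f"{image}|highlight({arg})",
}

def scripted_vlm(context, _script=iter([
        "The axis labels are too small to read; zoom in on the legend.",
        "TOOL:crop:x=120,y=40,w=200,h=80",
        "ANSWER: 64.0",
])):
    """Stand-in for a VLM decoding step; replays a canned trajectory.
    A real agent would condition on `context` and sample from the model."""
    return next(_script)

def run_mmcot(question, image, generate, max_steps=8):
    """Interleave text reasoning with tool calls: each tool invocation edits
    the image, and the edited artifact is re-ingested into the context."""
    context = [("text", question), ("image", image)]
    for _ in range(max_steps):
        step = generate(context)
        call = re.match(r"TOOL:(\w+):(.*)", step)
        if call:
            name, arg = call.groups()
            image = TOOLS[name](image, arg)
            context.append(("image", image))       # edited artifact
        elif step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        else:
            context.append(("text", step))         # ordinary CoT step
    return None

print(run_mmcot("What is the 2023 value?", "chart.png", scripted_vlm))  # -> 64.0
```

The essential property is that each tool call mutates the shared context, so later reasoning steps condition on the edited artifact rather than only the raw input.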

2. Architectures and Algorithms

Multi-modal tool learning architectures integrate LLM/VLM controllers with multimodal encoders, tool dictionaries, and (in some cases) retriever or memory modules. Key instantiations include:

  • Reinforcement Learning Finetuning (RFT) for Multimodal Tool Use: VTool-R1 consists of a decoder-only VLM (Qwen2.5-VL) operating under policy $\pi_\theta$, equipped with a Python-based visual editing toolkit $T$. The model, at each decoding step, chooses between emitting text or invoking $T$, forming MMCoT sequences conditioned on both raw and edited images. Policy optimization balances expected task reward and a KL penalty against a reference policy via GRPO (Wu et al., 25 May 2025).
  • Trajectory-Based Agent Training: MATRIX introduces a pipeline for synthesizing large corpora of multimodal trajectories and verified preference pairs. Training entails supervised fine-tuning on step-wise (thought, action) records from M-TRACE, followed by direct preference optimization on step-level candidate pairs (Pref-X) (Ashraf et al., 9 Oct 2025).
  • Model Selection via Graph Neural Networks: The M³ framework embeds multi-modal inputs and candidate tools into node features and scores possible tool assignments over a task graph via a learned GNN, ranking to maximize end-to-end success (Liu et al., 2023).
  • Memory-Augmented Selection: ToolMem uses an explicit repository $\mathcal{M}$ of natural-language summaries (categorized as proficient/good/bad/weak), updated and retrieved via vector similarity for tool performance prediction and context-sensitive selection (Xiao et al., 8 Oct 2025); a retrieval sketch follows the table below.
  • Adaptive RL for Tool Use: AdaTooler-V extends GRPO with per-sample Tool Benefit Scores $\Delta S_i$, penalizing unnecessary tool calls and rewarding helpful ones, so that the model learns when to invoke visual tools adaptively based on per-query efficacy (Wang et al., 18 Dec 2025); a minimal reward sketch follows this list.
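
To make the outcome-driven optimization concrete, below is a minimal sketch of GRPO-style group-relative advantages combined with an illustrative per-sample tool-benefit bonus in the spirit of AdaTooler-V's $\Delta S_i$. The reward decomposition and the `alpha` weighting are assumptions for exposition, not the papers' exact formulation; the full GRPO objective additionally penalizes KL divergence from a reference policy, which is omitted here.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward against
    the group mean and std, as in GRPO (no learned value function)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

def outcome_reward(answer, gold, used_tool, tool_benefit, alpha=0.5):
    """Outcome-based reward plus an illustrative tool-benefit bonus.
    `tool_benefit` stands in for a per-sample benefit score: positive when
    tool calls helped the query, negative when they were unnecessary."""
    correct = float(answer == gold)
    return correct + (alpha * tool_benefit if used_tool else 0.0)

# Example: a group of four rollouts sampled for one query (gold answer "64").
rollouts = [
    {"answer": "64", "used_tool": True,  "benefit": +0.3},  # tool helped
    {"answer": "64", "used_tool": False, "benefit":  0.0},  # solved without
    {"answer": "12", "used_tool": True,  "benefit": -0.2},  # wasted call
    {"answer": "12", "used_tool": False, "benefit":  0.0},
]
rewards = [outcome_reward(r["answer"], "64", r["used_tool"], r["benefit"])
           for r in rollouts]
print(grpo_advantages(rewards))  # correct, tool-justified rollouts rank highest
```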

A table categorizing major architectures:

| Framework | Tool Selection Mechanism | Multimodal Integration Approach |
|---|---|---|
| VTool-R1 (Wu et al., 25 May 2025) | RL policy (GRPO), MMCoT | Text & image dual-channel |
| MATRIX (Ashraf et al., 9 Oct 2025) | ReAct + DPO preference learning | Vision-language transformer |
| ToolMem (Xiao et al., 8 Oct 2025) | Memory-based contextual scoring | Embedding retrieval |
| AdaTooler-V (Wang et al., 18 Dec 2025) | RL with adaptive reward | Interleaved CoT + tool use |
| M³ (Liu et al., 2023) | Task-graph GNN ranking | Node embedding fusion |
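
The memory-based row above can be made concrete with a small sketch of ToolMem-style selection. The bag-of-words `embed` function, the note texts, and the scoring rule are stand-in assumptions; ToolMem itself retrieves learned embeddings over categorized natural-language capability summaries.

```python
import numpy as np

def embed(text):
    """Toy stand-in encoder: hashed bag-of-words, L2-normalized.
    Replace with a real sentence embedder in practice."""
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Memory of natural-language capability notes, one (tool, note) pair each;
# ToolMem categorizes such notes as proficient/good/bad/weak.
memory = [
    ("ocr_tool",   "proficient at reading dense printed text in scans"),
    ("ocr_tool",   "weak on handwritten notes and low-resolution images"),
    ("chart_tool", "good at extracting values from bar and line charts"),
]

def select_tool(task):
    """Score each tool by its best note-to-task cosine similarity and
    return the highest-scoring tool."""
    q = embed(task)
    best = {}
    for tool, note in memory:
        best[tool] = max(best.get(tool, -1.0), float(q @ embed(note)))
    return max(best, key=best.get)

print(select_tool("extract the values from this bar chart"))  # chart_tool
```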

3. Datasets and Benchmarks

Dataset and benchmark construction is pivotal. Notable contributions include:

  • ToolMMBench, collected for MLLM-Tool, comprises 932 multiclass APIs across 29 coarse-grained tasks; ambiguity types and multimodal inputs are systematically annotated (Wang et al., 2024).
  • M-TRACE (MATRIX), a corpus of 28.5K multimodal tasks with 177K trajectories including images, code, and tables; complemented by Pref-X with 11K step-wise preference pairs (Ashraf et al., 9 Oct 2025).
  • MS-GQA (M³) targets robust model selection, spanning 8,426 instances with 70 model assignments each; probes resilience under increasing selection difficulty and sparse supervision (Liu et al., 2023).
  • AdaTooler-V-300k/CoT-100k supports RL/SFT for video and image benchmark coverage, incorporating chart understanding, math reasoning, OCR, spatial and logical tasks, and multimodal counting (Wang et al., 18 Dec 2025).
  • m&m’s, a benchmark of 4,427 raw tasks (882 in the core evaluation), tests planning strategies and feedback mechanisms with 33 tools (machine-learning models, web APIs, image-processing modules) mapped onto real input samples (Ma et al., 2024).

These datasets enable standardized reporting for tool accuracy, grounding, faithfulness, pass rate, and outcome-based success metrics.
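For illustration, the snippet below computes two of these metrics, tool-selection accuracy and outcome pass rate, over logged trajectories; the record schema is hypothetical and not drawn from any cited benchmark.

```python
from statistics import mean

# Hypothetical trajectory log: gold vs. predicted tool, plus task outcome.
trajectories = [
    {"gold_tool": "ocr",  "pred_tool": "ocr",  "success": True},
    {"gold_tool": "crop", "pred_tool": "ocr",  "success": False},
    {"gold_tool": "crop", "pred_tool": "crop", "success": True},
]

tool_acc  = mean(t["pred_tool"] == t["gold_tool"] for t in trajectories)
pass_rate = mean(t["success"] for t in trajectories)
print(f"tool-selection accuracy: {tool_acc:.2f}, pass rate: {pass_rate:.2f}")
```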

4. Experimental Results and Analyses

Experimental findings consistently demonstrate gains from multimodal tool learning formulations:

  • VTool-R1 (Wu et al., 25 May 2025): Improves ChartQA accuracy from 51.8% to 64.0% and TableVQA from 41.3% to 57.9% with MMCoT sequences. RL finetuning with outcome-based rewards is essential; process-based penalties discourage tool use.
  • AdaTooler-V (Wang et al., 18 Dec 2025): Achieves 89.8% accuracy on the high-resolution V* benchmark, outperforming GPT-4o (65.2%) and Gemini Pro (71.7%) on image benchmarks. The adaptive reward yields +4.4 pts over RL without tools; ablations confirm robust performance for $\alpha \in [0.4, 0.8]$.
  • ToolMem (Xiao et al., 8 Oct 2025): Predictive-score MAE dropped by 14.8% for text generation and 28.7% for image generation; tool-selection accuracy improved by 21–24 pp over baselines; a retrieval-size ablation indicates best results at $k \approx 12$.
  • MATRIX (Ashraf et al., 9 Oct 2025): On Agent-X, grounding improved from 0.51 (Qwen2-VL-7B) to 0.59, tool selection from 0.54 to 0.91, faithfulness from 0.41 to 0.71, and outcome success from 0.38 to 0.71. GTA and GAIA benchmarks showed +23 pp and +11.8 pp gains in answer accuracy.
  • M³ (Liu et al., 2023): On MS-GQA, the successful execution rate rose from 66% (MetaGL) to 68.7% (M³), with the advantage sustained under reduced feasible assignments and severe annotation sparsity.
  • MLLM-Tool (Wang et al., 2024): Llama-13B/Vicuna-13B reached 88.19%/87.86% top-1 accuracy; sub-analysis revealed near-perfect accuracy for audio, over 90% for the image modality, and superior performance in multi-option cases.

A table summarizing key numerical comparisons:

| Model | Main Metric(s) | Baseline | Result | Δ |
|---|---|---|---|---|
| VTool-R1 | ChartQA / TableVQA acc. (%) | 51.8 / 41.3 | 64.0 / 57.9 | +12.2 / +16.6 pp |
| AdaTooler-V | V* high-res acc. (%) | 65.2 (GPT-4o) | 89.8 | +24.6 pp |
| ToolMem | Tool selection acc. (%) | 6–9 | 27–33 | +21–24 pp |
| MATRIX | Outcome success S_acc (Agent-X) | 0.38 | 0.71 | +33 pp |
| M³ | SER, MS-GQA (%) | 66.0 (MetaGL) | 68.7 | +2.7 pp |
| MLLM-Tool | Top-1 acc. (%) | 83–84 (7B) | 88 (13B) | +5 pp |

5. Limitations and Common Failure Modes

  • Restricted Toolsets: Many frameworks restrict the tool library to simple visual edits, static APIs, or closed sets; generalization beyond this scope requires embedding-based selectors or zero-shot discovery (Wu et al., 25 May 2025, Zou et al., 15 Dec 2025).
  • Single-turn Limitation: Most studies evaluate only single-turn inference; multi-round editing, dynamic chaining, and multi-agent tool composition remain open avenues (Wang et al., 2024).
  • Long-tail and Data Scarcity: Rare modalities (video, sensorimotor, specialized APIs) suffer from limited high-quality annotation and imbalanced datasets (Wang et al., 2024).
  • Reward Hacking and Over-selection: Reward shaping can induce superficial tool invocation or preclude tool use entirely; outcome-driven or benefit-weighted RL strategies mitigate but not fully eliminate these risks (Wu et al., 25 May 2025, Wang et al., 18 Dec 2025).
  • Model Drift and Stale Feedback: Memory-based methods may overfit to historical capabilities, losing adaptability when tool updates outpace memory refresh rates (Xiao et al., 8 Oct 2025).
  • Interpretive Failures: Agents frequently misinterpret compositional instructions, execute suboptimal plans, or fail to recover from hallucinated intermediate states (Ashraf et al., 9 Oct 2025, Ma et al., 2024).

6. Future Directions and Open Research Problems

Prominent challenges and research frontiers for multi-modal tool learning include:

  • Enhanced Tool Libraries: Extension to richer APIs including generative models, inpainting, code synthesis, and in-the-wild sensorimotor primitives (Wu et al., 25 May 2025).
  • Multi-round and Compositional Reasoning: Enabling agents to execute and revise multi-turn sequences, recursively re-ingesting intermediate outputs and constructing dynamic computation graphs (Ma et al., 2024, Zou et al., 15 Dec 2025).
  • Scalable Supervision and Labeling: Leveraging automated LLM-based verifiers, synthesizing preference pairs, and exploring weak or semi-supervised training for large-scale trajectory data (Ashraf et al., 9 Oct 2025, Zou et al., 15 Dec 2025).
  • Memory Consolidation and Adaptation: Theoretical analysis of memory refresh, consolidation, and its impact on long-term tool selection accuracy; integrating human-in-the-loop calibration (Xiao et al., 8 Oct 2025).
  • Robustness and Model Selection: Advancing model selectors to operate per-node, semi-supervised, or via integrated LLM prompting; refining selection under runtime and data-availability constraints (Liu et al., 2023).
  • Adaptive Reward Models: Developing learned or ensemble-based benefit estimators, reward discriminators, and contextual evaluation for open-ended tasks or outputs (Wang et al., 18 Dec 2025).
  • Zero-shot and Continual Tool Discovery: Exploring mechanisms for agents to discover, learn, and use previously unseen tools without explicit retraining (Zou et al., 15 Dec 2025, Wang et al., 2024).

A plausible implication is that future progress will depend on agents that combine scalable multimodal encoder architectures, outcome- and benefit-sensitive RL, memory or embedding-based discovery mechanisms, and robust context tracking, extending tool learning from structured QA and template-driven planning to autonomous, open-ended execution in real-world domains.
