Agentic Tool-Calling Framework

Updated 30 June 2025
  • Agentic Tool-Calling Framework is an integrated system where a central text-only LRM dynamically selects and invokes external tools for multi-step reasoning in long video analysis.
  • It modularizes perception by delegating tasks like video clip and subtitle retrieval to specialized tools, enabling efficient and interpretable processing.
  • Empirical results show significant performance gains on long video benchmarks, demonstrating enhanced scalability and reduced computational costs compared to traditional MLLMs.

The Agentic Tool-Calling Framework refers to integrated systems in which a central agent—typically a text-only Large Reasoning Model (LRM)—actively orchestrates problem-solving by selecting, invoking, and integrating specialized external tools over the course of a multi-step reasoning process. In the context of long video understanding, as described in "VideoDeepResearch: Long Video Understanding With Agentic Tool Using," this framework addresses the challenge of complex, multimodal inference over extended temporal sequences by dynamically delegating perception tasks to a modular multimodal toolkit, while centralizing planning and reasoning in the LRM agent. The approach diverges from previous paradigms by decoupling cognition (planning, task decomposition) from perception (visual/audio interrogation), enabling more interpretable, efficient, and effective processing of long videos.

1. Agentic Framework for Long Video Understanding

VideoDeepResearch exemplifies an agentic system where the LRM agent is responsible for receiving natural-language tasks and methodically developing a solution strategy. Rather than statically consuming the entire video or relying on context-window scaling in multimodal LLMs (MLLMs), the agent incrementally constructs understanding by:

  • Formulating intermediate information needs in natural language ("thoughts").
  • Selecting the most appropriate tool for each subtask ("actions"), such as retrieval, local analysis, or global browsing, depending on the current reasoning state and partial results.
  • Iteratively integrating information from tool outputs, updating the working memory/context after each step.
  • Concluding with a final answer when sufficient evidence is accumulated.

This process operationalizes long video understanding as a series of progressive, rational information-seeking maneuvers, rather than as a monolithic pattern-matching task.
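
To make the thought/action exchange concrete, one might represent each step as a small structured record, as in the sketch below; the field names and example values are illustrative assumptions rather than a format prescribed by the paper.

```python
# Illustrative record for one reasoning step ("thought" plus chosen "action");
# field names and example values are assumptions for clarity only.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    thought: str                                   # natural-language plan, hypothesis, or uncertainty
    action: str                                    # tool name, or "answer" to terminate
    arguments: dict = field(default_factory=dict)  # tool arguments, if any
    observation: str | None = None                 # tool output, merged into the working context

step = ReasoningStep(
    thought="The question concerns the opening scene; retrieve matching clips first.",
    action="video_clip_retriever",
    arguments={"query": "opening scene, city skyline", "top_k": 3},
)
```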

2. Architecture: Modular Multimodal Toolkit

The agentic framework is underpinned by a suite of modular and reusable external tools:

  • Video Clip Retriever ($\mathcal{R}_v$): Given a query $q$, retrieves the most relevant pre-segmented video clips ($\mathcal{V}_{\text{ret}}$), focusing attention and computation on salient intervals rather than exhaustive frame-by-frame analysis.
  • Subtitle Retriever ($\mathcal{R}_s$): Searches subtitle text for segments pertinent to the query ($\mathcal{S}_{\text{ret}}$), supporting joint speech–vision inference.
  • Visual Perceiver ($\mathcal{P}_c$): Answers localized visual questions over targeted short clips, enabling fine-grained analysis (e.g., object recognition, action identification).
  • Subtitle Extractor ($\mathcal{E}_s$): Extracts all subtitle entries within a specified temporal window, providing temporal flexibility in aligning audio and visual reasoning.
  • Video Browser ($\mathcal{P}_b$): Provides high-level, global content understanding via downsampled frame analysis, complementing the fine-grained tools.

The LRM’s reasoning loop is empowered to selectively—and sequentially—invoke one or more of these tools per step, composing a tailored sequence of actions per task.
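
As an illustration of how such a toolkit could be surfaced to the agent, the sketch below wraps two of the five tools behind a uniform callable interface and a name-keyed registry; the class names, signatures, and placeholder outputs are assumptions, since the source describes the tools' roles rather than a concrete code API.

```python
# Sketch of a uniform tool interface and registry for the modular toolkit.
# Names, signatures, and placeholder outputs are illustrative assumptions.
from typing import Protocol

class Tool(Protocol):
    name: str
    def __call__(self, **kwargs) -> str: ...

class VideoClipRetriever:
    """R_v: retrieve the pre-segmented clips most relevant to a text query."""
    name = "video_clip_retriever"
    def __call__(self, query: str, top_k: int = 5) -> str:
        return f"top-{top_k} clips for {query!r}"           # placeholder output

class VisualPerceiver:
    """P_c: answer a localized visual question over a targeted short clip."""
    name = "visual_perceiver"
    def __call__(self, clip_id: int, question: str) -> str:
        return f"answer to {question!r} on clip {clip_id}"  # placeholder output

# SubtitleRetriever (R_s), SubtitleExtractor (E_s), and VideoBrowser (P_b)
# would follow the same pattern.
TOOLS: dict[str, Tool] = {t.name: t for t in (VideoClipRetriever(), VisualPerceiver())}
```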

3. Reasoning Model and Inference Process

At the core of the agentic process is a powerful text-only LRM, such as DeepSeek-R1. The agent operates in a structured "thought–action" loop:

  1. Initialization: Segments the video and loads subtitle/context resources.
  2. Iterative Reasoning: At each round, the LRM outputs a natural-language "thought" (reflective plan, hypothesis, or uncertainty), chooses the next "action" (tool call, further search, answer attempt), and receives the output, which is merged into the evolving context H\mathcal{H}.
  3. Termination: When the agent’s internal policy signals task completion (via an "answer" action), the result is emitted.

Algorithmically, the process is formalized as:

Given a query $q$, video $V$, subtitle corpus $\mathcal{C}_s$, and agent instruction $\mathcal{I}$:

  1. $\mathcal{C}_v \leftarrow \mathrm{ExtractClips}(V)$; initialize $\mathcal{H} \leftarrow \{\mathcal{I}, q\}$ and $a \leftarrow \text{None}$.
  2. While $a = \text{None}$: obtain $(\text{thought}, \text{action}) \leftarrow \mathrm{LRM}(\mathcal{H})$.
  3. If $\text{action} = \text{answer}$: set $a$ to the answer, output it, and stop.
  4. Otherwise: execute the selected tool and merge its output into $\mathcal{H}$; continue the loop.

Each tool invocation and output is logged and possibly leveraged for future reasoning cycles, enabling sophisticated multi-hop and evidence-accumulating strategies.
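
As a concrete illustration, the following is a minimal Python sketch of this loop under assumed interfaces: `call_lrm` stands in for the text-only LRM and `execute_tool` for the toolkit dispatcher; neither name nor signature comes from the paper.

```python
# Minimal sketch of the thought-action loop formalized above. `call_lrm` and
# `execute_tool` are hypothetical stand-ins, not the paper's interfaces.

def call_lrm(history: list[dict]) -> dict:
    """Stand-in for the text-only LRM: returns a thought plus the next action."""
    # A real implementation would send `history` to a reasoning model and parse
    # its structured reply into {"thought", "action", "arguments"/"content"}.
    return {"thought": "Evidence is sufficient.", "action": "answer", "content": "example"}

def execute_tool(name: str, arguments: dict) -> str:
    """Stand-in dispatcher over the modular toolkit (retrievers, perceiver, browser)."""
    return f"[output of {name} with {arguments}]"

def agent_loop(question: str, instruction: str, max_steps: int = 16) -> str:
    """Iterate thought -> action -> observation, updating the context H each round."""
    history = [{"role": "system", "content": instruction},  # H initialized with {I, q}
               {"role": "user", "content": question}]
    for _ in range(max_steps):
        step = call_lrm(history)                            # (thought, action) <- LRM(H)
        history.append({"role": "assistant", "content": step["thought"]})
        if step["action"] == "answer":                      # termination signal
            return step["content"]
        observation = execute_tool(step["action"], step.get("arguments", {}))
        history.append({"role": "tool", "content": observation})  # merge output into H
    return "no answer within the step budget"

print(agent_loop("What happens after the chase scene?", "You are a video analysis agent."))
```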

4. Problem-Solving Strategies: Progressive, Multi-Hop Tool Use

Contrasting with previous static or purely retrieval-augmented paradigms, the agentic framework in VideoDeepResearch dynamically grows the active working set of video segments ($\tilde{\mathcal{X}}$):

$\tilde{\mathcal{X}} = \{X_{i_1}, \dots, X_{i_s}\}, \quad \mathcal{R} = \{R_1, \dots, R_k\}$

where each $R_j$ denotes a reasoning action, such as further localization ("find where event X occurs"), focus refinement ("zoom in on the segment with object Y"), or expansion ("retrieve subtitles at time interval $[t_0, t_1]$"). The stopping condition is learned: if the synthesized evidence suffices for a solution, the agent answers; otherwise, the loop continues.

A key behavior is that the relevant content is rarely known a priori; rather, it is discovered through adaptive exploration, which significantly improves both efficiency and answer quality, especially on multi-step and compositional queries.
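
A hypothetical trace of this adaptive exploration might look as follows; the query, clip indices, timestamps, and answer are invented purely to illustrate the localization, refinement, and expansion pattern.

```python
# Invented multi-hop trace illustrating progressive tool use; not real data.
trace = [
    {"thought": "Locate where the dessert is plated.",                    # localization
     "action": "video_clip_retriever", "args": {"query": "chef plates dessert"}},
    {"thought": "Clip 41 looks relevant; inspect the garnish closely.",   # focus refinement
     "action": "visual_perceiver", "args": {"clip_id": 41, "question": "What garnish is added?"}},
    {"thought": "Check what the narrator says around that moment.",       # expansion
     "action": "subtitle_extractor", "args": {"span": [3710.0, 3745.0]}},
    {"thought": "Visual and subtitle evidence agree; answer.",
     "action": "answer", "args": {"content": "Mint leaves."}},
]
```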

5. Evaluation and Impact

Empirical evaluation is conducted on standard long video understanding benchmarks, including MLVU, LVBench, VideoMME (long), and LongVideoBench. The agentic approach demonstrates:

  • Substantial improvements over prior state-of-the-art, e.g., +9.6% on MLVU (test), +6.6% on LVBench, +3.9% on LongVideoBench.
  • Superior performance using only 17–25% of the visual tokens processed by conventional MLLMs, indicating high computational efficiency.
  • Robustness to video length: as duration scales, performance of traditional MLLMs decays significantly (>12 points); the agentic approach loses only ~4.9 points, suggesting scalability.
  • Strongest results in fine-grained, multi-hop, or retrieval-centric subtasks, often outperforming even the largest proprietary models when paired with advanced perception components (e.g., Seed1.5VL-pro, Qwen2.5VL-7B).

6. Advantages and Design Principles

The agentic tool-calling framework, as instantiated in VideoDeepResearch, yields multiple advantages for long video understanding (LVU):

  • Task–Tool Decoupling: Clear separation between reasoning (LRM) and perception (specialized tools) allows leveraging state-of-the-art vision or retrieval backends without retraining the agent (see the sketch after this list).
  • Efficiency: Only relevant content is retrieved and analyzed per query, keeping computational costs low and memory usage independent of total video length.
  • Generalizability: The architecture is modular; any reasoning model or visual tool can be slotted in, facilitating rapid system improvement and adaptation to new domains.
  • Interpretability: Each step in the reasoning process is explicit, enabling traceability, error analysis, and insight into the agent’s decision policies.
  • No reliance on large, hard-to-train all-in-one MLLMs: Open-source agentic systems can surpass proprietary end-to-end models on LVU benchmarks, using only modular, off-the-shelf multi-modal tools.
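
As a brief sketch of task–tool decoupling in practice, the snippet below swaps the visual-perception backend by configuration while the agent-side code stays untouched; the factory and its wiring are assumptions, and only the backend names come from the text above.

```python
# Illustrative factory: swap the visual-perception backend without changing
# the agent. The wiring is assumed; only the backend names come from the text.
PERCEPTION_BACKENDS = {
    "qwen2.5-vl-7b": "open-source visual perceiver",
    "seed1.5-vl-pro": "stronger proprietary visual perceiver",
}

def build_visual_perceiver(backend: str):
    """Return a callable perceiver for the configured backend (hypothetical)."""
    label = PERCEPTION_BACKENDS[backend]
    return lambda clip_id, question: f"[{label}] answer to {question!r} on clip {clip_id}"

perceive = build_visual_perceiver("qwen2.5-vl-7b")  # agent code is unchanged by this choice
```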

7. Mathematical Formalism and Inference Loop

Key mathematical relationships describe the functional interfaces between the agent and its tools:

  • Retrieved Video Clips:

$\mathcal{V}_{\text{ret}} = \mathcal{R}_v(q \mid \mathcal{C}_v)$

  • Subtitle Retrieval:

$\mathcal{S}_{\text{ret}} = \mathcal{R}_s(q \mid \mathcal{C}_s)$

  • Visual Perception of Spanned Segment:

$T_A = \mathcal{P}_c(q \mid C_{[t_0, t_1]})$

  • Subtitle Extraction by Time Range:

$\mathcal{S}_t = \mathcal{E}_s([t_0, t_1] \mid \mathcal{C}_s)$

Within the iterative agent loop, these functions serve as the agent's "hands and eyes": the central LRM composes them, recurses over their outputs, and ultimately synthesizes its final answer from these elemental calls.
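
Read as function signatures, the four interface equations correspond to calls of roughly the following shape; the Python names, types, and placeholder bodies are assumptions chosen to mirror the notation, not the paper's implementation.

```python
# Hypothetical Python signatures mirroring the interface equations above;
# types and placeholder bodies are assumptions, not the paper's implementation.
from typing import List, Tuple

Subtitle = Tuple[float, float, str]  # (start_sec, end_sec, text) -- assumed representation

def retrieve_clips(q: str, clip_corpus: List[str]) -> List[str]:
    """V_ret = R_v(q | C_v): clips from the pre-segmented corpus relevant to q."""
    return [c for c in clip_corpus if q.lower() in c.lower()]          # placeholder matcher

def retrieve_subtitles(q: str, subtitle_corpus: List[Subtitle]) -> List[Subtitle]:
    """S_ret = R_s(q | C_s): subtitle segments pertinent to q."""
    return [s for s in subtitle_corpus if q.lower() in s[2].lower()]   # placeholder matcher

def perceive_segment(q: str, segment: str) -> str:
    """T_A = P_c(q | C_[t0, t1]): answer a localized visual question on one segment."""
    return f"answer to {q!r} on {segment}"                             # placeholder answer

def extract_subtitles(span: Tuple[float, float], subtitle_corpus: List[Subtitle]) -> List[Subtitle]:
    """S_t = E_s([t0, t1] | C_s): all subtitle entries overlapping the time window."""
    t0, t1 = span
    return [s for s in subtitle_corpus if s[0] < t1 and s[1] > t0]
```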


The agentic tool-calling framework, as described in VideoDeepResearch, demonstrates that high-level reasoning and dynamic tool orchestration—rather than monolithic end-to-end neural models—can be an effective, efficient, and modular approach for tackling the inherent complexity of long video understanding. This framework enables scalable, interpretable, and state-of-the-art solutions by leveraging explicit, adaptive multi-step reasoning in conjunction with specialized perception and retrieval tools.