VCA: Video Curious Agent for Long Video Understanding (2412.10471v2)

Published 12 Dec 2024 in cs.CV and cs.AI

Abstract: Long video understanding poses unique challenges due to the temporal complexity and low information density of long videos. Recent works address this task by sampling numerous frames or by incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages the VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach's superior effectiveness and efficiency.



Summary

  • The paper introduces a curiosity-driven agent that uses tree-search exploration to selectively sample relevant video segments.
  • It integrates an intrinsic reward model with memory management to efficiently capture crucial visual details without external feedback.
  • Experimental results on benchmarks show superior accuracy and efficiency compared to uniform sampling and alternative agent-based methods.

This paper introduces VCA (Video Curious Agent) (2412.10471), a novel framework designed to address the challenges of understanding long videos, such as temporal complexity and low information density. Unlike traditional methods that rely on dense or uniform frame sampling, which are often inefficient and computationally expensive, VCA adopts a curiosity-driven, self-exploration approach inspired by human selective attention and working memory.

The core idea behind VCA is to treat the video as an environment to be explored actively, rather than passively processing sampled frames. The framework is built upon vision-language models (VLMs) and comprises three main components:

  1. Tree-search Exploration: VCA utilizes a segment-based tree-search structure to navigate the video content. Starting from the entire video as the root, the agent iteratively samples frames from a selected segment to create finer sub-segments. This allows for adaptive exploration, focusing on potentially relevant parts of the video in a coarse-to-fine manner.
  2. Intrinsic Reward Model: To guide the exploration, VCA employs a reward model (implemented using the same VLM as the agent) that scores the relevance of each sub-segment to the user query. This self-generated intrinsic reward allows the agent to prioritize segments that are likely to contain crucial information, without relying on external feedback. Chain-of-thought prompting is used to encourage the reward model to explain its scoring. Historical reward scores are incorporated to maintain consistency.
  3. Memory Management: A fixed-size memory buffer is used to store the most relevant frames encountered during exploration. When new frames are added and the buffer exceeds its limit, frames with the lowest relevance scores are discarded. Although frames are removed, their visual information, encoded as text descriptions by the VLM, is retained in the exploration history, enabling reasoning within resource constraints and preventing memory overflow.
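
The segment splitting of step 1 and the fixed-size buffer of step 3 can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `Segment`, `MemoryBuffer`, and the relevance scores passed to `add` are hypothetical stand-ins for what the VLM-based reward model would produce.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds


def split_segment(seg: Segment, n_children: int = 4) -> list[Segment]:
    """Tree-expansion step: split a segment into equal-length sub-segments."""
    step = (seg.end - seg.start) / n_children
    return [Segment(seg.start + i * step, seg.start + (i + 1) * step)
            for i in range(n_children)]


@dataclass
class MemoryBuffer:
    """Fixed-size buffer of (relevance score, frame id) pairs."""
    capacity: int
    frames: list[tuple[float, str]] = field(default_factory=list)

    def add(self, score: float, frame_id: str) -> None:
        self.frames.append((score, frame_id))
        # Keep the highest-scoring frames; evict the least relevant ones
        # once the buffer exceeds its capacity.
        self.frames.sort(key=lambda f: f[0], reverse=True)
        del self.frames[self.capacity:]
```

In the full framework, evicted frames are not lost entirely: their VLM-generated text descriptions remain in the exploration history, as noted in step 3.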

The exploration agent, also a VLM, uses the candidate segments (with their reward scores) and the frames in the memory buffer to decide its next action. It can either select a segment for deeper exploration or, if sufficient information is gathered, output the final answer to the query. The agent is designed to use the reward scores as guidance but can make independent decisions, including backtracking to earlier segments if a current path appears suboptimal. This balances exploitation of high-reward segments with exploration of potentially overlooked areas.
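
The coarse-to-fine loop described above might look like the following sketch. The `score_segment` stub is a hypothetical stand-in for the VLM intrinsic reward, and the purely greedy descent omits the backtracking, memory buffer, and answer-generation steps of the actual agent.

```python
import random


def score_segment(segment: tuple[float, float], query: str) -> float:
    """Hypothetical stand-in for the VLM reward model: relevance in [0, 1]."""
    rng = random.Random(hash((segment, query)))
    return rng.random()


def explore(query: str,
            segment: tuple[float, float] = (0.0, 600.0),
            depth: int = 0,
            max_depth: int = 3,
            n_children: int = 4) -> tuple[float, float]:
    """Coarse-to-fine tree search: descend into the highest-reward sub-segment."""
    if depth == max_depth:
        # Finest granularity reached; the real agent would now answer
        # the query from the frames retained in memory.
        return segment
    start, end = segment
    step = (end - start) / n_children
    children = [(start + i * step, start + (i + 1) * step)
                for i in range(n_children)]
    scored = [(score_segment(child, query), child) for child in children]
    # Greedy descent shown here; VCA's agent treats scores only as guidance
    # and may backtrack to earlier segments when a path looks unpromising.
    best = max(scored)[1]
    return explore(query, best, depth + 1, max_depth, n_children)
```

With the defaults above, three levels of four-way splits narrow a 600-second video down to a single 9.375-second segment while scoring only 12 sub-segments in total.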

VCA is presented as a training-free and flexible framework compatible with any VLM. The authors demonstrate its effectiveness using GPT-4o and show that its benefits extend to open-source models like Qwen2-VL (using a hybrid approach where a video assistant VLM extracts visual information and an instruction LLM selects segments).

Experimental results on long video benchmarks like EgoSchema and LVBench show that VCA achieves superior accuracy compared to baseline VLMs that use uniform sampling, while observing significantly fewer frames. It also outperforms other agent-based methods like VideoAgent and VideoTree (even when these are re-implemented using GPT-4o for a fair comparison). The ablation studies confirm the crucial contributions of both the tree-search exploration and the reward model to the framework's performance and efficiency.

Analysis of VCA's behavior indicates that the intrinsic reward model provides reliable guidance, often highlighting segments close to the ground truth. The agent's independent decision-making process further refines these choices, leading to selections that are even more aligned with the ground truth. Case studies illustrate the agent's adaptive exploration, including backtracking. Common failure modes include difficulty with subtle visual details, being misled by inaccurate reward scores, and limitations in the VLM's multi-modal reasoning even when relevant information is observed. The authors show that performance significantly improves when provided with ground truth relevance scores, suggesting that developing more accurate, specialized reward models is a promising direction for future work.
