- The paper introduces Ego-R1, a framework built on Chain-of-Tool-Thought reasoning that enables modular, dynamic reasoning over ultra-long egocentric videos.
- It employs a two-stage training paradigm combining supervised finetuning and reinforcement learning to optimize step-by-step tool selection.
- The approach integrates specialized models for temporal retrieval and detailed visual analysis, achieving scalable video understanding.
The paper introduces Ego-R1, a novel framework for reasoning over ultra-long egocentric videos that leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an agent, Ego-ReaLM, trained through reinforcement learning (RL). The approach is inspired by human problem-solving strategies, in particular the decomposition of complex reasoning into modular steps: at each step, the agent invokes a specific tool to answer a sub-question, such as one requiring temporal retrieval or multi-modal understanding.
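A minimal sketch of what one CoTT reasoning loop could look like, assuming generic agent and tool interfaces (the `Step` structure, `agent.next_step`, and `agent.finalize` are illustrative names, not the paper's actual API):

```python
# Illustrative chain-of-tool-thought loop: the agent alternates between emitting
# a reasoning step (thought + optional tool call) and observing the tool's
# output, until it decides to answer. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str        # natural-language reasoning for this step
    tool: str | None    # name of the tool to invoke, or None to answer directly
    arguments: dict     # tool-specific arguments (e.g. a query or time range)

def answer_question(agent, tools, question, max_steps=10):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = agent.next_step(history)          # LLM proposes thought + tool call
        history.append({"role": "assistant", "content": step.thought})
        if step.tool is None:                    # agent is ready to answer
            return step.thought
        observation = tools[step.tool](**step.arguments)  # invoke the chosen tool
        history.append({"role": "tool", "content": observation})
    return agent.finalize(history)               # force an answer if the step budget runs out
```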
Methodology
The authors propose a two-stage training paradigm: supervised finetuning (SFT) of a pretrained LLM on CoTT data, followed by RL that trains the model to dynamically select tools step by step for long-range reasoning. To support this, they construct a dedicated dataset, Ego-R1 Data, comprising Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL, which provides extensive CoTT reasoning traces and annotated QA instances.
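A compact sketch of this two-stage recipe under assumed trainer interfaces; the dataset names follow the paper, while `sft_trainer`, `rl_trainer`, the reward function, and its weighting are placeholders rather than the authors' released implementation:

```python
# Stage 1: supervised finetuning on Ego-CoTT-25K reasoning traces teaches the
# CoTT format and tool-call syntax; Stage 2: RL on Ego-QA-4.4K optimizes
# step-by-step tool selection. Trainer APIs and the reward are hypothetical.

def train_ego_realm(base_model, sft_trainer, rl_trainer):
    # Stage 1: SFT on CoTT reasoning traces.
    sft_model = sft_trainer.fit(
        model=base_model,
        dataset="Ego-CoTT-25K",
    )

    # Assumed outcome-based reward: answer correctness plus a small bonus
    # for well-formed tool calls.
    def reward(trajectory, gold_answer):
        correct = float(trajectory.final_answer == gold_answer)
        well_formed = float(trajectory.tool_calls_parse_cleanly())
        return correct + 0.1 * well_formed

    # Stage 2: RL on annotated QA instances.
    return rl_trainer.fit(
        model=sft_model,
        dataset="Ego-QA-4.4K",
        reward_fn=reward,
    )
```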
The framework employs three core perception modules designed to support temporal retrieval and detailed visual comprehension:
- Hierarchical Retrieval-Augmented Generation (H-RAG): extracts timestamped information in the language space to aid retrieval.
- Video-LLM: specialized for interpreting localized visual contexts.
- Vision-LLM (VLM): extracts fine-grained visual details for precise analysis.
An orchestrating LLM coordinates these tools, enabling scalable, step-by-step compositional reasoning over ultra-long videos.
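To make the orchestration concrete, here is a sketch of how the three modules might be exposed to the orchestrating LLM through a uniform tool registry; all function names and signatures are assumptions for illustration, not the framework's actual interfaces:

```python
# Each tool wraps a specialized model behind a text-in / text-out interface so
# the orchestrating LLM can compose them freely. Signatures are placeholders.

def h_rag(query: str, level: str = "day") -> str:
    """Retrieve timestamped textual evidence in the language space (assumed hierarchical levels)."""
    ...

def video_llm(start: str, end: str, question: str) -> str:
    """Interpret the localized visual context of the clip between two timestamps."""
    ...

def vlm(timestamp: str, question: str) -> str:
    """Extract fine-grained visual details around a specific moment."""
    ...

TOOLS = {"h_rag": h_rag, "video_llm": video_llm, "vlm": vlm}
```

The loop sketched earlier would receive this `TOOLS` dictionary and dispatch each step's tool call to the corresponding module.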
Evaluation
Ego-R1 is evaluated on the newly curated EgoR1QA benchmark, composed of QA pairs over week-long videos drawn from a hybrid of sources and verified by humans. Extensive results demonstrate that the tool-augmented chain-of-thought reasoning of Ego-ReaLM effectively handles the challenges of understanding ultra-long egocentric videos, extending time coverage from a few hours to a week.
Discussion
The introduction of Ego-R1 addresses critical limitations of existing long-video understanding frameworks, particularly scalability and computational cost. The paper's dynamic, tool-driven reasoning offers a significant advantage over traditional methods that rely on either lossy simplifications or predefined reasoning pipelines. The modular design also facilitates easy integration with state-of-the-art visual understanding models, potentially enhancing adaptability and robustness across applications.
Future Prospects
The implications of Ego-R1 extend beyond egocentric video analysis to potential advancements in AI-assisted memory recall, activity tracking, and goal monitoring. The integration of structured, tool-driven reasoning systems introduces promising avenues for research in AI's capability to interpret and reason about complex, temporally-extensive datasets.
Moreover, future developments in AI might focus on further optimizing the dynamic tool selection process, potentially exploring hybrid models that combine symbolic and neural architectures for enhanced performance in open-domain video reasoning tasks.
Overall, Ego-R1 sets a benchmark for ultra-long video understanding frameworks, showcasing the effectiveness of chain-of-tool-thought reasoning in overcoming the inherent challenges posed by lengthy egocentric video content.