
Agentic Very Long Video Understanding

Published 26 Jan 2026 in cs.CV and cs.LG | (2601.18157v1)

Abstract: The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including LLMs and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.

Summary

  • The paper introduces EGAgent, a framework that employs a temporally annotated entity scene graph to enable multi-hop reasoning over long egocentric videos.
  • EGAgent integrates specialized tools for hybrid visual and transcript search, achieving state-of-the-art results on benchmarks like EgoLifeQA and Video-MME.
  • By combining LLM-based reasoning with structured retrieval, the method offers scalable, precise long-horizon video comprehension with applications in personal AI and behavioral analytics.

Agentic Very Long Video Understanding: Technical Summary

Motivation and Problem Scope

The proliferation of wearable devices that continuously record user experiences demands robust methods for longitudinal video understanding—processing and reasoning over multi-day or week-long egocentric video streams. Classical video analysis frameworks, including Multimodal LLMs (MLLMs), standard Retrieval-Augmented Generation (RAG), and prior agentic systems, are constrained by short context windows, poor compositional reasoning capabilities, and limited cross-modal integration. These constraints become acute for tasks requiring entity-centric temporal tracking, multi-hop reasoning, and integration of audio, visual, and relational data over extended time horizons.

EGAgent Framework: Entity-Centric, Agentic Video Reasoning

The proposed EGAgent framework introduces a scalable, entity-centric mechanism for long-horizon video comprehension, centered on a temporally-annotated entity scene graph representation. Nodes denote entities (persons, objects, locations), and edges capture relations (talks-to, interacts-with, mentions, uses), each with explicit temporal intervals. This graph provides both an efficient index and a structured substrate for complex queries on social, behavioral, and spatiotemporal patterns.
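As a concrete illustration of the representation described above, a temporally-annotated entity scene graph can be sketched as follows. The class and field names are assumptions for exposition, not the paper's exact data model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    etype: str  # e.g. "person", "object", "location"


@dataclass(frozen=True)
class Edge:
    source: Entity
    target: Entity
    relation: str     # e.g. "talks-to", "interacts-with", "uses"
    start_s: float    # temporal interval start (seconds into the stream)
    end_s: float      # temporal interval end
    support: str = "" # text evidence backing the relation


class EntitySceneGraph:
    """A minimal in-memory edge store with constraint-based lookup."""

    def __init__(self):
        self.edges: list[Edge] = []

    def add(self, edge: Edge) -> None:
        self.edges.append(edge)

    def query(self, relation=None, source_name=None, t0=None, t1=None):
        """Return edges matching the given constraints (None = wildcard)."""
        out = []
        for e in self.edges:
            if relation is not None and e.relation != relation:
                continue
            if source_name is not None and e.source.name != source_name:
                continue
            if t0 is not None and e.end_s < t0:    # edge ends before window
                continue
            if t1 is not None and e.start_s > t1:  # edge starts after window
                continue
            out.append(e)
        return out


# Example: Alice talks to Bob between t=120s and t=180s.
alice, bob = Entity("Alice", "person"), Entity("Bob", "person")
g = EntitySceneGraph()
g.add(Edge(alice, bob, "talks-to", 120.0, 180.0, "transcript snippet"))
print(len(g.query(relation="talks-to", t0=100.0, t1=200.0)))  # 1
```

The key property this sketch preserves is that every relation carries an explicit temporal interval, which is what makes the graph usable as an index for time-constrained queries.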

EGAgent's architecture consists of:

  • Planning Agent: Decomposes complex queries into sub-tasks and assigns each to a specialized tool.
  • Retrievers: Support hybrid semantic/attribute visual search, transcript search (LLM or BM25), and entity graph search via progressive SQL-based constraint relaxation.
  • Analyzer Tool: Employs LLM-based reasoning and evidence distillation for each retrieval.
  • VQA Agent: Synthesizes cross-modal evidence from working memory to generate coherent answers.

This modular design enables agentic decomposition, cross-modal fusion, and compositional multi-step reasoning without context window collapse.
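The control flow across these modules can be sketched as a toy dispatch loop. Here the planner and tools are stubs standing in for LLM-driven components; the tool names, routing heuristic, and string-concatenation "synthesis" are illustrative assumptions only:

```python
# Toy agentic loop: decompose a query into tool calls, accumulate
# distilled evidence in working memory, then synthesize an answer.

def visual_search(query: str) -> str:
    # Stand-in for hybrid semantic/attribute visual retrieval.
    return f"[visual evidence for '{query}']"

def graph_search(query: str) -> str:
    # Stand-in for entity-graph retrieval.
    return f"[graph evidence for '{query}']"

TOOLS = {"visual": visual_search, "graph": graph_search}

def plan(query: str) -> list[tuple[str, str]]:
    # Stand-in for LLM decomposition: route relational sub-questions
    # ("who ...") to the graph, everything else to visual search.
    subs = [s.strip() for s in query.split(";") if s.strip()]
    return [("graph" if "who" in s.lower() else "visual", s) for s in subs]

def answer(query: str) -> str:
    memory = []  # working memory of per-retrieval evidence
    for tool_name, sub_query in plan(query):
        memory.append(TOOLS[tool_name](sub_query))
    # Stand-in for the VQA agent: fuse all collected evidence.
    return " + ".join(memory)

print(answer("who did I cook with; what was on the table"))
```

The point of the sketch is structural: each sub-task touches only a bounded slice of evidence, so the overall query never has to fit the full video context into a single model call.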

Entity Graph Construction and Temporal Annotation

Entity graph extraction leverages LLMs over scene descriptions, predicted locations, and audio transcripts to jointly detect entities and annotate relationships with temporality. Edges are stored as tuples: source, source type, target, target type, relation, start time, end time, text support. The system supports incremental updates with new data, ensuring scalability to streaming inputs.

Temporal annotation uses transcript timestamps when available; otherwise, scene intervals are used, allowing both fine-grained and coarse localization. The graph is persisted as a SQLite3 database for low-latency querying.
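A hedged sketch of what the edge table and progressive constraint relaxation could look like in SQLite. The schema follows the tuple format described above, but the column names and the relaxation order (time window first, then relation) are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edges (
        source TEXT, source_type TEXT,
        target TEXT, target_type TEXT,
        relation TEXT,
        start_s REAL, end_s REAL,
        support TEXT
    )
""")
conn.execute(
    "INSERT INTO edges VALUES (?,?,?,?,?,?,?,?)",
    ("Alice", "person", "stove", "object", "uses", 300.0, 360.0,
     "scene caption: Alice at the stove"),
)

def progressive_search(source, relation, t0, t1):
    """Apply all constraints first, then drop the least important
    ones until something matches (empty result => relax and retry)."""
    constraint_sets = [
        ("source=? AND relation=? AND end_s>=? AND start_s<=?",
         (source, relation, t0, t1)),
        ("source=? AND relation=?", (source, relation)),  # relax time window
        ("source=?", (source,)),                          # relax relation
    ]
    for where, params in constraint_sets:
        rows = conn.execute(
            f"SELECT * FROM edges WHERE {where}", params
        ).fetchall()
        if rows:
            return rows
    return []

# The time window misses the event, so the time constraint is relaxed
# and the edge is still found.
print(len(progressive_search("Alice", "uses", 0.0, 100.0)))  # 1
```

Relaxing constraints in a fixed priority order is one simple way to trade precision for recall when an over-constrained query returns nothing; the paper's actual relaxation policy may differ.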

Experimental Validation

EgoLifeQA (50+ hour egocentric video, 500 MCQs)

EGAgent establishes state-of-the-art performance, reaching 57.5% average MCQ accuracy on EgoLifeQA, markedly surpassing prior baselines (Gemini 2.5 Pro, EgoButler). Gains are pronounced in categories demanding multi-hop relational reasoning (RelationMap: +32%, TaskMaster: +39.7%), validating the importance of entity-centric, temporally-structured search. Notably, integrating the entity graph into agentic planning yields significant improvements across different LLM backbones, demonstrating generality.

Video-MME (Long)

On Video-MME's long subset, EGAgent (with a Gemini 2.5 Pro backbone) achieves a competitive 74.1% accuracy, outperforming RAG and matching adaptive graph-augmented agents, despite requiring an order of magnitude fewer processed frames. Scaling with video length shows diminishing returns for naive uniform frame sampling, further highlighting the need for structured retrieval mechanisms.

Ablation Analyses

Entity graph extraction using transcript-fused visual captions increases MCQ accuracy by ~2.6% over transcript-only variants. LLM-based transcript search outperforms BM25 by 6.8% absolute accuracy but incurs higher token usage and latency. Oracle experiments reveal substantial headroom in temporal localization, with oracle search reaching an upper-bound accuracy of 68.7%. Tool ablations confirm that cross-modal retrieval and entity graph search are critical for peak performance.

Implications and Theoretical Insights

Agentic frameworks enriched with entity-centric relational graphs and multi-modal reasoning tools resolve fundamental limitations in current LLM video understanding approaches: scalability with context length, compositional reasoning, and precise temporal localization. Structured entity graphs provide a substrate for multi-hop analytics, habit tracking, and inter-agent/scene interaction modeling, opening avenues for persistent memory and personalized assistive agents.

Practically, such methods can be applied to life-logging, personal assistant augmentation, behavioral analytics, and privacy-preserving compliance checking (with appropriate safeguards for entity extraction and data handling).

Theoretically, entity-centric temporal graphs facilitate new research on graph-based query planning, cross-modal traversal algorithms, incremental evidence accumulation, and integration with emerging streaming LLM architectures. The demonstration of strong gains across tasks requiring detailed relational reasoning suggests future work may focus on hierarchical planning agents, graph augmentation with open-vocabulary relations, and low-latency, self-improving graph construction.

Conclusion

The EGAgent framework advances longitudinal video understanding by integrating structured entity scene graphs, robust cross-modal retrieval, and modular agentic planning. It delivers marked improvements in tasks requiring multi-hop, temporally-expansive reasoning over egocentric video streams, and serves as an extensible blueprint for the development of persistent personal AI agents. The implications extend from practical deployment scenarios to foundational theory in agentic and relational multimodal perception.
