Summary of "MM-Ego: Towards Building Egocentric Multimodal LLMs"
This paper presents MM-Ego, a multimodal LLM (MLLM) designed for understanding egocentric video content. It targets the unique challenges of egocentric videos, which are recorded from a first-person perspective and often capture long, dynamic sequences of human activity. The authors address gaps in existing data, model design, and benchmarking for this specialized domain by introducing new methodologies and resources.
Contributions
- Data Engine for Egocentric QA Generation: The paper describes a novel data engine capable of automatically generating a large-scale dataset of 7 million question-answer (QA) samples derived from human-annotated egocentric video narrations. This is the largest dataset of its kind and provides essential training material for developing models with strong egocentric video understanding capabilities (a sketch of such a narration-to-QA pipeline follows this list).
- Benchmark Creation: The authors introduce EgoMemoria, a challenging benchmark for evaluating how well models understand and remember visual details in egocentric videos. The benchmark comprises over 7,000 questions across 629 videos, with videos up to an hour in length.
- Model Architecture: The authors propose a specialized multimodal architecture built around a "Memory Pointer Prompting" mechanism, which lets the model efficiently identify and process key visual details in long videos. The mechanism works in two steps (a schematic sketch follows this list):
  - Global Glimpse: extracts a coarse, overarching understanding of the full video and identifies the frames most relevant to the question.
  - Fallback: revisits those question-relevant frames in detail to gather the visual evidence needed to answer.
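To make the data engine contribution concrete, here is a minimal sketch of what a narration-to-QA pipeline can look like; the `Narration` structure, the prompt wording, and the `llm_generate` callable are illustrative assumptions rather than the paper's actual implementation.

```python
# Illustrative sketch of a narration-to-QA data engine (assumed design,
# not the paper's exact pipeline): timestamped human narrations for a clip
# are packed into a prompt, and an LLM is asked to turn them into QA pairs.
from dataclasses import dataclass

@dataclass
class Narration:
    timestamp: float   # seconds into the video
    text: str          # e.g. "C picks up the kettle"

def build_prompt(narrations: list[Narration]) -> str:
    """Pack a clip's narrations into a QA-generation prompt."""
    lines = [f"[{n.timestamp:.0f}s] {n.text}" for n in narrations]
    return (
        "Below are narrations of an egocentric video clip.\n"
        + "\n".join(lines)
        + "\nWrite question-answer pairs that test memory of the visual "
          "details described above. Return one 'Q: ... A: ...' pair per line."
    )

def narrations_to_qa(narrations: list[Narration], llm_generate) -> list[tuple[str, str]]:
    """Call any text-generation callable and parse the returned QA pairs."""
    raw = llm_generate(build_prompt(narrations))
    pairs = []
    for line in raw.splitlines():
        if "Q:" in line and "A:" in line:
            q, a = line.split("A:", 1)
            pairs.append((q.replace("Q:", "").strip(), a.strip()))
    return pairs
```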
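And here is a schematic sketch of the two-step Memory Pointer Prompting idea, written under simplifying assumptions: the precomputed frame features, the dot-product scoring, and the `top_k` frame selection are stand-ins chosen for illustration, not the paper's exact mechanism.

```python
# Schematic "global glimpse then fallback" flow (an assumed simplification
# of Memory Pointer Prompting, not the paper's exact model).
import torch

def answer_long_video(frame_feats: torch.Tensor,    # (num_frames, dim) precomputed frame features
                      question_feat: torch.Tensor,  # (dim,) encoded question
                      mllm,                         # any callable MLLM stub
                      top_k: int = 32) -> str:
    # Step 1 (global glimpse): cheaply score every frame against the question
    # and keep pointers to the most relevant ones.
    scores = frame_feats @ question_feat             # (num_frames,)
    pointers = torch.topk(scores, k=min(top_k, frame_feats.size(0))).indices
    pointers, _ = torch.sort(pointers)               # keep temporal order

    # Step 2 (fallback): feed only the pointed-to frames, in full detail,
    # back into the MLLM together with the question to produce the answer.
    key_frames = frame_feats[pointers]
    return mllm(key_frames, question_feat)
```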
Numerical Results
The MM-Ego model demonstrates substantial improvements in egocentric video understanding. On the EgoMemoria benchmark it achieves a Mean Debiased Accuracy (MDA) of 61.27, significantly outperforming baselines such as LLaVA-OneVision (LLaVA-OV), which indicates that the model can accurately recall and reason over lengthy egocentric footage.
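The paper defines the exact MDA computation; purely as a hedged illustration of what a "debiased" multiple-choice accuracy can mean, the sketch below averages per-group accuracy over groups defined by the correct option's position, so that answer-position bias does not inflate the score. This grouping scheme is an assumption made for concreteness, not the paper's formula.

```python
# Hedged illustration of a debiased multiple-choice accuracy: rather than
# overall accuracy, compute accuracy separately per correct-answer position
# and average the per-group scores, so position bias cancels out.
# (Assumed example; the paper defines its own MDA metric.)
from collections import defaultdict

def mean_debiased_accuracy(predictions: list[str], answers: list[str]) -> float:
    per_group = defaultdict(lambda: [0, 0])   # answer letter -> [num_correct, num_total]
    for pred, ans in zip(predictions, answers):
        per_group[ans][0] += int(pred == ans)
        per_group[ans][1] += 1
    group_accs = [c / t for c, t in per_group.values() if t > 0]
    return 100.0 * sum(group_accs) / len(group_accs)

# e.g. mean_debiased_accuracy(["A", "B", "B", "C"], ["A", "B", "C", "C"]) -> ~83.3
```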
Implications
The introduction of MM-Ego and its associated training resources represents a step forward in multimodal AI, particularly for applications involving augmented and virtual reality, wearable devices, and autonomous systems. The research underscores the importance of specialized data and model architectures in addressing the nuanced challenges posed by egocentric perspectives.
Future Directions
The paper suggests potential enhancements in data diversity and model capacity to extend MM-Ego's effectiveness to even longer or continuous egocentric video streams. Future work may involve integrating more sophisticated attention mechanisms or expanding the range of tested real-world scenarios.
In conclusion, this work lays a robust foundation for advancing egocentric video understanding in AI, offering vital tools and methodologies for researchers in the field. The MM-Ego model, along with its novel data synthesis approach and rigorous evaluation benchmark, is posited as a cornerstone for ongoing developments in multimodal LLMs.