EgoLife: Towards Egocentric Life Assistant (2503.03803v1)

Published 5 Mar 2025 in cs.CV

Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.

Summary

  • The paper introduces EgoLife for creating an egocentric life assistant, releasing a 300-hour multimodal dataset and the EgoLifeQA benchmark for daily life QA tasks.
  • EgoButler is introduced as an integrated system using EgoGPT for omni-modal understanding and EgoRAG for retrieval-augmented long-context egocentric QA.
  • Evaluation shows EgoGPT excels in egocentric understanding, and EgoGPT+EgoRAG significantly improves ultra-long-context QA performance over 24+ hours compared to baselines.

The paper introduces EgoLife, a project designed to create an egocentric life assistant using AI-powered wearable glasses. To facilitate this, the authors conducted a comprehensive data collection study, resulting in the EgoLife Dataset, which contains 300 hours of multimodal daily life recordings from six participants living together for a week. The dataset includes egocentric, interpersonal, and multiview data, along with intensive annotations. The paper also introduces EgoLifeQA, a benchmark comprising long-context, life-oriented question-answering tasks designed to provide assistance in daily life.

To address the technical challenges of egocentric data, the authors introduce EgoButler, an integrated system composed of EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance in egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions.

Here's a breakdown of the key components and contributions:

  • EgoLife Dataset: A 300-hour dataset of egocentric, interpersonal, multiview, and multimodal daily life recordings. It includes synchronized third-person perspectives captured from 15 additional cameras and two mmWave devices.
  • EgoLifeQA Benchmark: A set of long-context question-answering tasks designed to evaluate the effectiveness of personalized AI assistance in practical, everyday scenarios.
  • EgoButler System: An integrated system comprising EgoGPT and EgoRAG, designed to address the challenges of long-context understanding, multimodal integration, and personalized assistance in egocentric AI.
    • EgoGPT: An omni-modal model fine-tuned on egocentric datasets for multimodal video understanding.
    • EgoRAG: A retrieval-augmented generation module for long-context question answering.

The data collection involved six participants living together for one week, recording their daily activities using Meta Aria glasses [engel2023project]. Fifteen additional cameras and two mmWave devices provided synchronized third-person perspective data. The dataset includes activities such as discussions, shopping, cooking, socializing, and entertainment.

The EgoLifeQA benchmark includes tasks that assess the ability of an AI assistant to:

  • Locate misplaced items.
  • Recall past events.
  • Track health habits.
  • Analyze social interactions.
  • Make timely recommendations.

The EgoButler system addresses key challenges:

  • Developing robust omni-modal models for egocentric contexts.
  • Achieving accurate recognition and tracking of individuals.
  • Enabling ultra-long-context question answering over extended temporal sequences.

The authors evaluated EgoGPT on egocentric benchmarks such as EgoSchema [mangalam2023egoschema], EgoPlan-Bench [chen2023egoplan], and EgoThink [cheng2024egothink]. EgoGPT builds on LLaVA-OneVision [li2024llava], is fine-tuned on egocentric data, and incorporates audio understanding capabilities.

EgoRAG enhances memory and query capabilities through a two-stage approach: memory bank construction and content retrieval with response generation. The memory bank $M$ is defined as:

$$M = \{(c_i, d_i, t_i)\}_{i=1}^{N}$$

where:

  • $c_i$ represents clip features.
  • $d_i$ represents textual descriptions.
  • $t_i$ represents timestamped summaries.
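
To make the structure concrete, each entry of the memory bank can be pictured as a small record holding these three fields. The Python sketch below is purely illustrative; the `MemoryEntry` class and its field names are assumptions for exposition, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MemoryEntry:
    """One clip in the memory bank: the tuple (c_i, d_i, t_i)."""
    clip_features: np.ndarray  # c_i: embedding of the egocentric video/audio clip
    description: str           # d_i: textual description produced for the clip
    timestamp: str             # t_i: timestamped summary key, e.g. "day2_14:05"


# The memory bank M is the collection of all N entries.
memory_bank: List[MemoryEntry] = []
```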

The relevance-based scoring function $s_i$ is:

$$s_i = \text{Similarity}(q, c_i) + \lambda\,\text{Similarity}(q, d_i)$$

where:

  • $q$ represents the question.
  • $\lambda$ balances visual and textual relevance.

The top-$k$ most relevant clips $R$ are selected as:

$$R = \text{TopK}\big(\{(c_i, d_i, s_i)\}_{i=1}^{N}\big)$$

The final response $r$ is generated using an LLM:

$$r = \text{EgoGPT/GPT}(q, R)$$
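
Putting these formulas together, retrieval and response generation can be sketched as follows. This is a minimal, hypothetical rendering that reuses the `MemoryEntry` sketch above: `cosine_sim` stands in for the paper's Similarity function, while `embed_text` and `generate` are placeholder callables for a text encoder and for EgoGPT/GPT-style answer generation.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, standing in for Similarity(., .)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve_and_answer(question_emb, question_text, memory_bank,
                        embed_text, generate, lam=0.5, k=5):
    """Score every memory entry, keep the top-k, and generate a response.

    `embed_text` maps a string to an embedding in the same space as the
    question/clip features; `generate` is an answer generator (e.g. a call
    to EgoGPT or GPT); `lam` is the lambda weight balancing visual and
    textual relevance.
    """
    # s_i = Similarity(q, c_i) + lambda * Similarity(q, d_i)
    scored = []
    for entry in memory_bank:
        s = (cosine_sim(question_emb, entry.clip_features)
             + lam * cosine_sim(question_emb, embed_text(entry.description)))
        scored.append((s, entry))

    # R = TopK({(c_i, d_i, s_i)}): keep the k highest-scoring clips.
    top_k = [e for _, e in sorted(scored, key=lambda x: x[0], reverse=True)[:k]]

    # r = EgoGPT/GPT(q, R): answer conditioned on the retrieved clips.
    context = "\n".join(f"[{e.timestamp}] {e.description}" for e in top_k)
    return generate(question_text, context)
```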

The paper highlights the impact of EgoRAG on long-context question answering. EgoGPT+EgoRAG achieved a score of 35.4 for queries spanning over 24 hours, outperforming both EgoGPT and Gemini-1.5-Pro [geminiteam2024geminifamilyhighlycapable]. The authors also present an ablation study on EgoGPT variants, showing that combining visual and audio inputs yields the best performance.

The authors acknowledge the limitations of EgoGPT, including incomplete speech understanding and challenges in identity recognition. Future improvements include enhancing speech comprehension, refining personalization strategies, and incorporating more advanced retrieval reasoning techniques.

In summary, the paper presents a comprehensive approach to egocentric AI, with contributions including a unique dataset, a challenging benchmark, and a novel system designed to address the complexities of long-term, personalized AI assistance.