
Ego-EXTRA: Egocentric Video-Language Dataset

Updated 18 December 2025
  • Ego-EXTRA is a large-scale egocentric video–language dataset capturing real-time expert–trainee dialogue during structured procedural tasks.
  • It integrates multimodal sensor data—including high-resolution video, eye tracking, and audio—to benchmark complex, context-rich visual question answering.
  • The dataset supports research in AR/VR training, intelligent tutoring, and remote assistance by simulating real-world procedural guidance.

Ego-EXTRA is a large-scale egocentric video–language dataset designed to facilitate research in expert–trainee assistance by capturing procedural task performance in realistic settings, mediated by high-fidelity, two-way natural-language dialogue. It targets the benchmarking and advancement of multimodal video-language assistants envisioned for real-time, expert-level procedural guidance in domains such as maintenance, cooking, assembly, and hands-on training (Ragusa et al., 15 Dec 2025).

1. Definition, Rationale, and Scope

Ego-EXTRA comprises approximately 50 hours of unscripted egocentric video (123 sessions) of naive participants (“trainees”) performing structured procedural tasks while receiving real-time, expert-level guidance from domain professionals (“experts”) via natural-language dialogue. Data were collected under a “Wizard of Oz” protocol: the expert remotely simulates a wearable AI assistant by accessing the trainee’s egocentric sensor suite and providing task feedback, suggestions, or answers in natural dialogue, while the trainee is led to believe the support is automated.

The dataset’s core objective is to provide a benchmark that accurately reflects the complexity and requirements of procedural video–language understanding and expert-level assistance. It captures naturalistic, multi-turn dialogue contextually grounded in a first-person visual stream, presenting challenges that contemporary multimodal large language models (MLLMs) do not yet solve robustly (Ragusa et al., 15 Dec 2025).

2. Data Collection Methodology

Participants and Scenarios

The corpus consists of 123 sessions recorded by 33 trainee volunteers with no prior task experience, guided by 4 professional experts covering the bike maintenance, bakery, kitchen, and assembly domains. The four scenarios encompass ten procedural tasks, such as bicycle brake pad replacement, assembling an IKEA chair, and cooking a spinach tart, with an average activity duration of approximately 24.4 minutes.

Multimodal Instrumentation

Trainees wore Meta ARIA smart glasses to record high-resolution egocentric video (RGB 1408×1408 at 15 FPS), SLAM and eye-tracking signals (30 FPS each), IMU, magnetometer, barometer, GPS, BLE/Wi-Fi, and per-frame hand keypoints. Audio was captured via a paired smartphone for both parties; the expert used a laptop, microphone/earbud, and a Tobii Pro Fusion Bar for continuous gaze tracking. Visual and gaze cues from both trainee and expert were spatiotemporally synchronized through a QR-code-based homography and a manual countdown protocol.
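
The QR-code-based synchronization can be illustrated with a minimal sketch (an assumption about implementation details, not the authors’ released code): a QR marker visible to both cameras yields four corner correspondences from which a homography between the trainee’s and expert’s views can be estimated, e.g. with OpenCV. File names and the gaze coordinates below are hypothetical.

```python
# Minimal sketch (not the released pipeline): estimate a homography between the
# trainee's egocentric view and the expert's view from a shared QR code.
import cv2
import numpy as np

detector = cv2.QRCodeDetector()

def qr_corners(frame_bgr):
    """Return the 4 QR corner points as a (4, 2) float32 array, or None."""
    found, points = detector.detect(frame_bgr)
    if not found or points is None:
        return None
    return points.reshape(-1, 2).astype(np.float32)

def estimate_homography(trainee_frame, expert_frame):
    """Map trainee-view pixel coordinates into the expert-view frame."""
    src, dst = qr_corners(trainee_frame), qr_corners(expert_frame)
    if src is None or dst is None:
        return None
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    return H

# Example: project a trainee gaze point into the expert's view.
trainee = cv2.imread("trainee_frame.png")   # hypothetical frame dumps
expert = cv2.imread("expert_frame.png")
if trainee is not None and expert is not None:
    H = estimate_homography(trainee, expert)
    if H is not None:
        gaze = np.array([[[700.0, 700.0]]], dtype=np.float32)  # pixel coordinates
        gaze_in_expert_view = cv2.perspectiveTransform(gaze, H)
```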

Dialogue and QA Capture

All verbal guidance, feedback, and trainee queries were recorded, time-stamped, and subsequently transcribed and translated (using Llama 3.1 with manual correction). The protocol enables data capture of both spontaneous trainee-initiated questions and proactive expert instructional turns. This yields a corpus of naturalistic, context-sensitive guidance directly responsive to real-time task progress.
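
As a rough illustration of the translation step (the exact prompts and tooling are not specified in the source, so the model checkpoint, prompt wording, and assumed source language below are illustrative assumptions), a dialogue turn could be drafted into English with an instruction-tuned Llama 3.1 model and then passed to annotators for manual correction:

```python
# Hedged sketch of LLM-based transcript translation prior to manual correction.
from transformers import pipeline

translator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint
)

def translate_turn(text, src_lang="Italian"):
    """Draft an English translation of one dialogue turn for later human review."""
    messages = [
        {"role": "system",
         "content": f"Translate the following {src_lang} dialogue turn into "
                    "English. Return only the translation."},
        {"role": "user", "content": text},
    ]
    out = translator(messages, max_new_tokens=256)
    return out[0]["generated_text"][-1]["content"]  # assistant reply

draft = translate_turn("Adesso svita il bullone del freno anteriore.")
```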

3. Dataset Structure, Annotations, and Statistics

Corpus Organization

Ego-EXTRA consists of:

  • $N_V = 123$ egocentric video sessions, with total duration $T_{total} = 50$ hours.
  • $N_{QA} \approx 15{,}000$ automatically extracted and manually validated multiple-choice visual question-answer (VQA) sets, directly sourced from the expert–trainee dialogue.

The dataset is released as a unified corpus; explicit train/validation/test splits are not provided, enabling flexible partitioning strategies such as cross-validation or held-out scenario evaluation.
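
Because no canonical splits ship with the corpus, one plausible protocol (an assumption, not an official benchmark split) is held-out-scenario cross-validation: train on three scenarios and test on the fourth, rotating over all four. The sketch below assumes a flat JSON list of QA records with a `scenario` field; the file path and scenario labels are hypothetical.

```python
# Hypothetical held-out-scenario cross-validation over the QA annotations.
import json
from collections import defaultdict

SCENARIOS = ["bike_maintenance", "bakery", "kitchen", "assembly"]  # assumed labels

def scenario_folds(qa_path):
    """Yield (held_out_name, train_records, test_records) for each fold."""
    with open(qa_path) as f:
        records = json.load(f)                     # assumed: flat list of dicts
    by_scenario = defaultdict(list)
    for rec in records:
        by_scenario[rec["scenario"]].append(rec)   # 'scenario' field assumed
    for held_out in SCENARIOS:
        test = by_scenario[held_out]
        train = [r for s, recs in by_scenario.items() if s != held_out for r in recs]
        yield held_out, train, test

for name, train, test in scenario_folds("ego_extra_vqa.json"):  # hypothetical path
    print(f"fold={name}: {len(train)} train / {len(test)} test QA items")
```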

Annotation Schema

Each session includes:

  • Fully time-stamped dialogue transcripts with speaker identification.
  • VQA annotations formalized as JSON records: question text, five answer options, the ground-truth answer, timestamps, the corresponding dialogue turns, scenario/activity labels, protocol metadata (procedure analysis or on-demand), and participant IDs.

Accompanying metadata includes scenario and procedural labels, speaker word counts and turn numbers, gaze overlays, and raw sensor streams.
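
A minimal sketch of what such a record might look like in code is given below; the exact key names and types are assumptions inferred from the fields listed above, not the released JSON specification.

```python
# Illustrative VQA record structure; field names are assumptions, not the official schema.
from dataclasses import dataclass

@dataclass
class VQARecord:
    question: str
    options: list[str]          # five answer options
    answer_idx: int             # index of the ground-truth option
    start_s: float              # clip timestamps within the session video
    end_s: float
    dialogue_turns: list[int]   # indices of the source dialogue turns
    scenario: str               # e.g. "bike_maintenance"
    activity: str               # e.g. "brake_pad_replacement"
    protocol: str               # "procedure_analysis" or "on_demand"
    trainee_id: str
    expert_id: str

example = VQARecord(
    question="Which tool does the expert ask the trainee to pick up?",
    options=["Allen key", "Torque wrench", "Screwdriver", "Pliers", "Hammer"],
    answer_idx=0,
    start_s=512.4, end_s=517.4,
    dialogue_turns=[42, 43],
    scenario="bike_maintenance", activity="brake_pad_replacement",
    protocol="on_demand", trainee_id="T07", expert_id="E01",
)
```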

| Statistic  | Ego-EXTRA        | Ego4D    | HoloAssist | EPIC-Kitchens |
|------------|------------------|----------|------------|---------------|
| Hours      | 50               | 288.7    | 49.8       | 25.3          |
| Scenarios  | 4                | multiple | 1 (lab)    | kitchens      |
| Expert Q&A | ✓ (no raw audio) | —        | —          | —             |
| Modalities | RGB, SLAM, …     | RGB, …   | RGB, depth | RGB           |
| QA sets    | ≈15,000          | —        | —          | —             |

Table: Dataset comparison adapted from (Ragusa et al., 15 Dec 2025), Tab. 1.

4. Benchmark Tasks and Baseline Evaluation

Primary Benchmark

The principal challenge is multiple-choice visual question answering (MC-VQA): models receive a 5-second egocentric video context and a 5-way multiple-choice question, and are required to select the correct answer.
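
A minimal evaluation loop for this setup might look as follows; `score_options` stands in for any candidate MLLM, and the record field names follow the assumed schema sketched in Section 3 rather than an official interface.

```python
# Sketch of MC-VQA accuracy computation; the model interface is a placeholder.
def evaluate(records, score_options):
    """records: dicts with 'video_id', 'start_s', 'end_s', 'question', 'options',
    and 'answer_idx'. score_options(clip, question, options) -> 5 scores (higher = better)."""
    correct = 0
    for rec in records:
        clip = (rec["video_id"], rec["start_s"], rec["end_s"])  # 5-second context
        scores = score_options(clip, rec["question"], rec["options"])
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == rec["answer_idx"])
    return correct / len(records)

# Chance level on 5-way questions is 20%; the reported human oracle reaches 89.7%.
```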

Baseline Performance

  • Language-only models (Llama 3.1-8B, 70B) achieve low accuracy (8.7% and 26.7%, respectively), demonstrating the difficulty of the task in the absence of visual input.
  • State-of-the-art MLLMs:
    • LLaVA-OneVision: 33.1%
    • Qwen 2.5-VL: 31.1%
    • LLaVA-Video: 28.6%
    • MiniGPT4-video: 10.7%
  • Human oracle performance: 89.7%

Performance is scenario-dependent: kitchen-based tasks are marginally easier for MLLMs compared to assembly and mechanical domains. Overall, current MLLMs exhibit significant challenges in spatial reasoning, temporal context integration, and expert-level procedural understanding (Ragusa et al., 15 Dec 2025).
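
The scenario-level comparison above amounts to grouping per-question correctness by scenario label; a short sketch (using the same assumed field names as earlier) is:

```python
# Per-scenario accuracy breakdown over predicted option indices (assumed field names).
from collections import defaultdict

def accuracy_by_scenario(records, predictions):
    """predictions: mapping from a record's 'id' to its predicted option index."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        s = rec["scenario"]
        totals[s] += 1
        hits[s] += int(predictions[rec["id"]] == rec["answer_idx"])
    return {s: hits[s] / totals[s] for s in totals}
```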

5. Applications, Extensions, and Comparative Context

Ego-EXTRA supports a range of research and application domains including:

  • Wearable video–language assistants for real-time procedural guidance in maintenance, repair, and culinary contexts.
  • Augmented and virtual reality (AR/VR) training systems simulating expert–trainee dialogue for skill acquisition and evaluation.
  • Intelligent tutoring and automated remote support, where systems trained on Ego-EXTRA can answer user queries during task execution or provide proactive assistance.
  • Future research avenues include mistake detection, step segmentation, procedural dialogue generation, and the development of assistants with longer-term memory and multi-turn reasoning.

Relative to datasets such as Ego4D, HoloAssist, and EPIC-Kitchens, Ego-EXTRA is distinguished by its realistic two-way expert–trainee dialogue, the tight coupling of egocentric video with context-rich natural-language QA, and the synchronized capture of full sensor and gaze streams from both trainee and expert.

6. Limitations and Ethical Considerations

Ego-EXTRA spans four scenarios and ten procedural tasks, representing a subset of possible real-world procedures. Expert dialogue seeding of QA pairs may introduce annotation biases tied to specific phrasing or instructional style. While participants consented to visual and transcript release, raw audio is withheld for privacy. Deployment of similar protocols in uncontrolled environments raises potential surveillance concerns; robust privacy and ethical safeguards are necessary. Further iterations should expand demographic diversity, explore additional domains, and investigate privacy-preserving sensing modalities (Ragusa et al., 15 Dec 2025).

Ego-EXTRA, including all videos, dialogue transcripts, QA annotations, and sensor streams, is publicly accessible to foster benchmarking of egocentric video-language assistants. The dataset and benchmark details, as well as protocols for extending annotation schemas or integrating new domains, can be accessed at https://fpv-iplab.github.io/Ego-EXTRA/.

For related large-scale egocentric and procedural datasets with complementary benchmarks (e.g., cross-view association, skill assessment), refer to works such as EgoExoLearn (Huang et al., 24 Mar 2024). A plausible implication is that joint utilization of these datasets may allow for benchmarking AI systems capable of both fine-grained procedural guidance and the learning of skills through observation across multiple modalities and viewpoints.
