EgoAVU-Instruct: Egocentric Audio-Visual Reasoning
- The paper introduces a scalable data engine that processes egocentric video and audio to create vast instruction-tuning datasets and high-fidelity evaluation benchmarks.
- It achieves up to +113.3% improvement on SSA metrics and reduces sound-source association errors from 45–60% to 21.1% through effective fine-tuning regimes such as LoRA.
- The work highlights challenges in audio-visual alignment while proposing future directions such as self-supervised pretraining and dynamic adapter strategies for extended video contexts.
Egocentric Audio-Visual Understanding (EgoAVU) refers to the joint computational reasoning over visual and auditory signals in first-person (egocentric) videos to extract rich, temporally and semantically aligned interpretations. EgoAVU explicitly addresses the challenges where traditional multimodal models—especially those designed for exocentric (third-person) footage—fail to robustly ground and associate sound and sight under the rapid viewpoint dynamics, partial observability, and persistent ambient audio characteristic of head-mounted camera footage. These challenges are of critical importance to embodied intelligence research, AR/VR applications, and the broader study of multimodal machine perception in unconstrained real-world settings (Seth et al., 5 Feb 2026).
1. Motivation and Problem Scope
Egocentric recordings, such as those in the Ego4D dataset, capture human actions alongside the corresponding environmental context, yielding rich visual (hand/scene/object) and audio (foreground actions, background ambience) modalities. Practical applications include robot-human collaboration, wearable AR assistants, and context-aware mixed-reality systems.
Egocentric videos pose unique challenges absent in exocentric data: rapid and frequent egomotion, constrained and shifting fields of view, persistent occlusion, and the presence of audio signals with sources both in- and out-of-view. Visual-only understanding degrades under these conditions, while audio alone lacks specific scene grounding. Robust egocentric audio-visual understanding is thus critical.
A major impediment for progress in this domain is the absence of large-scale, high-diversity, and label-rich egocentric audio-visual datasets, as well as robust evaluation benchmarks specifically probing joint reasoning. Most existing multimodal LLMs (MLLMs) are trained on exocentric video or static images and perform poorly on sound-source grounding, often hallucinating and demonstrating strong vision dominance (Seth et al., 5 Feb 2026).
2. EgoAVU Data Engine and Corpus Construction
EgoAVU introduces a scalable and automated data engine that processes raw egocentric video and audio alongside human narrations to yield vast, instruction-tuning datasets and high-fidelity evaluation benchmarks. The pipeline comprises several stages:
- Data Aggregation and Segment Expansion: Ego4D video clips and narrations are expanded temporally using an adaptive window whose length depends on the time to the next narration and the global mean inter-narration interval.
- Narration Enhancement: Uni-modal captioners (Qwen2.5-VL for images, Qwen2.5-Omni for video and audio) generate time-aligned captions for actions, objects, and sounds.
- Diversity Filtering: Segment narrations are concatenated into a token sequence $t_1, \dots, t_N$, and the moving-average type-token ratio (MATTR) over sliding windows of size $W$ is computed:

$$\mathrm{MATTR}(W) = \frac{1}{N - W + 1} \sum_{i=1}^{N-W+1} \frac{\lvert \{ t_i, \dots, t_{i+W-1} \} \rvert}{W}$$

A threshold on MATTR prunes low-diversity, repetitive clips, yielding approximately 9,900 diverse videos.
- Multimodal Context Graph (MCG): An LLM (LLaMA-70B) parses the enhanced narrations into a structured context graph encoding interacted/background objects, foreground/background sounds (linked to sources when possible), and their roles. This graph augments final audio-visual narrations generated by the LLM.
- QA Generation: Five tasks—three open-ended (Sound–Source Association, Audio–Visual Segment Narration, Audio–Visual Dense Narration) and two closed-ended (Temporal Reasoning, Audio–Visual Hallucination)—are constructed using prompt-based Q–A generation from the audio-visual narrations.
The resulting corpus comprises EgoAVU-Instruct (≈3M QAs across 9,000 videos, average duration ≈4 min) for training and EgoAVU-Bench (3,000 QAs across 900 videos, human-verified) for evaluation (Seth et al., 5 Feb 2026).
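The Multimodal Context Graph produced by the pipeline can be pictured as a simple structured record linking sounds to visual sources. The field names and schema below are illustrative assumptions; the paper's actual graph representation may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SoundEvent:
    description: str               # e.g. "crackling of oil"
    foreground: bool               # foreground action sound vs. ambience
    source: Optional[str] = None   # linked visual source, when identifiable

@dataclass
class MultimodalContextGraph:
    """Illustrative sketch of the MCG: interacted/background objects plus
    foreground/background sounds, with sounds linked to sources when possible."""
    interacted_objects: list = field(default_factory=list)
    background_objects: list = field(default_factory=list)
    sounds: list = field(default_factory=list)  # SoundEvent entries

    def grounded_sounds(self):
        """Sounds that could be linked to a visible source."""
        return [s for s in self.sounds if s.source is not None]
```

Separating grounded from ungrounded sounds in this way is what lets downstream QA generation probe sound-source association explicitly.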
3. Benchmarking and Fine-Tuning Regimes
EgoAVU-Bench enables granular assessment of egocentric audio-visual reasoning, exposing modality biases and blind spots of leading MLLMs. Evaluated tasks and metrics include:
- SSA, AVSN, AVDN: Judged with an LLM-as-Judge score (1–5), METEOR, and ROUGE-L.
- Temporal Reasoning (TR) and Audio–Visual Hallucination (AVH): Multiple-choice and yes/no accuracy.
Fine-tuning Qwen2.5-Omni (7B parameters) under full and LoRA adaptation regimes, with balanced task sampling, leads to pronounced performance improvements. Notably, LoRA matches full-model fine-tuning across all metrics. The model's gains on EgoAVU-Bench reach up to +113.3% (SSA). Closed-ended reasoning, especially for sound recognition, remains the most error-prone (accuracy: 20–35%) but improves substantially upon fine-tuning (by ~30%) (Seth et al., 5 Feb 2026).
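The appeal of LoRA here is parameter efficiency: only low-rank factors of each weight update are trained. The sketch below counts trainable parameters for a single projection matrix; the hidden size (3584) and rank (16) are illustrative assumptions, not values taken from the paper:

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters for full fine-tuning of a d_in x d_out
    weight matrix W versus a LoRA update W + (alpha / r) * B @ A, where
    A is (rank x d_in), B is (d_out x rank), and only A and B are trained."""
    full = d_in * d_out
    lora = rank * d_in + d_out * rank
    return full, lora, lora / full

# For an attention projection at an assumed 7B-scale hidden size of 3584,
# a rank-16 adapter trains well under 1% of that matrix's weights:
full, lora, ratio = lora_param_counts(3584, 3584, 16)
```

That two-orders-of-magnitude reduction in trainable weights is why LoRA matching full fine-tuning across all metrics, as reported above, is a practically significant result.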
Crucially, EgoAVU-driven fine-tuning transfers to other egocentric benchmarks: EgoTempo (+28.1% accuracy), EgoIllusion (+7.2%), and yields negligible regression or even slight gains for exocentric datasets (e.g., VideoMME, AVQA).
4. Error Analysis and Qualitative Insights
Empirical evaluation on EgoAVU-Bench and cross-benchmark transfer highlight several persistent issues:
- Vision-dominant bias: Baseline models consistently favor visual information, neglecting or misattributing audio cues.
- Sound-source grounding failures: Inability to reliably associate foreground sounds with their visual sources, often leading to hallucinated events or objects.
- Improvement post-finetuning: Sound-source association error rate drops markedly—from 45–60% in the baseline to 21.1% post-LoRA adaptation; the majority of residual errors pertain to sound description quality, not grounding.
- Qualitative improvement: Fine-tuned models provide temporally coherent, richly detailed audio-visual narrations (e.g., correctly attributing “crackling of oil” to a frying pan) and reduce nonsensical hallucinations (Seth et al., 5 Feb 2026).
5. Future Directions and Limitations
Despite significant progress, several limitations persist:
- Training noise: The quality of pseudo-labels is constrained by the outputs of the underlying LLMs and captioners; hallucinations remain.
- Audio-visual alignment: The pipeline does not yet leverage explicit joint contrastive losses for cross-modal embedding alignment.
- Long-form context representation: Temporal window lengths are limited (<6 minutes), restricting episodic and anticipation tasks.
Future research priorities include:
- Self-supervised audio-visual pretraining to improve fundamental cross-modal alignment.
- Extending the MCG to encode object affordances and spatial relations explicitly.
- Adoption of dynamic adapter strategies for robust handling of extended video contexts.
- Integration of closed-loop, active learning for continual data engine refinement (Seth et al., 5 Feb 2026).
6. Relation to Egocentric Audio-Visual Localization
EgoAVU builds on foundational research in egocentric audio-visual object localization (Huang et al., 2023), where the goal is to localize sounding objects on a per-frame basis. Key contributions from this line include geometry-aware temporal aggregation (GATA) and cascaded feature enhancement (CFE) modules to address egomotion and out-of-view signals. These approaches leverage contrastive, self-supervised objectives exploiting temporal audio-visual synchronization and have culminated in benchmark datasets such as the Epic Sounding Object dataset. EgoAVU extends this paradigm from spatial localization to high-level narrative and reasoning tasks over egocentric multi-modal streams (Huang et al., 2023, Seth et al., 5 Feb 2026).