EgoAVU: Egocentric Audio-Visual Engine
- EgoAVU is a modular data engine that transforms egocentric videos into richly annotated audio-visual datasets, addressing key modality biases.
- It uses automated pipelines for narration enhancement, lexical diversity filtering, and structured QA generation to enrich first-person video content.
- EgoAVU improves multimodal performance in both instruction-tuning and zero-shot tasks, overcoming limitations of exocentric training data.
EgoAVU is a scalable data engine and dataset suite designed to advance comprehensive egocentric (first-person) audio-visual understanding for embodied-intelligence applications. Addressing key bottlenecks in current multimodal LLMs (MLLMs), notably their vision-dominant bias and inability to ground auditory information in egocentric settings, EgoAVU provides automated pipelines for generating large-scale, high-fidelity audio-visual narrations, questions, and answers, as well as rigorously curated evaluation benchmarks. By enriching egocentric video content through multimodal context modeling, lexical diversity filtering, and joint narration generation, EgoAVU enables substantial improvements in MLLM performance on egocentric video reasoning tasks, in both instruction-tuning and zero-shot transfer scenarios (Seth et al., 5 Feb 2026).
1. Motivation and Limitations of Prior Approaches
Egocentric videos capture human activities from a first-person perspective and contain rapidly changing scenes, occlusions, and a limited field of view. Visual-only methods suffer under these conditions due to rapid head/body motion and objects frequently moving out of sight. Audio signals, however, offer persistent cues such as ambient noise and object interaction sounds, which often remain detectable even when their sources are not visually present.
Existing MLLMs, such as Qwen2.5-Omni and Video-LLaVA2, are primarily pre-trained on exocentric datasets (AVQA, OmniBench). These models exhibit several critical limitations:
- Poor sound-source grounding: inability to associate auditory cues with visual entities (e.g., matching a hissing sound to a kettle).
- Hallucination: generation of non-existent sounds or objects during inference.
- Vision-dominant modality bias: frequent neglect of audio input, or mismatch between audio and visual context.

Benchmarks such as EgoTempo, EgoIllusion, and EgoSchema largely test visual reasoning and ignore the integration of detailed auditory signals. The scarcity of large-scale, high-quality egocentric audio-visual datasets compounds these challenges, impeding progress in robust multimodal understanding (Seth et al., 5 Feb 2026).
2. Data Engine Architecture and Dataset Construction
EgoAVU comprises an end-to-end, modular data engine designed to transform raw egocentric video material (e.g., Ego4D) into large-scale instruction-tuning corpora and benchmarks. The pipeline consists of several stages:
- Data Collection and Segmentation: Human narrations from Ego4D are mapped to temporal windows around each narration timestamp, with each window extending to the next narration; adjacent windows are then merged to produce clips of 10–360 s duration.
- Narration Enhancement: Each segment is augmented by applying:
- Image captioning (Qwen2.5-VL) for center frames.
- Video captioning (Qwen2.5-Omni, visual channel only).
- Audio captioning (Qwen2.5-Omni, audio channel only). This provides time-aligned, uni-modal narrations for objects, actions, and sounds.
- Video Filtering for Lexical Diversity: Clips are filtered by computing the Moving-Average Type-Token Ratio (MATTR) over tokenized narrations. For a narration of N tokens and a sliding window of size W, MATTR is the type-token ratio averaged over all windows:

  MATTR = (1 / (N − W + 1)) · ∑_{i=1}^{N−W+1} |types(t_i, …, t_{i+W−1})| / W

  Videos scoring below a fixed threshold are discarded, reducing repetitive/static content and yielding roughly 9,900 diverse clips.
- Audio-Visual Narration Synthesis: Enhanced narrations are parsed into a structured Multimodal Context Graph (MCG) using LLaMA-70B, extracting interacted objects, background entities, foreground/background sounds (with source grounding), and then generating a unified audio-visual narration per clip.
- QA Generation: A taxonomy of five task types structures the Q&A pairs:
- Open-ended: Sound-Source Association (SSA), Audio-Visual Segment Narration (AVSN), Audio-Visual Dense Narration (AVDN).
- Closed-ended: Temporal Reasoning (TR; multiple-choice), Audio-Visual Hallucination (AVH; yes/no).

Task-specific prompts are used to generate QA pairs for each narration.
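The segmentation and lexical-diversity stages above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the merge policy, the tokenizer, the MATTR window size, and the diversity threshold are all assumptions.

```python
def merge_windows(timestamps, min_len=10.0, max_len=360.0):
    """Map each narration timestamp t_i to the window [t_i, t_{i+1})
    and merge adjacent windows until a clip reaches min_len seconds,
    without letting a merge grow past max_len seconds (illustrative policy)."""
    windows = [(timestamps[i], timestamps[i + 1])
               for i in range(len(timestamps) - 1)]
    clips, start, end = [], None, None
    for s, e in windows:
        if start is None:
            start, end = s, e
        elif e - start <= max_len:
            end = e          # extend the current clip
        else:
            clips.append((start, end))
            start, end = s, e
        if end - start >= min_len:
            clips.append((start, end))
            start = end = None
    if start is not None:
        clips.append((start, end))
    return clips

def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: the mean unique-token ratio
    over all sliding windows of a fixed size."""
    if len(tokens) < window:
        return len(set(tokens)) / max(len(tokens), 1)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Keep only lexically diverse clips (window size and threshold are illustrative).
narration = "the person picks up the kettle and pours water".split()
keep = mattr(narration, window=5) >= 0.8
```

In the full pipeline, the MATTR score would be computed over each clip's enhanced narrations, and clips failing the threshold would be dropped before narration synthesis.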
The resulting datasets are:
- EgoAVU-Instruct: 9,000 egocentric videos, ∼3M QAs, mean duration 4 min, spanning open and closed formats.
- EgoAVU-Bench: 900 distinct videos, 3K manually verified QAs, balanced task coverage, with 225 total hours of human annotation (Seth et al., 5 Feb 2026).
3. Model Training, Fine-Tuning, and Evaluation
Fine-tuning and evaluation on EgoAVU use Qwen2.5-Omni (7B parameters), under both LoRA adaptation and full-model training. Training runs on 64× NVIDIA H100 GPUs for 5 epochs with a fixed learning rate, 300 uniformly sampled frames per video at 1 FPS, a fixed spatial resolution, and balanced task sampling.
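The per-video frame budget can be illustrated with a small sampling helper. This is a sketch; the function name and the uniform-subsampling fallback for long clips are assumptions, not the paper's code.

```python
def sample_frame_indices(duration_s, fps=1.0, max_frames=300):
    """Sample frames at `fps`; if the clip would yield more than
    `max_frames` frames, fall back to uniformly spaced indices so
    the 300-frame budget per video is respected."""
    total = int(duration_s * fps)
    if total <= max_frames:
        return list(range(total))
    step = total / max_frames
    return [int(i * step) for i in range(max_frames)]

short = sample_frame_indices(duration_s=240)  # 4-minute mean clip: 240 indices
long_ = sample_frame_indices(duration_s=600)  # 10-minute clip: capped at 300
```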
Evaluation is conducted across all five QA task types using the “Bench” test split. Metrics are LLM-as-Judge score (1–5), METEOR, and ROUGE-L for open-ended tasks, and accuracy for closed-ended tasks; in the table below, S/M/R denote these three open-ended metrics, respectively. Representative results are summarized below:
| Task | Open-Source Best | EgoAVU (LoRA) | Relative Improvement (%) |
|---|---|---|---|
| SSA (S) | 1.50 | 3.15 | +113.3 |
| AVDN (S/M/R) | 2.37/10.69/14.74 | 2.60/12.20/17.19 | +12.2/+16.9/+17.2 |
| AVSN (S/M/R) | 1.99/9.99/13.39 | 2.45/22.53/28.34 | +27.6/+86.5/+69.8 |
| TR (Accuracy) | 53.20% | 64.31% | +27.2 |
| AVH (Accuracy) | 42.69% | 61.69% | +30.8 |
LoRA adaptation achieves near-parity with full fine-tuning. These gains transfer to related benchmarks: improvements of +28.1% on EgoTempo, +7.2% on EgoIllusion, minimal change for EgoSchema, and stable performance on exocentric datasets (VideoMME, AVQA) (Seth et al., 5 Feb 2026).
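For the closed-ended tasks (TR multiple-choice and AVH yes/no), accuracy reduces to normalized exact matching of the predicted option. A minimal sketch, where the normalization rules (option-letter extraction, punctuation stripping) are assumptions:

```python
import re

def normalize(answer):
    """Lowercase, strip punctuation/whitespace, and keep only the
    leading option letter for multiple-choice answers like '(B) ...'."""
    answer = answer.strip().lower()
    m = re.match(r"^\(?([a-d])\)?\b", answer)
    if m:
        return m.group(1)
    return re.sub(r"[^a-z]", "", answer)  # e.g. 'Yes.' -> 'yes'

def accuracy(predictions, references):
    matches = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return matches / len(references)

acc = accuracy(["(B) the kettle", "Yes.", "no"], ["b", "yes", "Yes"])
```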
4. Error Patterns, Biases, and Qualitative Insights
Analysis of model predictions reveals modality biases and persistent challenges:
- On closed-ended tasks, baseline sound recognition accuracy is 20–35%, lower than objects (40–65%) and actions (20–60%); post-fine-tuning, sound-related performance rises by ∼30%, actions by 15–20%, and objects by 10%.
- In open-ended SSA, baseline error rates are 45–60% (the majority due to incorrect or missing sound descriptions rather than source attribution); fine-tuned EgoAVU reduces the error rate to 21.1%.
Qualitative inspection shows that prior models (e.g., VideoLLaMA2, baseline Qwen2.5-Omni) often hallucinate non-existent objects/sounds or neglect audio streams, while EgoAVU-enhanced models exhibit correct sound-source linking (e.g., attributing “crackling oil” to the frying pan) and temporally coherent dense narrations (Seth et al., 5 Feb 2026).
5. Contributions, Limitations, and Future Directions
EgoAVU’s core contributions include:
- A scalable, modular data engine for transforming egocentric video into richly annotated, diversified multi-task audio-visual datasets.
- EgoAVU-Instruct: the first large-scale corpus (∼3M QAs) for instruction-tuning MLLMs on egocentric audio-visual understanding.
- EgoAVU-Bench: a rigorous benchmark specifically targeting joint reasoning over vision and sound in egocentric contexts.
- Empirical findings on the vision bias and sound-grounding failure modes of baseline models, and demonstration that instruction-tuning on high-quality egocentric data narrows these modality gaps.
Outstanding limitations include training data noise (related to hallucinations and imperfect LLM-generated narrations), lack of explicit audio-visual embedding alignment (e.g., via joint contrastive losses), and no dynamic adaptation mechanisms for long-form video. Suggested future work includes: integrating self-supervised pretraining for cross-modal alignment, extending the Multimodal Context Graph to affordance and spatial reasoning, developing adaptive strategies for longer video contexts, and incorporating closed-loop active learning for ongoing pipeline refinement (Seth et al., 5 Feb 2026).
A plausible implication is that EgoAVU will enable the next generation of embodied intelligence systems, providing the data and evaluation resources required for robust, real-world joint audio-visual reasoning at scale.