EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition (1908.08498v1)

Published 22 Aug 2019 in cs.CV

Abstract: We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late-fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset: EPIC-Kitchens, on all metrics using the public leaderboard.

Authors (4)
  1. Evangelos Kazakos (13 papers)
  2. Arsha Nagrani (62 papers)
  3. Andrew Zisserman (248 papers)
  4. Dima Damen (83 papers)
Citations (324)

Summary

Overview of "EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition"

The paper "EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition" presents an innovative architecture known as the Temporal Binding Network (TBN), designed for improving performance in egocentric action recognition through the fusion of multi-modal data. This paper addresses the challenge of combining different sensory modalities, specifically RGB, optical flow, and audio, to augment the understanding of actions performed from a first-person perspective, which is particularly complicated by asynchronies across streams.

Architecture and Methodology

The core contribution of this research is an end-to-end trainable architecture built around the concept of a Temporal Binding Window (TBW). The TBW allows asynchronous inputs from the different modalities to be fused within a pre-defined temporal range, modeling the temporal dynamics inherent in action sequences. Unlike traditional approaches that rely on late fusion, TBN performs mid-level feature fusion before temporal aggregation.
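To make the fusion order concrete, the sketch below expresses the idea of fusing modality features per temporal segment with a single set of shared weights, then aggregating over segments. It is a minimal illustration under assumed feature dimensions, an averaging aggregator, and an arbitrary class count; it is not the authors' released implementation.

```python
# Minimal sketch of mid-level fusion before temporal aggregation (illustrative
# assumptions: feature dimensions, averaging aggregator, class count).
import torch
import torch.nn as nn

class MidLevelFusion(nn.Module):
    def __init__(self, rgb_dim=1024, flow_dim=1024, audio_dim=512,
                 fused_dim=512, num_classes=125):
        super().__init__()
        # One fusion layer whose weights are shared across all temporal segments.
        self.fuse = nn.Sequential(
            nn.Linear(rgb_dim + flow_dim + audio_dim, fused_dim),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, rgb, flow, audio):
        # Each input: (batch, num_segments, feat_dim), already encoded per modality.
        x = torch.cat([rgb, flow, audio], dim=-1)  # fuse modalities within each segment
        x = self.fuse(x)                           # shared modality/fusion weights over time
        x = x.mean(dim=1)                          # temporal aggregation happens after fusion
        return self.classifier(x)
```

Because concatenation and fusion happen per segment, cross-modal interactions are modeled before any temporal pooling, which is the key difference from late fusion of per-modality predictions.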

The method is grounded in biological plausibility, drawing on the neuroscience concept of temporal binding in human perception. The network accommodates the natural time delays and offsets between multi-modal signals while sharing modality and fusion weights across time steps, thereby capturing both synchronous and asynchronous interactions between the sensory inputs.
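The asynchrony itself can be pictured as each modality drawing its own sample within a binding window around a segment center. The following sketch is only illustrative; the window size, segment count, and uniform sampling are assumptions rather than the paper's exact procedure.

```python
# Illustrative sampling within a temporal binding window: each modality picks
# its own frame index near a segment center, so fused inputs need not be
# perfectly synchronized.
import random

def sample_binding_window(num_frames, num_segments=3, window=30):
    """Return one frame index per segment for a single modality."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        center = int((s + 0.5) * seg_len)
        lo = max(0, center - window)
        hi = min(num_frames - 1, center + window)
        indices.append(random.randint(lo, hi))
    return indices

# Each modality draws its own offsets, so the RGB, flow, and audio samples for
# the same segment may be temporally offset from one another.
rgb_idx = sample_binding_window(150)
flow_idx = sample_binding_window(150)
audio_idx = sample_binding_window(150)
```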

Key Results

The proposed TBN architecture is evaluated on EPIC-Kitchens, the largest egocentric video dataset available at the time. TBN achieves superior performance on both the seen and unseen test splits, surpassing previous benchmarks on the action recognition tasks. Notably, incorporating audio as a modality yields significant performance gains, indicating its value alongside visual data for recognizing egocentric actions.

The paper quantifies the robustness of TBN to background noise and irrelevant sounds, a prevalent challenge in egocentric recordings captured via head-mounted devices. The architecture exhibits resilience to such perturbations, reaffirming its practical applicability in real-world settings.

Implications and Future Directions

The proposed TBN architecture sets a new standard for multi-modal fusion techniques in egocentric action recognition. The ability to harness asynchronous signals effectively broadens the scope of applications across diverse domains, including sports analysis, health monitoring, and augmented reality.

Future research directions could explore adaptive mechanisms for optimizing the TBW according to varying task characteristics and modality-specific temporal dynamics. Moreover, the integration of additional sensor modalities such as depth data or haptic feedback may further enhance the model's interpretative capabilities.

Overall, the development of TBN represents a step towards more holistic, integrated multi-sensory representations, contributing both to the theoretical understanding of multi-modal learning and to its practical deployment in a range of AI applications.