EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

Published 22 Aug 2019 in cs.CV (arXiv:1908.08498v1)

Abstract: We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late-fusion of modalities. We demonstrate the importance of audio in egocentric vision, on per-class basis, for identifying actions as well as interacting objects. Our method achieves state of the art results on both the seen and unseen test sets of the largest egocentric dataset: EPIC-Kitchens, on all metrics using the public leaderboard.

Citations (324)

Summary

  • The paper introduces the Temporal Binding Network (TBN) that integrates asynchronous RGB, optical flow, and audio data using a Temporal Binding Window (TBW).
  • The TBN outperforms previous models on the EPIC-Kitchens dataset by explicitly handling asynchrony between modalities during fusion.
  • The study demonstrates TBN's resilience to background noise and discusses future improvements with additional sensor modalities.

Overview of "EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition"

The paper "EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition" presents an innovative architecture known as the Temporal Binding Network (TBN), designed for improving performance in egocentric action recognition through the fusion of multi-modal data. This study addresses the challenge of combining different sensory modalities, specifically RGB, optical flow, and audio, to augment the understanding of actions performed from a first-person perspective, which is particularly complicated by asynchronies across streams.

Architecture and Methodology

The core contribution of this research is an end-to-end trainable architecture built around a Temporal Binding Window (TBW). The TBW allows asynchronous inputs from each modality to be fused within a pre-defined temporal range, modeling the temporal dynamics inherent in action sequences. Unlike traditional approaches that rely on late fusion, TBN applies mid-level feature fusion before temporal aggregation.
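For concreteness, the sketch below illustrates this fusion ordering in PyTorch-style Python. It is a minimal sketch under stated assumptions, not the authors' released implementation: the names `TBNFusion`, `feat_dim`, and the hidden size of 512 are illustrative, and each modality backbone is assumed to emit one feature vector per sampled snippet.

```python
import torch
import torch.nn as nn

class TBNFusion(nn.Module):
    """Illustrative sketch: mid-level fusion before temporal aggregation."""

    def __init__(self, feat_dim, num_classes, num_modalities=3):
        super().__init__()
        # A single fusion layer reused at every temporal segment
        # ("shared modality and fusion weights over time").
        self.fuse = nn.Sequential(
            nn.Linear(num_modalities * feat_dim, 512),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, rgb, flow, audio):
        # Each input: (batch, num_segments, feat_dim), one feature per snippet.
        fused = self.fuse(torch.cat([rgb, flow, audio], dim=-1))  # fuse per segment
        clip = fused.mean(dim=1)       # temporal aggregation *after* fusion
        return self.classifier(clip)   # clip-level action logits
```

A late-fusion baseline would instead aggregate each modality over time first and only then combine the resulting clip-level predictions; the paper's contrast is precisely that fusion happens before this aggregation step.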

The method draws on the neuroscience concept of temporal binding in human perception. By sampling each modality within the binding window rather than at a single synchronized instant, and by sharing modality and fusion weights across time steps, the network captures both synchronous and asynchronous interactions between the sensory inputs.
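The following small sketch shows one way such asynchronous sampling could be realized during training, under the assumption that each temporal segment acts as the binding window; `sample_tbw_indices` is a hypothetical helper, not code from the paper. Each modality draws its snippet index independently inside the segment, so the fused snippets may be offset in time while remaining bound to the same window.

```python
import random

def sample_tbw_indices(num_frames, num_segments, rng=random):
    """Per segment, draw one (rgb_t, flow_t, audio_t) frame-index triple,
    each picked independently from that segment's temporal binding window."""
    seg_len = num_frames / num_segments
    triples = []
    for k in range(num_segments):
        lo = int(k * seg_len)
        hi = max(int((k + 1) * seg_len) - 1, lo)
        # Independent draws: the modalities can be asynchronous with respect
        # to one another, but all fall within the same binding window.
        triples.append(tuple(rng.randint(lo, hi) for _ in range(3)))
    return triples

# Example: a 100-frame clip split into 3 segments / binding windows.
print(sample_tbw_indices(100, 3))
```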

Key Results

The experimental evaluation of the proposed TBN architecture is conducted using the EPIC-Kitchens dataset, the largest and most comprehensive dataset of egocentric video data available. The results demonstrate that TBN achieves superior performance on both seen and unseen test splits, surpassing previous benchmarks in action recognition tasks. Notably, the incorporation of audio as a critical modality resulted in significant performance gains, indicating its vital role alongside visual data in recognizing egocentric actions.

The study quantifies the robustness of TBN to background noise and irrelevant sounds, a prevalent challenge in egocentric recordings captured via head-mounted devices. The architecture exhibits resilience to such perturbations, reaffirming its practical applicability in real-world settings.

Implications and Future Directions

The proposed TBN architecture sets a new standard for multi-modal fusion techniques in egocentric action recognition. The ability to harness asynchronous signals effectively broadens the scope of applications across diverse domains, including sports analysis, health monitoring, and augmented reality.

Future research directions could explore adaptive mechanisms for optimizing the TBW according to varying task characteristics and modality-specific temporal dynamics. Moreover, the integration of additional sensor modalities such as depth data or haptic feedback may further enhance the model's interpretative capabilities.

Overall, the development of TBN represents an advanced step towards more holistic and integrated multi-sensory representations, contributing to the theoretical understanding of multi-modal learning and its practical deployment in numerous cutting-edge AI applications.
