Audiovisual SlowFast Networks for Video Recognition
The paper presents Audiovisual SlowFast Networks (AVSlowFast), an integrated architecture for video action recognition that processes visual and audio data within a single framework. It builds on the SlowFast architecture, known for its dual-pathway design that captures semantic content and motion at different temporal resolutions. AVSlowFast adds a Faster Audio pathway alongside the Slow and Fast visual streams, fusing sound into the visual feature hierarchy at multiple network layers.
Architecture and Methodology
The AVSlowFast model is designed to exploit audio for video understanding tasks. The architecture comprises the following components (a simplified sketch follows this list):
- Slow and Fast Visual Pathways: The Slow pathway samples frames at a low rate to capture semantic content, while the Fast pathway samples frames at a higher rate with fewer channels to capture motion dynamics.
- Faster Audio Pathway: This pathway operates at a finer temporal resolution than either visual stream, adding audio information at relatively low computational overhead.
- Lateral Connections: Audio and visual features are integrated at multiple layers to develop hierarchical audiovisual concepts, allowing sound to participate in visual feature formation.
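To make the three-pathway layout concrete, here is a minimal PyTorch-style sketch. The class name `AVSlowFastSketch`, the channel widths, the single lateral projection, and the late concatenation are illustrative assumptions; the published model uses full ResNet-style pathways with lateral audiovisual fusion at several stages rather than only at the head.

```python
# Minimal sketch of the three-pathway idea (not the authors' code).
# Pathway stubs, channel sizes, and the fusion point are illustrative.
import torch
import torch.nn as nn


class AVSlowFastSketch(nn.Module):
    def __init__(self, num_classes=400):
        super().__init__()
        # Slow pathway: few frames, many channels (semantics).
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        # Fast pathway: many frames, few channels (motion).
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        # Faster Audio pathway: 2D convolutions over a log-mel spectrogram.
        self.audio = nn.Conv2d(1, 32, kernel_size=9, padding=4)
        # Lateral connection: project audio features so they can be fused
        # with the visual streams at this stage.
        self.audio_to_visual = nn.Conv2d(32, 8, kernel_size=1)
        self.head = nn.Linear(64 + 8 + 8, num_classes)

    def forward(self, slow_frames, fast_frames, spectrogram):
        s = self.slow(slow_frames)          # (B, 64, T_s, H, W)
        f = self.fast(fast_frames)          # (B, 8, T_f, H, W)
        a = self.audio(spectrogram)         # (B, 32, freq, time)
        a_lat = self.audio_to_visual(a)     # (B, 8, freq, time)
        # Global pooling per pathway; the full model fuses at several depths
        # via lateral connections before any final concatenation.
        pooled = torch.cat([
            s.mean(dim=(2, 3, 4)),
            f.mean(dim=(2, 3, 4)),
            a_lat.mean(dim=(2, 3)),
        ], dim=1)
        return self.head(pooled)


if __name__ == "__main__":
    model = AVSlowFastSketch()
    slow = torch.randn(2, 3, 4, 224, 224)    # 4 frames at a low rate
    fast = torch.randn(2, 3, 32, 224, 224)   # 32 frames at a high rate
    spec = torch.randn(2, 1, 80, 128)        # log-mel spectrogram
    print(model(slow, fast, spec).shape)     # torch.Size([2, 400])
```

The design intent is that the Slow pathway focuses on what is in the scene, while the Fast and Audio pathways track fine temporal structure; fusing audio features into the visual streams at intermediate layers lets sound shape visual representations rather than being appended only at the classifier.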
To mitigate the mismatch in learning dynamics between the audio and visual modalities (the audio pathway tends to train faster and overfit), the paper introduces DropPathway, a regularization technique that randomly drops the Audio pathway during training and thereby aligns the learning pace of the two modalities; a sketch of the idea appears below.
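The following is a minimal sketch of the DropPathway idea, assuming a per-iteration gate that zeroes the Audio pathway's features with probability `p_drop`; the exact drop probability and schedule used in the paper are not reproduced here.

```python
# Illustrative sketch of DropPathway: during training, the Audio pathway's
# contribution is randomly dropped so the faster-learning audio stream does
# not dominate before the visual streams converge. The gate below is a
# simplification; the paper's exact drop rate and schedule may differ.
import torch
import torch.nn as nn


class DropPathway(nn.Module):
    def __init__(self, p_drop: float = 0.5):
        super().__init__()
        self.p_drop = p_drop

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # At inference time, always keep the audio features.
        if not self.training:
            return audio_features
        # With probability p_drop, drop the whole pathway for this iteration
        # by zeroing its features, so lateral connections carry no signal.
        if torch.rand(1).item() < self.p_drop:
            return torch.zeros_like(audio_features)
        return audio_features
```

When the pathway is dropped, the lateral connections carry no audio signal, so for that iteration the visual streams effectively train as a plain SlowFast network, slowing the otherwise faster-converging audio branch.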
Empirical Results
The effectiveness of AVSlowFast is demonstrated on several standard video recognition benchmarks, with state-of-the-art results reported across six action classification and detection datasets. Highlights include:
- EPIC-Kitchens: Gains of +2.9% to +4.3% in top-1 accuracy for verb, noun, and action categories, showcasing audio's significant impact in egocentric video recognition.
- Kinetics-400: A notable 1.4% top-1 accuracy improvement over the SlowFast baseline in video action classification tasks.
- Charades and AVA: Substantial improvements with a light computational footprint, reinforcing the utility of audiovisual integration.
Implications and Future Directions
The integration of audio in video recognition represents a meaningful step in mirroring multisensory human perception within AI models. By demonstrating improved classification accuracy at a reasonable computational cost, AVSlowFast highlights the potential for wider applications in contexts where sound can play a pivotal role in disambiguating actions captured in video footage. The methodology underscores the importance of interdisciplinary insights, such as those from neuroscientific studies on multisensory integration, in guiding the development of more robust models.
Looking forward, the architecture's success invites further exploration in several directions, including:
- Expanding AVSlowFast’s applicability to more diverse datasets where audio may play different roles.
- Investigating novel training mechanisms to further enhance joint modality learning.
- Exploring the model's effectiveness in other multisensory machine learning contexts, including self-supervised learning frameworks.
The introduction of AVSlowFast marks a significant enhancement in video action recognition architectures, providing a solid foundation for future advancements in integrated audiovisual AI systems.