Self-supervised Contrastive Learning for Audio-Visual Action Recognition (2204.13386v2)

Published 28 Apr 2022 in cs.CV

Abstract: The underlying correlation between audio and visual modalities can be used to derive supervisory signals for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse audio and visual modalities. To align heterogeneous audio-visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervisory signals from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of our AVCL over state-of-the-art methods on large-scale action recognition benchmarks.
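The abstract does not spell out the SelfCL objective, so the following is only a minimal sketch of the general idea behind audio-visual contrastive learning, not the paper's method. It assumes a generic symmetric InfoNCE loss in which the audio and visual embeddings of the same clip form the positive pair and all other clips in the batch serve as negatives. The function names (`info_nce`, `audio_visual_contrastive_loss`) and the temperature of 0.1 are hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i].

    Every other candidate in the batch acts as a negative.
    """
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarities scaled by the temperature.
        logits = [dot(a, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def audio_visual_contrastive_loss(audio, visual, temperature=0.1):
    """Symmetric loss: audio-to-visual plus visual-to-audio, averaged.

    Assumption: audio[i] and visual[i] come from the same unlabeled clip.
    """
    a2v = info_nce(audio, visual, temperature)
    v2a = info_nce(visual, audio, temperature)
    return 0.5 * (a2v + v2a)
```

Under these assumptions, a batch whose audio and visual embeddings agree clip by clip yields a loss near zero, while a batch with the pairs swapped yields a large loss. That gap is the training signal obtained without any action labels.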

Authors (3)

Yang Liu
Ying Tan
Haoyuan Lan

Citations (5)
