DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Published 18 Sep 2024 in cs.MM, cs.CV, cs.SD, and eess.AS | (2409.11729v1)

Abstract: Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to add an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
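
The idea of adding a label prediction loss to an existing training objective can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how audio-derived and visual-derived labels are combined, how the loss is weighted, or which loss function is used. The union merge rule, the binary cross-entropy form, the `weight` parameter, and all function names below are assumptions made purely for illustration.

```python
import math


def merge_labels(audio_labels, visual_labels, num_classes):
    """Build a multi-hot target vector from two sets of class indices.

    Assumption: labels from the two modalities are merged by union.
    """
    target = [0.0] * num_classes
    for idx in set(audio_labels) | set(visual_labels):
        target[idx] = 1.0
    return target


def label_prediction_loss(logits, target):
    """Mean binary cross-entropy over classes, computed from raw logits.

    Assumption: label prediction is treated as multi-label classification.
    """
    total = 0.0
    for z, y in zip(logits, target):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def total_loss(base_loss, logits, target, weight=1.0):
    """Add the weighted label prediction loss to an existing loss value.

    `base_loss` stands in for the existing contrastive and reconstruction
    objective; `weight` is a hypothetical balancing coefficient.
    """
    return base_loss + weight * label_prediction_loss(logits, target)


# Usage: class 0 comes from the audio side, class 2 from the visual side.
target = merge_labels([0], [2], num_classes=4)
loss = total_loss(base_loss=1.0, logits=[2.0, -1.0, 0.5, -3.0], target=target, weight=0.5)
```

In this sketch the object labels act only as an auxiliary training signal: the base objective is unchanged, and the extra term penalizes predictions that disagree with the automatically generated labels.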