Wav2CLIP: Learning Robust Audio Representations From CLIP
The paper under examination presents Wav2CLIP, a novel method for learning audio representations by leveraging Contrastive Language-Image Pre-training (CLIP). This research stands out for its methodical approach to evaluating Wav2CLIP across a breadth of audio tasks—including classification, retrieval, and generation. The results demonstrate that Wav2CLIP can outperform several existing pre-trained audio representation models while also enabling impressive multimodal applications, such as zero-shot classification and cross-modal retrieval.
Methodology Overview
The methodological core of Wav2CLIP involves distilling an audio model from the CLIP framework. CLIP, renowned for its effective use of large-scale image and text data, employs a contrastive loss to map images and text into a shared embedding space. Wav2CLIP adapts this by training a new audio encoder whose embeddings align, via a contrastive loss, with the CLIP image embeddings of the corresponding video frames; the frozen CLIP image encoder thus supplies the supervisory signal. Because only the audio encoder is trained, there is no need for joint training of visual and audio models, making Wav2CLIP lightweight and efficient compared to other multimodal learning frameworks.
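To make the distillation setup concrete, the following is a minimal PyTorch sketch assuming the open-source `clip` package (github.com/openai/CLIP). The `AudioEncoder` architecture, layer sizes, prompt of log-mel inputs, and hyperparameters are illustrative placeholders rather than the paper's exact model; only the overall pattern (frozen CLIP image teacher, trainable audio student, symmetric contrastive loss) reflects the described method.

```python
# Sketch of CLIP-to-audio distillation: a toy audio encoder is trained so that
# its embeddings align with frozen CLIP image embeddings of paired video frames.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP image encoder stays frozen (teacher)

class AudioEncoder(nn.Module):
    """Toy CNN over log-mel spectrograms -> 512-d embedding (CLIP's ViT-B/32 size)."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel):                  # mel: (B, 1, n_mels, time)
        h = self.conv(mel).flatten(1)        # (B, 64)
        return self.proj(h)                  # (B, 512)

audio_encoder = AudioEncoder().to(device)
optimizer = torch.optim.Adam(audio_encoder.parameters(), lr=1e-4)

def contrastive_loss(audio_emb, image_emb, temperature: float = 0.07):
    """Symmetric CLIP-style loss: matching (audio, frame) pairs sit on the diagonal."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(len(a), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def train_step(mel_batch, frame_batch):
    """mel_batch: (B, 1, n_mels, T); frame_batch: CLIP-preprocessed frames (B, 3, 224, 224)."""
    with torch.no_grad():
        image_emb = clip_model.encode_image(frame_batch.to(device)).float()
    audio_emb = audio_encoder(mel_batch.to(device))
    loss = contrastive_loss(audio_emb, image_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point is that gradients flow only into the audio encoder, which is what keeps the approach lightweight relative to jointly trained audio-visual models.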
Experimental Evaluation
The paper reports an extensive experimental validation of Wav2CLIP, where it is set against multiple established audio representation models, such as OpenL3 and YamNet. On tasks such as audio classification and retrieval, Wav2CLIP exhibits comparable or superior performance. Notably, the paper also explores zero-shot tasks, underscoring Wav2CLIP's capacity to perform audio classification without task-specific training—although zero-shot performance is understandably below that of models fine-tuned for each task.
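Zero-shot classification falls out of the shared embedding space almost for free: class names are embedded with CLIP's text encoder and the audio clip is assigned to the nearest class by cosine similarity. The sketch below reuses the `audio_encoder` and `clip_model` objects from the snippet above; the prompt template and example class names are assumptions, not the paper's exact evaluation protocol.

```python
# Zero-shot audio classification by matching audio embeddings to CLIP text embeddings.
import torch
import torch.nn.functional as F
import clip

@torch.no_grad()
def zero_shot_classify(mel, class_names, audio_encoder, clip_model, device="cpu"):
    """Return the most likely class name for one audio clip given as a log-mel tensor (1, 1, n_mels, T)."""
    prompts = clip.tokenize([f"a sound of {c}" for c in class_names]).to(device)
    text_emb = F.normalize(clip_model.encode_text(prompts).float(), dim=-1)   # (C, 512)
    audio_emb = F.normalize(audio_encoder(mel.to(device)), dim=-1)            # (1, 512)
    scores = (audio_emb @ text_emb.t()).squeeze(0)                            # (C,)
    return class_names[scores.argmax().item()]

# Example usage with hypothetical class labels:
# zero_shot_classify(mel, ["dog bark", "siren", "jackhammer"], audio_encoder, clip_model, device)
```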
Quantitative Findings
- Classification Tasks: Wav2CLIP achieves high accuracy on datasets such as UrbanSound8K and ESC-50, outperforming OpenL3 overall and performing roughly on par with YamNet.
- Data Efficiency: Remarkably, Wav2CLIP requires only about 10% of the labeled data to reach results on downstream tasks that are competitive with fully supervised baselines.
- Retrieval and Captioning: In cross-modal retrieval (VGGSound) and audio captioning (Clotho), Wav2CLIP demonstrates its versatility: because its audio embeddings live in CLIP's shared space, audio can be matched directly against image and text embeddings (see the retrieval sketch after this list).
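The retrieval setup reduces to cosine-similarity ranking in the shared space. The sketch below, again reusing `audio_encoder` and `clip_model` from the earlier snippets, shows image-to-audio retrieval; the function name and ranking details are illustrative and not the paper's evaluation code.

```python
# Image-to-audio retrieval: rank candidate audio clips by similarity to one image query.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_audio_by_image(image, mel_clips, audio_encoder, clip_model, device="cpu"):
    """image: CLIP-preprocessed tensor (1, 3, 224, 224); mel_clips: (N, 1, n_mels, T).
    Returns candidate indices sorted from most to least similar."""
    query = F.normalize(clip_model.encode_image(image.to(device)).float(), dim=-1)  # (1, 512)
    audio = F.normalize(audio_encoder(mel_clips.to(device)), dim=-1)                # (N, 512)
    scores = (query @ audio.t()).squeeze(0)                                         # (N,)
    return scores.argsort(descending=True).tolist()
```

Swapping `encode_image` for `encode_text` on a caption gives the analogous text-to-audio query, which is the mechanism the captioning and retrieval experiments rely on.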
Implications and Future Directions
The research identifies several promising avenues for future exploration. One potential direction is refining the loss functions and projection layers, particularly to enhance handling of multimodal data. Another is expanding the range of Wav2CLIP's generative capabilities, including audio synthesis from visual or textual prompts.
The practical implications of Wav2CLIP are significant. By efficiently projecting audio into a shared semantic space with images and text, the approach has potential applications in surveillance, automated content generation, and multimedia retrieval systems. Furthermore, the open-sourcing of the model and its weights encourages community engagement and exploration of applications beyond the scope of this paper.
Overall, Wav2CLIP represents a step forward in multimodal learning: by adapting contrastive multimodal representation learning to the auditory domain, it opens the door to further advances across a range of cross-modal AI tasks.