Overview of "EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning"
This paper introduces EnCLAP, a novel framework for automated audio captioning (AAC) that builds on recent advances in acoustic representation learning and language modeling. EnCLAP combines two acoustic encoders, EnCodec and CLAP, to extract complementary features from audio inputs, and employs BART, a pretrained sequence-to-sequence language model, as the caption generator. The authors also propose an auxiliary training objective, masked codec modeling (MCM), designed to enhance the acoustic awareness of BART. The framework's efficacy is demonstrated by surpassing baseline performance on the widely used AAC benchmarks AudioCaps and Clotho.
Key Components and Methodology
- Acoustic Representation Models:
  - EnCodec: A convolutional autoencoder that converts audio signals into discrete neural codec sequences using residual vector quantization (RVQ). These discrete codes are better suited as inputs to a pretrained language model than continuous spectrogram features.
  - CLAP: A contrastive learning framework that aligns audio and text in a joint embedding space; its sequence-level embedding captures the overall semantics of the audio input.
- Caption Generation:
  - EnCLAP employs BART, a transformer-based encoder-decoder language model pretrained on text, fine-tuned here on audio captioning data to generate captions.
  - BART combines the time-step-level discrete features from EnCodec with the sequence-level semantic embedding from CLAP to produce a text description of the audio input (a sketch of this pipeline follows the list).
- Masked Codec Modeling:
  - MCM is an auxiliary task integrated into the training process: portions of the EnCodec code sequence are masked, and the model is trained to predict the masked codes. The strategy is akin to the masked language modeling used to pretrain models such as BERT, and it promotes better contextual understanding of the acoustic sequence within BART (a sketch also follows the list).
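A minimal sketch of the captioning pipeline is shown below, using the Hugging Face transformers interfaces for EnCodec, CLAP, and BART. The checkpoint names, the single-codebook simplification, and the CLAP-embedding-as-prefix layout are illustrative assumptions, not the authors' implementation.

```python
# Sketch: extract EnCodec codes and a CLAP embedding, then feed both to BART
# via inputs_embeds. Checkpoints and the prefix layout are assumptions.
import torch
import torch.nn as nn
from transformers import (
    AutoProcessor, EncodecModel,                  # neural audio codec
    ClapModel, ClapProcessor,                     # audio-text joint embedding
    BartForConditionalGeneration, BartTokenizer,  # caption generator
)

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
encodec = EncodecModel.from_pretrained("facebook/encodec_24khz")
encodec_proc = AutoProcessor.from_pretrained("facebook/encodec_24khz")
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

d_model = bart.config.d_model
# These two layers would be learned jointly with BART during fine-tuning.
code_embed = nn.Embedding(encodec.config.codebook_size, d_model)  # one RVQ codebook
clap_proj = nn.Linear(clap.config.projection_dim, d_model)

def audio_to_inputs_embeds(wave_24k, wave_48k):
    """wave_24k / wave_48k: 1-D numpy arrays at EnCodec's / CLAP's sample rates."""
    with torch.no_grad():
        enc_in = encodec_proc(raw_audio=wave_24k, sampling_rate=24_000,
                              return_tensors="pt")
        codes = encodec.encode(enc_in["input_values"]).audio_codes  # (chunks, B, n_q, T)
        clap_in = clap_proc(audios=wave_48k, sampling_rate=48_000,
                            return_tensors="pt")
        clap_emb = clap.get_audio_features(**clap_in)               # (B, proj_dim)
    # Use only the first RVQ codebook here for brevity; EnCLAP uses all levels.
    frame_emb = code_embed(codes[0, :, 0, :])                       # (B, T, d_model)
    prefix = clap_proj(clap_emb).unsqueeze(1)                       # (B, 1, d_model)
    return torch.cat([prefix, frame_emb], dim=1)

def caption(wave_24k, wave_48k):
    embeds = audio_to_inputs_embeds(wave_24k, wave_48k)
    ids = bart.generate(inputs_embeds=embeds, max_length=50)
    return tokenizer.batch_decode(ids, skip_special_tokens=True)
```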
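The MCM objective can be viewed as a BERT-style masked prediction loss over the discrete code sequence. The sketch below is a minimal illustration under assumed hyperparameters (mask ratio, an extra [MASK] index, a linear prediction head) and handles a single RVQ codebook; in EnCLAP the loss is an auxiliary objective trained alongside caption generation.

```python
# Sketch of a masked codec modeling (MCM) auxiliary loss over one RVQ codebook.
# Mask ratio, [MASK] index, and head design are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

codebook_size = 1024   # entries per EnCodec RVQ codebook
d_model = 768          # hidden size of the BART encoder
mask_ratio = 0.15      # assumed masking probability per time step

code_embed = nn.Embedding(codebook_size + 1, d_model)  # last index serves as [MASK]
mcm_head = nn.Linear(d_model, codebook_size)           # predicts the original code id
MASK_ID = codebook_size

def mcm_loss(codes, encoder):
    """codes: (B, T) long tensor of EnCodec indices from one codebook.
    encoder: any module mapping (B, T, d_model) -> (B, T, d_model) contextual
    states, e.g. a wrapper around BART's encoder that returns last_hidden_state."""
    mask = torch.rand_like(codes, dtype=torch.float) < mask_ratio  # (B, T) bool
    corrupted = codes.masked_fill(mask, MASK_ID)                   # replace with [MASK]
    states = encoder(code_embed(corrupted))                        # (B, T, d_model)
    logits = mcm_head(states)                                      # (B, T, codebook_size)
    # Only masked positions contribute, as in BERT-style masked language modeling.
    return F.cross_entropy(logits[mask], codes[mask])

# Toy usage with random codes and a generic Transformer encoder standing in for BART:
toy_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=1)
codes = torch.randint(0, codebook_size, (2, 100))
loss = mcm_loss(codes, toy_encoder)
```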
Experimental Results
The paper presents a series of experiments on the AudioCaps and Clotho datasets to evaluate the EnCLAP framework:
- Performance Metrics: Model performance was measured using standard AAC evaluation metrics: METEOR, CIDEr, SPICE, and SPIDEr (the mean of CIDEr and SPICE).
- Numerical Results: EnCLAP achieved state-of-the-art results on AudioCaps, with the large variant of the model outperforming all baselines, including those that employ extensive pretraining or large external datasets like WavCaps.
- Comparison on Clotho: EnCLAP outperformed baseline models in both training settings: training exclusively on Clotho, and pretraining on AudioCaps followed by fine-tuning on Clotho.
Implications and Future Work
EnCLAP's integration strategy addresses key challenges in AAC such as the scarcity of training data and the complexity of converting audio signals into meaningful text representations. By leveraging pretrained models for distinct components of the task, EnCLAP reduces the gap between machine-generated and human-authored captions.
Theoretical implications of this research include insights into AAC as a cross-modal task, emphasizing the importance of multimodal embeddings and discrete feature representations. Practically, the improvements in AAC suggest robust potential applications in accessibility technologies and content indexing, where accurate audio descriptions are vital.
For future research, the paper suggests expanding EnCLAP to encompass additional audio-related tasks such as music captioning and audio generation. This extension could enhance our understanding of cross-modal learning and its applicability in broader AI domains.