Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning (2309.11500v4)

Published 20 Sep 2023 in cs.SD, cs.CV, cs.MM, and eess.AS

Abstract: Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames, audio streams. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs, to determine audio-visual synchronisation, generate image captions, object detection, or audio tags for specific videos. Subsequently, we employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on our dataset and show performance improvement on various downstream tasks, for example, audio-language retrieval, audio captioning, zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.

A Large-Scale Dataset for Audio-Language Representation Learning: Auto-ACD

The paper introduces an automated methodology for constructing a large-scale audio-language dataset, termed Auto-ACD, comprising over 1.5 million audio-text pairs. The dataset addresses the limitations of existing audio-language datasets, which often suffer from constrained volume, simplistic linguistic content, and labor-intensive collection processes. Auto-ACD is a significant contribution that augments the quality and scale of the resources available in this domain.

Contributions

The authors develop an innovative audio caption generation pipeline utilizing publicly accessible tools and APIs. The generated dataset serves to bridge the gap in large-scale audio-language resources, enhancing model training in various audio-centric tasks such as audio-language retrieval, audio captioning, and environment classification.

Methodology

The dataset collection relies on the intrinsic correlation between the audio and visual streams of multimodal datasets such as VGGSound and AudioSet. By leveraging advanced models from the vision, language, and audio communities, the authors automate the extraction of rich textual descriptions for audio content. This automation contrasts with existing datasets such as Clotho and AudioCaps, which rely primarily on human annotation and are therefore harder to scale.

The approach comprises several steps: BLIP-2 provides image captions and Grounding DINO performs object detection to supply visual context, PANNs are employed for audio tagging, and ChatGPT paraphrases the collected multimodal clues into coherent, meaningful audio descriptions that capture a wide array of auditory attributes and environmental contexts. A sketch of how such a pipeline could be assembled is given below.
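The following is a minimal, hypothetical sketch (not the authors' code) of the caption-assembly step of such a pipeline: multimodal clues already extracted by a BLIP-2-style captioner, a Grounding DINO-style detector, a PANNs-style tagger, and a scene classifier are packed into a single prompt for an LLM paraphrasing step. The `Clues` structure, the prompt wording, and the `paraphrase_with_llm` stub are illustrative assumptions.

```python
# Hypothetical sketch of the caption-assembly step in an Auto-ACD-style pipeline.
# The clue fields are assumed to come from off-the-shelf models (BLIP-2, Grounding
# DINO, PANNs, a scene classifier); the LLM call is left as a stub.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Clues:
    image_caption: str                                    # frame caption, e.g. from BLIP-2
    objects: List[str] = field(default_factory=list)      # detections, e.g. from Grounding DINO
    audio_tags: List[str] = field(default_factory=list)   # top-k tags, e.g. from PANNs
    scene: str = "unknown"                                # scene label, e.g. from a Places-style model


def build_caption_prompt(clues: Clues) -> str:
    """Pack the extracted multimodal clues into one prompt for the LLM paraphrasing step."""
    return (
        "Write one fluent sentence describing only what can be heard, "
        "including the likely environment.\n"
        f"Visual caption: {clues.image_caption}\n"
        f"Detected objects: {', '.join(clues.objects) or 'none'}\n"
        f"Audio tags: {', '.join(clues.audio_tags) or 'none'}\n"
        f"Scene: {clues.scene}"
    )


def paraphrase_with_llm(prompt: str) -> str:
    """Stub for the ChatGPT-style paraphrasing call; plug in an LLM API of your choice."""
    raise NotImplementedError


if __name__ == "__main__":
    clues = Clues(
        image_caption="a man rides a motorcycle down a city street",
        objects=["motorcycle", "person", "car"],
        audio_tags=["engine", "traffic noise", "speech"],
        scene="city street",
    )
    print(build_caption_prompt(clues))
```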

Numerical Results and Benchmarks

Evaluations of models trained on Auto-ACD demonstrate significant improvements in audio-language retrieval, setting new baselines for this task. Notably, audio-language models fine-tuned on Auto-ACD outperform counterparts trained on earlier datasets in recall metrics across AudioCaps, Clotho, and the newly introduced Auto-ACD benchmark. This benchmark not only validates the high-quality annotations in Auto-ACD but also emphasizes the importance of rich environmental context in facilitating robust audio understanding.
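To make the retrieval metric concrete, the snippet below computes recall@k from a similarity matrix between audio and text embeddings, with ground-truth pairs assumed to lie on the diagonal. This is a generic formulation of the standard protocol, not the authors' evaluation code; real benchmarks such as AudioCaps additionally provide multiple reference captions per audio clip.

```python
# Generic recall@k for cross-modal retrieval: rows are audio queries, columns are
# candidate captions, and the ground-truth caption for row i is assumed to sit in
# column i. Illustrative only; not the authors' evaluation code.

import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item ranks within the top k candidates."""
    n = similarity.shape[0]
    ranked = (-similarity).argsort(axis=1)          # column indices, best match first
    hits = sum(int(i in ranked[i, :k]) for i in range(n))
    return hits / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(100, 100))
    sim += 2.0 * np.eye(100)                        # make the true pairs stand out
    for k in (1, 5, 10):
        print(f"R@{k}: {recall_at_k(sim, k):.3f}")
```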

Furthermore, experiments on automatic audio captioning reveal that models whose audio backbones are trained on Auto-ACD show noticeable enhancements. This is corroborated by metrics such as METEOR, ROUGE-L, and SPIDEr, suggesting a greater capacity of the models to capture and represent nuanced audio phenomena.
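As a point of reference for the last metric, SPIDEr is defined as the arithmetic mean of the CIDEr and SPICE scores (Liu et al., 2017); the trivial helper below only makes that aggregation explicit, with the individual scores assumed to come from a standard captioning-evaluation toolkit.

```python
def spider(cider_score: float, spice_score: float) -> float:
    """SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return 0.5 * (cider_score + spice_score)


# Example with made-up scores from a captioning-evaluation toolkit:
print(spider(cider_score=0.46, spice_score=0.13))  # 0.295
```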

Implications and Future Directions

This work highlights the significance of comprehensive, automated dataset creation in the transition towards effective data-centric model training. The scalability and quality of datasets like Auto-ACD illustrate the potential for large-scale automated dataset generation to transform representation learning in audio-language tasks.

Looking forward, Auto-ACD sets a precedent for future research to explore further integration of multimodal data. Future developments could aim at refining the captioning process with newer LLMs, improving context capture, and extending the methodology to cover other multimodal applications such as video-language generation.

In conclusion, the development and release of Auto-ACD represent a substantial advancement in the field of audio-language representation learning, potentially driving forward applications and models that require large, diverse, and contextually rich training data.

References (31)
  1. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021, pp. 8748–8763.
  2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  3. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
  4. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF CVPR, 2022, pp. 10684–10695.
  5. W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi, “Multimodal C4: An open, billion-scale corpus of images interleaved with text,” arXiv preprint arXiv:2304.06939, 2023.
  6. C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” in Proc. NIPS, vol. 35, 2022, pp. 25278–25294.
  7. K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. IEEE ICASSP, 2020, pp. 736–740.
  8. C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL, 2019, pp. 119–132.
  9. Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
  10. X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
  11. H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound: A large-scale audio-visual dataset,” in Proc. IEEE ICASSP, 2020, pp. 721–725.
  12. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017, pp. 776–780.
  13. A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  14. T. Han, W. Xie, and A. Zisserman, “Video representation learning by dense predictive coding,” in Proc. ICCVW, 2019, pp. 1–13.
  15. J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023, pp. 1–13.
  16. S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
  17. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE/CVF CVPR, 2009, pp. 248–255.
  18. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE TPAMI, vol. 40, no. 6, pp. 1452–1464, 2017.
  19. Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020.
  20. X. Xu, Z. Zhang, Z. Zhou, P. Zhang, Z. Xie, M. Wu, and K. Q. Zhu, “Blat: Bootstrapping language-audio pre-training based on audioset tag-guided synthetic data,” arXiv preprint arXiv:2303.07902, 2023.
  21. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021, pp. 8748–8763.
  22. K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. IEEE ICASSP, 2022, pp. 646–650.
  23. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  24. R. Mokady, A. Hertz, and A. H. Bermano, “Clipcap: Clip prefix for image captioning,” arXiv preprint arXiv:2111.09734, 2021.
  25. T. Han, M. Bain, A. Nagrani, G. Varol, W. Xie, and A. Zisserman, “Autoad: Movie description in context,” in Proc. IEEE/CVF CVPR, 2023, pp. 18930–18940.
  26. I. Martin Morato and A. Mesaros, “Diversity and bias in audio captioning datasets,” in Proc. DCASE, 2021, pp. 90–94.
  27. S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in ACL Workshop on MT, 2005, pp. 65–72.
  28. C.-Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” in NAACL, 2003, pp. 150–157.
  29. S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in Proc. IEEE/CVF CVPR, 2017, pp. 873–881.
  30. N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proc. EMNLP, 2019, pp. 3982–3992.
  31. T. Heittola, A. Mesaros, and T. Virtanen, “Tau urban acoustic scenes 2020 mobile, development dataset,” 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3670167
Authors (4)
  1. Luoyi Sun
  2. Xuenan Xu
  3. Mengyue Wu
  4. Weidi Xie