AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations (2405.11093v2)
Abstract: Multi-modal learning in the audio-language domain has seen significant advancements in recent years. However, audio-language learning is held back by data that is scarcer and of lower quality than what is available for image-language tasks. Existing audio-language datasets are notably smaller, and manual labeling is slow because an annotator must listen to an entire audio clip to label it accurately. Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations. Leveraging an LLM, we generate descriptions of the augmented audio clips with a prompt template. This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models. Integrating our dataset improves model performance on benchmarks by providing diversified and better-aligned examples. Notably, our dataset addresses the absence of modifiers (adjectives and adverbs) in existing datasets. By enabling models to learn these concepts, and by generating hard negative examples during training, we achieve state-of-the-art performance on multiple benchmarks.
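The pipeline described in the abstract (apply a signal processing operation, record a matching modifier, fill a prompt template for the LLM, and form a hard negative) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the template wording, the function names, the choice of a volume change as the only operation, and the modifier-swap rule for hard negatives are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical prompt template; the actual template used for AudioSetMix is not shown here.
PROMPT_TEMPLATE = (
    "Write one sentence describing an audio clip. "
    "Sound events: {labels}. The clip has been made {modifier}."
)

# Assumed rule for hard negatives: swap the modifier for its opposite.
OPPOSITE = {"louder": "quieter", "quieter": "louder"}


def change_volume(wave: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale a waveform by a gain in decibels, clipping to [-1, 1]."""
    scaled = wave * (10.0 ** (gain_db / 20.0))
    return np.clip(scaled, -1.0, 1.0)


def volume_modifier(gain_db: float) -> str:
    """Map the sign of a non-zero gain to a natural-language modifier."""
    if gain_db == 0:
        raise ValueError("gain_db must be non-zero to yield a modifier")
    return "louder" if gain_db > 0 else "quieter"


def make_example(wave: np.ndarray, labels: list[str], gain_db: float):
    """Return the augmented clip, the LLM prompt, and a hard-negative prompt."""
    augmented = change_volume(wave, gain_db)
    modifier = volume_modifier(gain_db)
    label_text = ", ".join(labels)
    prompt = PROMPT_TEMPLATE.format(labels=label_text, modifier=modifier)
    negative = PROMPT_TEMPLATE.format(labels=label_text, modifier=OPPOSITE[modifier])
    return augmented, prompt, negative


# One second of a quiet 440 Hz tone at 16 kHz stands in for a real clip.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clip = 0.1 * np.sin(2.0 * np.pi * 440.0 * t)
augmented, prompt, negative = make_example(clip, ["dog barking"], gain_db=6.0)
```

In this sketch the prompt would be sent to an LLM to obtain the caption, while the negative prompt describes the same sound events with the wrong modifier, which is one plausible way to build a hard negative for contrastive training.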