Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning (2309.11500v4)

Published 20 Sep 2023 in cs.SD, cs.CV, cs.MM, and eess.AS

Abstract: Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames, audio streams. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs, to determine audio-visual synchronisation, generate image captions, object detection, or audio tags for specific videos. Subsequently, we employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on our dataset and show performance improvement on various downstream tasks, for example, audio-language retrieval, audio captioning, zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.

A Large-Scale Dataset for Audio-Language Representation Learning: Auto-ACD

The paper introduces an automated methodology for constructing a large-scale audio-language dataset, termed Auto-ACD, comprising over 1.5 million audio-text pairs. The dataset addresses the limitations of existing audio-language datasets, which often suffer from constrained volume, simplistic linguistic content, and labor-intensive collection processes. Auto-ACD is a significant contribution that augments the quality and scale of the resources available in this domain.

Contributions

The authors develop an innovative audio caption generation pipeline utilizing publicly accessible tools and APIs. The generated dataset serves to bridge the gap in large-scale audio-language resources, enhancing model training in various audio-centric tasks such as audio-language retrieval, audio captioning, and environment classification.

Methodology

The dataset collection relies on the intrinsic correlation between the audio and visual streams of multimodal datasets such as VGGSound and AudioSet. By leveraging advanced models from the vision, language, and audio communities, the authors automate the extraction of rich textual descriptions for audio content. This automation contrasts with existing datasets such as Clotho and AudioCaps, which rely primarily on human annotation and are therefore harder to scale.

The approach comprises several steps: BLIP-2 provides image captions and Grounding DINO performs object detection to supply visual context, PANNs are employed for audio tagging, and ChatGPT paraphrases the collected multimodal clues into coherent, meaningful audio descriptions that capture a wide array of auditory attributes and environmental contexts. A sketch of how such a pipeline could be assembled is given below.
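The following is a minimal, hypothetical sketch (not the authors' code) of the caption-assembly step of such a pipeline: multimodal clues already extracted by a BLIP-2-style captioner, a Grounding DINO-style detector, a PANNs-style tagger, and a scene classifier are packed into a single prompt for an LLM paraphrasing step. The `Clues` structure, the prompt wording, and the `paraphrase_with_llm` stub are illustrative assumptions.

```python
# Hypothetical sketch of the caption-assembly step in an Auto-ACD-style pipeline.
# The clue fields are assumed to come from off-the-shelf models (BLIP-2, Grounding
# DINO, PANNs, a scene classifier); the LLM call is left as a stub.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Clues:
    image_caption: str                                    # frame caption, e.g. from BLIP-2
    objects: List[str] = field(default_factory=list)      # detections, e.g. from Grounding DINO
    audio_tags: List[str] = field(default_factory=list)   # top-k tags, e.g. from PANNs
    scene: str = "unknown"                                # scene label, e.g. from a Places-style model


def build_caption_prompt(clues: Clues) -> str:
    """Pack the extracted multimodal clues into one prompt for the LLM paraphrasing step."""
    return (
        "Write one fluent sentence describing only what can be heard, "
        "including the likely environment.\n"
        f"Visual caption: {clues.image_caption}\n"
        f"Detected objects: {', '.join(clues.objects) or 'none'}\n"
        f"Audio tags: {', '.join(clues.audio_tags) or 'none'}\n"
        f"Scene: {clues.scene}"
    )


def paraphrase_with_llm(prompt: str) -> str:
    """Stub for the ChatGPT-style paraphrasing call; plug in an LLM API of your choice."""
    raise NotImplementedError


if __name__ == "__main__":
    clues = Clues(
        image_caption="a man rides a motorcycle down a city street",
        objects=["motorcycle", "person", "car"],
        audio_tags=["engine", "traffic noise", "speech"],
        scene="city street",
    )
    print(build_caption_prompt(clues))
```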

Numerical Results and Benchmarks

Evaluations of models trained on Auto-ACD demonstrate significant improvements in audio-language retrieval, setting new baselines for this task. Notably, audio-language models fine-tuned on Auto-ACD outperform counterparts trained on earlier datasets in recall metrics across AudioCaps, Clotho, and the newly introduced Auto-ACD benchmark. This benchmark not only validates the high-quality annotations in Auto-ACD but also emphasizes the importance of rich environmental context in facilitating robust audio understanding.
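To make the retrieval metric concrete, the snippet below computes recall@k from a similarity matrix between audio and text embeddings, with ground-truth pairs assumed to lie on the diagonal. This is a generic formulation of the standard protocol, not the authors' evaluation code; real benchmarks such as AudioCaps additionally provide multiple reference captions per audio clip.

```python
# Generic recall@k for cross-modal retrieval: rows are audio queries, columns are
# candidate captions, and the ground-truth caption for row i is assumed to sit in
# column i. Illustrative only; not the authors' evaluation code.

import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item ranks within the top k candidates."""
    n = similarity.shape[0]
    ranked = (-similarity).argsort(axis=1)          # column indices, best match first
    hits = sum(int(i in ranked[i, :k]) for i in range(n))
    return hits / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(100, 100))
    sim += 2.0 * np.eye(100)                        # make the true pairs stand out
    for k in (1, 5, 10):
        print(f"R@{k}: {recall_at_k(sim, k):.3f}")
```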

Furthermore, experiments on automatic audio captioning reveal that models whose audio backbones are trained on Auto-ACD show noticeable enhancements. This is corroborated by metrics such as METEOR, ROUGE-L, and SPIDEr, suggesting a greater capacity of the models to capture and represent nuanced audio phenomena.
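As a point of reference for the last metric, SPIDEr is defined as the arithmetic mean of the CIDEr and SPICE scores (Liu et al., 2017); the trivial helper below only makes that aggregation explicit, with the individual scores assumed to come from a standard captioning-evaluation toolkit.

```python
def spider(cider_score: float, spice_score: float) -> float:
    """SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return 0.5 * (cider_score + spice_score)


# Example with made-up scores from a captioning-evaluation toolkit:
print(spider(cider_score=0.46, spice_score=0.13))  # 0.295
```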

Implications and Future Directions

This work highlights the significance of comprehensive, automated dataset creation in the transition towards effective data-centric model training. The scalability and quality of datasets like Auto-ACD illustrate the potential for large-scale automated dataset generation to transform representation learning in audio-language tasks.

Looking forward, Auto-ACD sets a precedent for future research to explore further integration of multimodal data. Future developments could aim at refining the captioning process with newer LLMs, improving context capture, and extending the methodology to cover other multimodal applications such as video-language generation.

In conclusion, the development and release of Auto-ACD represent a substantial advancement in the field of audio-language representation learning, potentially driving forward applications and models that require large, diverse, and contextually rich training data.

References (31)
  1. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021, pp. 8748–8763.
  2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  3. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
  4. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF CVPR, 2022, pp. 10684–10695.
  5. W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi, “Multimodal C4: An open, billion-scale corpus of images interleaved with text,” arXiv preprint arXiv:2304.06939, 2023.
  6. C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” in Proc. NIPS, vol. 35, 2022, pp. 25278–25294.
  7. K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. IEEE ICASSP, 2020, pp. 736–740.
  8. C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL, 2019, pp. 119–132.
  9. Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
  10. X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
  11. H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound: A large-scale audio-visual dataset,” in Proc. IEEE ICASSP, 2020, pp. 721–725.
  12. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017, pp. 776–780.
  13. A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  14. T. Han, W. Xie, and A. Zisserman, “Video representation learning by dense predictive coding,” in Proc. ICCVW, 2019, pp. 1–13.
  15. J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023, pp. 1–13.
  16. S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
  17. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE/CVF CVPR, 2009, pp. 248–255.
  18. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE TPAMI, vol. 40, no. 6, pp. 1452–1464, 2017.
  19. Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020.
  20. X. Xu, Z. Zhang, Z. Zhou, P. Zhang, Z. Xie, M. Wu, and K. Q. Zhu, “Blat: Bootstrapping language-audio pre-training based on audioset tag-guided synthetic data,” arXiv preprint arXiv:2303.07902, 2023.
  21. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021, pp. 8748–8763.
  22. K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. IEEE ICASSP, 2022, pp. 646–650.
  23. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  24. R. Mokady, A. Hertz, and A. H. Bermano, “Clipcap: Clip prefix for image captioning,” arXiv preprint arXiv:2111.09734, 2021.
  25. T. Han, M. Bain, A. Nagrani, G. Varol, W. Xie, and A. Zisserman, “Autoad: Movie description in context,” in Proc. IEEE/CVF CVPR, 2023, pp. 18930–18940.
  26. I. Martin Morato and A. Mesaros, “Diversity and bias in audio captioning datasets,” in Proc. DCASE, 2021, pp. 90–94.
  27. S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in ACL Workshop on MT, 2005, pp. 65–72.
  28. C.-Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” in NAACL, 2003, pp. 150–157.
  29. S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in Proc. IEEE/CVF CVPR, 2017, pp. 873–881.
  30. N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proc. EMNLP, 2019, pp. 3982–3992.
  31. T. Heittola, A. Mesaros, and T. Virtanen, “Tau urban acoustic scenes 2020 mobile, development dataset,” 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3670167
Authors (4)
  1. Luoyi Sun
  2. Xuenan Xu
  3. Mengyue Wu
  4. Weidi Xie