A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds (2403.04594v1)
Abstract: Audio-text cross-modal learning has recently attracted increasing attention. However, most existing audio-text datasets contain only simple descriptions of sound events, so the advantage such descriptions offer over classification labels is limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound event labels. Based on this analysis, we propose an automatic pipeline for curating audio-text pairs with rich details. Leveraging the fact that sounds can be mixed and concatenated in the time domain, we control four aspects of detail when simulating audio mixtures: temporal relationship, loudness, speaker identity, and occurrence number. The corresponding details are transformed into captions by large language models (LLMs), yielding audio-text pairs with rich textual detail. We validate the effectiveness of our pipeline with a small amount of simulated data, demonstrating that the simulated data enables models to learn detailed audio captioning.
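To make the simulation idea concrete, the following is a minimal sketch (not the authors' code) of how two single-event clips could be combined in the time domain while recording the controlled details that an LLM would later rewrite into a caption. The function name `simulate_mixture`, the sample rate, and the loudness-offset parameter are illustrative assumptions, not parts of the paper's pipeline.

```python
# Minimal sketch of time-domain audio mixture simulation with recorded details.
# Assumes single-event clips as 1-D float32 NumPy arrays at a shared sample rate.
import numpy as np
import soundfile as sf

SR = 32000  # assumed sample rate


def db_to_gain(db: float) -> float:
    """Convert a relative loudness offset in dB to a linear gain."""
    return 10.0 ** (db / 20.0)


def simulate_mixture(event_a: np.ndarray, event_b: np.ndarray,
                     gap_sec: float = 0.5, b_offset_db: float = -6.0):
    """Place event_b after event_a with a silent gap and a loudness offset.

    Returns the simulated mixture and a metadata dict describing the controlled
    details (temporal relationship, relative loudness, occurrence number).
    """
    gap = np.zeros(int(gap_sec * SR), dtype=np.float32)
    b_scaled = event_b * db_to_gain(b_offset_db)
    mixture = np.concatenate([event_a, gap, b_scaled]).astype(np.float32)

    # Peak-normalize to avoid clipping when writing to disk.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak

    details = {
        "temporal_relationship": "event A happens before event B",
        "relative_loudness": f"event B is {abs(b_offset_db):.0f} dB quieter than event A",
        "occurrence_number": {"A": 1, "B": 1},
    }
    return mixture, details


if __name__ == "__main__":
    # Placeholder single-event clips; in practice these would be loaded from a
    # corpus of single-event recordings.
    event_a = 0.5 * np.sin(2 * np.pi * 440 * np.arange(SR) / SR).astype(np.float32)
    event_b = 0.5 * np.sin(2 * np.pi * 220 * np.arange(SR // 2) / SR).astype(np.float32)

    mix, details = simulate_mixture(event_a, event_b)
    sf.write("simulated_mixture.wav", mix, SR)
    print(details)  # structured details to be rewritten into a caption by an LLM
```

In the paper's pipeline, the structured metadata returned alongside the mixture is what gets converted into a natural-language caption; the sketch above only illustrates the temporal, loudness, and occurrence-count controls.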
Authors: Xuenan Xu, Xiaohang Xu, Zeyu Xie, Pingyue Zhang, Mengyue Wu, Kai Yu