Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation (2309.17352v2)
Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and LLMs. Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ the Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into the BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of the training data. During inference, we propose to employ nucleus sampling and a hybrid reranking algorithm, which have not been explored in AAC research. Combining these efforts, our model achieves a new state-of-the-art SPIDEr-FL score of 32.6 on the Clotho evaluation split and wins the 2023 DCASE AAC challenge.
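The two training-time components described in the abstract can be illustrated with short sketches. First, a minimal sketch of the auxiliary InfoNCE objective: pooled BEATs audio features are projected into the Instructor caption-embedding space and contrasted against in-batch negatives. The mean-pooling, the linear projection, and the temperature value are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(audio_feats: torch.Tensor,   # (B, T, D) frame features, e.g. from BEATs
                  text_embs: torch.Tensor,     # (B, E) caption embeddings, e.g. from Instructor
                  proj: torch.nn.Linear,       # assumed projection from audio dim D to text dim E
                  temperature: float = 0.07) -> torch.Tensor:
    # Mean-pool frames to one clip-level vector, project into the text
    # embedding space, and L2-normalize both modalities.
    audio = F.normalize(proj(audio_feats.mean(dim=1)), dim=-1)  # (B, E)
    text = F.normalize(text_embs, dim=-1)                       # (B, E)
    logits = audio @ text.t() / temperature                     # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched audio-caption pairs lie on the diagonal; the other captions in
    # the batch act as negatives (symmetric InfoNCE over both directions).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Second, a sketch of the LLM mix-up augmentation: two waveforms are mixed, and their captions are fused into one grammatical, compact sentence. Here `merge_captions_with_llm` is a hypothetical stand-in for the paper's ChatGPT call, given a trivial offline fallback so the sketch runs end to end; the mixing-weight range is likewise an assumption.

```python
import random
import torch

def merge_captions_with_llm(cap_a: str, cap_b: str) -> str:
    # Stand-in for the ChatGPT prompt that fuses two captions into one
    # compact, grammatical description of the mixed clip.
    return f"{cap_a.rstrip('.')}, while {cap_b[0].lower()}{cap_b[1:]}"

def llm_mixup(wav_a: torch.Tensor, wav_b: torch.Tensor,
              cap_a: str, cap_b: str):
    lam = random.uniform(0.3, 0.7)               # illustrative mixing weight
    n = min(wav_a.shape[-1], wav_b.shape[-1])    # truncate to a common length
    mixed_wav = lam * wav_a[..., :n] + (1.0 - lam) * wav_b[..., :n]
    return mixed_wav, merge_captions_with_llm(cap_a, cap_b)
```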
- K. Drossos, S. Adavanne and T. Virtanen, “Automated audio captioning with recurrent neural networks,” in Proc. WASPAA, 2017.
- K. Drossos, S. Lipping and T. Virtanen, “Clotho: an audio captioning dataset,” in Proc. ICASSP, 2020.
- “AudioCaps: Generating captions for audios in the wild,” in Proc. NAACL-HLT, 2019.
- “The DCASE 2021 challenge task 6 system: Automated audio captioning with weakly supervised pre-training and word selection methods,” Tech. Rep., DCASE Challenge, 2021.
- “The SJTU system for DCASE2022 challenge task 6: Audio captioning with audio-text retrieval pre-training,” Tech. Rep., DCASE Challenge, 2022.
- “Automated audio captioning with multi-task learning,” Tech. Rep., DCASE Challenge, 2022.
- “HYU submission for the DCASE 2023 task 6a: Automated audio captioning model using AL-MixGen and synonyms substitution,” Tech. Rep., DCASE Challenge, 2023.
- E. Labbé, T. Pellegrini and J. Pinquier, “IRIT-UPS DCASE 2023 audio captioning and retrieval system,” Tech. Rep., DCASE Challenge, 2023.
- “Leveraging state-of-the-art ASR techniques to audio captioning,” in Proc. DCASE, 2021.
- “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
- “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. ICASSP, 2022.
- “Masked autoencoders that listen,” in Proc. NeurIPS, 2022.
- “BEATs: Audio pre-training with acoustic tokenizers,” arXiv preprint arXiv:2212.09058, 2022.
- “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017.
- “Training language models to follow instructions with human feedback,” in Proc. NeurIPS, 2022.
- “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- “One embedder, any task: Instruction-finetuned text embeddings,” in Findings of ACL, 2023.
- “Introducing ChatGPT,” OpenAI Blog, 2022.
- A. van den Oord, Y. Li and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
- “Convolution-augmented transformer for semi-supervised sound event detection,” in Proc. DCASE, 2020.
- “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019.
- “mixup: Beyond empirical risk minimization,” in Proc. ICLR, 2018.
- “Efficient audio captioning transformer with patchout and text guidance,” Tech. Rep., DCASE Challenge, 2022.
- “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
- “The curious case of neural text degeneration,” in Proc. ICLR, 2020.
- “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proc. ACL, 2020.
- “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, 2019.
- “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, 2020.
- “MTEB: Massive text embedding benchmark,” arXiv preprint arXiv:2210.07316, 2022.
- Y. Gong, Y.-A. Chung and J. Glass, “PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- “Can audio captions be evaluated with image caption metrics?,” in Proc. ICASSP, 2022.
- “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, 2022.
- “Self-critical sequence training for image captioning,” in Proc. CVPR, 2017.
- “A simple framework for contrastive learning of visual representations,” in Proc. ICML, 2020.
- “An encoder-decoder based audio captioning system with transfer and reinforcement learning,” in Proc. DCASE, 2021.
- “CLAP: learning audio concepts from natural language supervision,” in Proc. ICASSP, 2023.
- “ImageBind: holistic AI learning across six modalities,” Meta AI Blog, 2023.
- G. O. dos Santos, E. L. Colombini and S. Avila, “CIDEr-R: Robust consensus-based image description evaluation,” in Proc. Workshop on Noisy User-generated Text, 2021.
- “CLIPScore: A reference-free evaluation metric for image captioning,” in Proc. EMNLP, 2021.
Authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe