Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information (2110.06100v1)

Published 12 Oct 2021 in cs.SD, cs.MM, and eess.AS

Abstract: Automated audio captioning (AAC) has developed rapidly in recent years, combining acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. Current models are generally based on the neural encoder-decoder architecture, and their decoders mainly use acoustic information extracted by a CNN-based encoder. However, these models ignore semantic information that could help an AAC model generate meaningful descriptions. This paper proposes a novel approach to automated audio captioning that incorporates both semantic and acoustic information. Specifically, our audio captioning model consists of two sub-modules. (1) The pre-trained keyword encoder initializes its parameters from a pre-trained ResNet38 and is then trained with extracted keywords as labels. (2) The multi-modal attention decoder is an LSTM-based decoder that contains semantic and acoustic attention modules. Experiments demonstrate that our proposed model achieves state-of-the-art performance on the Clotho dataset. Our code can be found at https://github.com/WangHelin1997/DCASE2021_Task6_PKU
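
The fusion idea in the abstract, a decoder that attends separately to acoustic features and to keyword (semantic) features, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention form, so dot-product attention and concatenation of the two contexts are assumptions, and the names `attend` and `fused_context` as well as the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product attention (an assumed form, not taken from the paper).

    Returns the weighted average of `items` and the attention weights.
    """
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [
        sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)
    ]
    return context, weights


def fused_context(hidden, acoustic_frames, keyword_embeddings):
    """Build one decoding step's context from both information sources.

    Concatenation is an assumption; the abstract only says the decoder
    contains semantic and acoustic attention modules.
    """
    acoustic_ctx, _ = attend(hidden, acoustic_frames)
    semantic_ctx, _ = attend(hidden, keyword_embeddings)
    return acoustic_ctx + semantic_ctx  # list concatenation


# Toy inputs: a decoder hidden state, three encoder frames, two keyword embeddings.
hidden = [1.0, 0.0]
acoustic_frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
keyword_embeddings = [[0.0, 1.0], [1.0, 1.0]]

ctx = fused_context(hidden, acoustic_frames, keyword_embeddings)
print(len(ctx))  # two acoustic dimensions followed by two semantic dimensions
```

In a full model, the fused context would feed an LSTM cell at each step together with the previous word embedding; that part is omitted here because the abstract gives no further detail.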

Authors (4)

Zhongjie Ye
Helin Wang
Dongchao Yang
Yuexian Zou

Citations (25)
