Can Audio Captions Be Evaluated with Image Caption Metrics? (2110.04684v2)

Published 10 Oct 2021 in cs.SD, cs.CL, and eess.AS

Abstract: Automated audio captioning aims at generating textual descriptions for an audio clip. To evaluate the quality of generated audio captions, previous works directly adopt image captioning metrics like SPICE and CIDEr without justifying their suitability in this new domain, which may mislead the development of advanced models. This problem has remained unstudied due to the lack of human judgment datasets on caption quality. Therefore, we first construct two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval. They are established with pairwise comparison instead of absolute rating to achieve better inter-annotator agreement. Current metrics are found to correlate poorly with human annotations on these datasets. To overcome their limitations, we propose a metric named FENSE, which combines the strength of Sentence-BERT in capturing similarity with a novel Error Detector that penalizes erroneous sentences for robustness. On the newly established benchmarks, FENSE outperforms current metrics by 14-25% accuracy. Code, data, and a web demo are available at: https://github.com/blmoistawinde/fense
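The abstract describes FENSE as Sentence-BERT similarity combined with an Error Detector that penalizes erroneous sentences. As a rough illustration only, the sketch below pairs a generic sentence-transformers encoder with a caller-supplied detector; the encoder checkpoint, penalty factor, and aggregation over references are assumptions, not the authors' published configuration (see the linked repository for the real implementation).

```python
# Minimal FENSE-style scoring sketch, assuming a generic Sentence-BERT
# checkpoint and a placeholder error detector; the paper's exact model,
# penalty, and reference aggregation may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not necessarily FENSE's

def fense_style_score(candidate: str, references: list[str],
                      is_erroneous, penalty: float = 0.9) -> float:
    """Score a candidate caption against reference captions.

    Cosine similarity of Sentence-BERT embeddings, scaled down by
    `penalty` when the error detector flags the candidate.
    """
    cand_emb = model.encode(candidate, convert_to_tensor=True)
    ref_embs = model.encode(references, convert_to_tensor=True)
    # Aggregate over references; averaging here is an assumption.
    sim = util.cos_sim(cand_emb, ref_embs).mean().item()
    if is_erroneous(candidate):   # e.g. a trained fluency-error classifier
        sim *= (1.0 - penalty)    # heavy penalty for erroneous sentences
    return sim

# Example usage with a trivial stand-in detector:
score = fense_style_score(
    "a dog is barking while a car passes by",
    ["a dog barks as a vehicle drives past"],
    is_erroneous=lambda s: False,
)
```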

Authors (6)
  1. Zelin Zhou (6 papers)
  2. Zhiling Zhang (12 papers)
  3. Xuenan Xu (29 papers)
  4. Zeyu Xie (14 papers)
  5. Mengyue Wu (57 papers)
  6. Kenny Q. Zhu (50 papers)
Citations (39)
