
ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds (2403.18572v1)

Published 27 Mar 2024 in cs.SD and eess.AS

Abstract: Automated Audio Captioning is a multimodal task that aims to convert audio content into natural language. The assessment of audio captioning systems is typically based on quantitative metrics applied to text data. Previous studies have employed metrics derived from machine translation and image captioning to evaluate the quality of generated audio captions. Drawing inspiration from auditory cognitive neuroscience research, we introduce a novel metric approach -- Audio Captioning Evaluation on Semantics of Sound (ACES). ACES takes into account how human listeners parse semantic information from sounds, providing a novel and comprehensive evaluation perspective for automated audio captioning systems. ACES combines semantic similarities and semantic entity labeling. ACES outperforms similar automated audio captioning metrics on the Clotho-Eval FENSE benchmark in two evaluation categories.
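
To make the abstract's description concrete, the sketch below shows one way a metric could combine the two components ACES is said to use: semantic similarity and semantic entity labeling. This is a hypothetical illustration, not the paper's implementation; the sentence-transformers encoder, the keyword-based entity tagger, the source/action/context label set, and the weighting `alpha` are all placeholder assumptions.

```python
# Hypothetical sketch of an ACES-style score (not the authors' implementation):
# combine (a) sentence-embedding similarity with (b) agreement of semantic
# entity labels between a candidate and a reference caption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in sentence encoder

def extract_entities(caption: str) -> set:
    # Placeholder tagger: a keyword lookup standing in for a learned
    # token-classification model that assigns sound-semantic labels.
    vocabulary = {
        "dog": "source", "animal": "source",
        "barks": "action", "yelps": "action",
        "park": "context", "outside": "context",
    }
    return {vocabulary[w] for w in caption.lower().split() if w in vocabulary}

def aces_like_score(candidate: str, reference: str, alpha: float = 0.5) -> float:
    # Semantic similarity: cosine similarity of sentence embeddings, clipped to [0, 1].
    emb = model.encode([candidate, reference], convert_to_tensor=True)
    similarity = max(0.0, float(util.cos_sim(emb[0], emb[1])))
    # Entity agreement: F1-style overlap of the extracted semantic labels.
    cand, ref = extract_entities(candidate), extract_entities(reference)
    f1 = 2 * len(cand & ref) / (len(cand) + len(ref)) if (cand or ref) else 0.0
    # Weighted combination; the actual ACES weighting and components may differ.
    return alpha * similarity + (1.0 - alpha) * f1

if __name__ == "__main__":
    print(aces_like_score("a dog barks in the park", "an animal yelps outside"))
```

In this toy example the two captions share no words, yet their entity labels agree fully, which reflects the intuition the abstract points to: judging captions on the semantic information listeners parse from sounds rather than on surface wording alone.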
