
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention (2210.16428v3)

Published 28 Oct 2022 in eess.AS, cs.AI, cs.MM, and cs.SD

Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
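The abstract describes fusing audio features with video features through an adaptive attention mechanism. As a rough illustration only (the paper's actual module is not specified here), the sketch below shows one plausible shape for such a fusion: audio frames attend over visual frames via scaled dot-product attention, and a per-frame sigmoid gate adaptively decides how much visual context to mix in. All function and weight names (`adaptive_av_attention`, `w_gate`, etc.) are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_av_attention(audio, visual, wq, wk, wv, w_gate):
    """Hypothetical audio-visual fusion sketch (not the paper's exact module).

    audio:  (Ta, d) audio feature sequence
    visual: (Tv, d) visual feature sequence
    wq, wk, wv: (d, d) projection matrices; w_gate: (2*d, 1) gate projection
    Returns fused audio-visual features of shape (Ta, d).
    """
    q = audio @ wq                          # audio frames as queries
    k = visual @ wk                         # visual frames as keys
    v = visual @ wv                         # visual frames as values
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    visual_ctx = softmax(scores) @ v        # audio-aligned visual context
    # Adaptive gate: per audio frame, decide how much visual context
    # to blend in, based on both the audio frame and its visual context.
    gate_in = np.concatenate([audio, visual_ctx], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))   # (Ta, 1), in (0, 1)
    return gate * visual_ctx + (1.0 - gate) * audio
```

The gate lets frames with ambiguous sounds lean on the visual stream while acoustically unambiguous frames pass through mostly unchanged, which matches the abstract's motivation of resolving similar-sounding objects with visual cues.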

Authors (13)
  1. Xubo Liu (66 papers)
  2. Qiushi Huang (23 papers)
  3. Xinhao Mei (24 papers)
  4. Haohe Liu (59 papers)
  5. Qiuqiang Kong (86 papers)
  6. Jianyuan Sun (11 papers)
  7. Shengchen Li (21 papers)
  8. Tom Ko (31 papers)
  9. Yu Zhang (1400 papers)
  10. Lilian H. Tang (1 paper)
  11. Mark D. Plumbley (114 papers)
  12. Volkan Kılıç (8 papers)
  13. Wenwu Wang (148 papers)
Citations (17)
