
Learning to Answer Questions in Dynamic Audio-Visual Scenarios (2203.14072v2)

Published 26 Mar 2022 in cs.CV

Abstract: In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that our dataset has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/
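The core idea of question-guided spatio-temporal grounding can be illustrated with a minimal sketch: a question embedding attends over per-frame audio and visual features, and the attended summaries are fused before answer prediction. This is an illustrative toy example in NumPy, not the authors' actual network; the shapes, fusion scheme, and function names are assumptions for exposition only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(question, features):
    """Attend over T per-frame feature vectors using the question as query.

    question: (d,) question embedding
    features: (T, d) per-frame audio or visual features
    Returns the attended summary (d,) and attention weights (T,).
    """
    d = question.shape[-1]
    scores = features @ question / np.sqrt(d)  # (T,) similarity scores
    weights = softmax(scores)                  # temporal attention weights
    return weights @ features, weights

# Toy inputs: 10 frames, 64-dim features (random stand-ins for real encoders)
rng = np.random.default_rng(0)
T, d = 10, 64
audio_feats = rng.standard_normal((T, d))
visual_feats = rng.standard_normal((T, d))
question_emb = rng.standard_normal(d)

a_ctx, a_w = question_guided_attention(question_emb, audio_feats)
v_ctx, v_w = question_guided_attention(question_emb, visual_feats)

# Simple late fusion of the two attended modalities; a real model would
# feed this into an answer classifier over the candidate answer set.
fused = np.concatenate([a_ctx, v_ctx])  # (2d,)
```

The attention weights show which frames each modality considers relevant to the question, which is the intuition behind temporal grounding in AVQA.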

Authors (6)
  1. Guangyao Li (37 papers)
  2. Yake Wei (15 papers)
  3. Yapeng Tian (80 papers)
  4. Chenliang Xu (114 papers)
  5. Ji-Rong Wen (299 papers)
  6. Di Hu (88 papers)
Citations (100)