
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios (2305.12397v2)

Published 21 May 2023 in cs.CV

Abstract: Audio-visual question answering (AVQA) is a challenging task that requires multi-step spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while incorrectly treating audio and video as separate entities for temporal grounding. This paper proposes a new target-aware joint spatio-temporal grounding network for AVQA. It consists of two key components: the target-aware spatial grounding module (TSG) and the single-stream joint audio-visual temporal grounding module (JTG). The TSG focuses on audio-visual cues relevant to the query subject by utilizing explicit semantics from the question. Unlike previous two-stream temporal grounding modules that require an additional audio-visual fusion module, JTG incorporates audio-visual fusion and question-aware temporal grounding into one module with a simpler single-stream architecture. Temporal synchronization between audio and video in the JTG is facilitated by our proposed cross-modal synchrony loss (CSL). Extensive experiments verify the effectiveness of our proposed method over existing state-of-the-art methods.
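
The abstract does not give the exact formulation of the cross-modal synchrony loss, so the following is only a minimal PyTorch-style sketch of one plausible reading: question-conditioned temporal attention over audio segments is encouraged to agree with the corresponding attention over video segments via a symmetric KL divergence. The function name, tensor shapes, and the choice of divergence are illustrative assumptions, not the paper's definition.

# Hedged sketch of a cross-modal synchrony loss (CSL).
# All names, shapes, and the divergence choice are assumptions for illustration.
import torch
import torch.nn.functional as F

def cross_modal_synchrony_loss(audio_feats, video_feats, question_emb):
    """Encourage question-aware temporal attention over audio and video
    segments to agree, as a proxy for audio-visual synchronization.

    audio_feats:  (B, T, D) segment-level audio features
    video_feats:  (B, T, D) segment-level visual features
    question_emb: (B, D)    pooled question embedding
    """
    d = audio_feats.size(-1)
    q = question_emb.unsqueeze(1)                              # (B, 1, D)
    # Question-conditioned temporal attention logits per modality.
    logits_a = (audio_feats * q).sum(-1) / d ** 0.5            # (B, T)
    logits_v = (video_feats * q).sum(-1) / d ** 0.5            # (B, T)
    log_a = F.log_softmax(logits_a, dim=-1)
    log_v = F.log_softmax(logits_v, dim=-1)
    # Symmetric KL divergence between the two temporal distributions.
    kl_av = F.kl_div(log_a, log_v.exp(), reduction="batchmean")
    kl_va = F.kl_div(log_v, log_a.exp(), reduction="batchmean")
    return 0.5 * (kl_av + kl_va)

if __name__ == "__main__":
    B, T, D = 2, 10, 512
    loss = cross_modal_synchrony_loss(
        torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, D)
    )
    print(loss.item())

Under this reading, the loss acts as the synchronization signal the abstract attributes to CSL, while the single-stream JTG would handle fusion and question-aware temporal grounding jointly; the paper itself should be consulted for the actual formulation.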

Authors (2)
  1. Yuanyuan Jiang (8 papers)
  2. Jianqin Yin (53 papers)
Citations (6)