Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios (2305.12397v2)
Abstract: Audio-visual question answering (AVQA) is a challenging task that requires multi-step spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate, target-agnostic parsing of audio-visual scenes for spatial grounding, while treating audio and video as separate entities for temporal grounding. This paper proposes a new target-aware joint spatio-temporal grounding network for AVQA. It consists of two key components: a target-aware spatial grounding module (TSG) and a single-stream joint audio-visual temporal grounding module (JTG). The TSG focuses on the audio-visual cues relevant to the query subject by exploiting explicit semantics from the question. Unlike previous two-stream temporal grounding modules, which require an additional audio-visual fusion module, the JTG incorporates audio-visual fusion and question-aware temporal grounding into a single module with a simpler single-stream architecture. Temporal synchronization between audio and video within the JTG is enforced by our proposed cross-modal synchrony loss (CSL). Extensive experiments verify the effectiveness of the proposed method over existing state-of-the-art methods.
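The abstract names the TSG, JTG, and CSL components but gives no implementation detail. Below is a minimal PyTorch sketch of how a question-guided spatial attention and a synchrony loss between audio- and video-derived temporal attention distributions could be realized. All function names, tensor shapes, and the symmetric-KL formulation are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch (assumptions throughout): question-guided spatial attention over
# visual patches, plus a synchrony loss that aligns audio- and video-derived
# temporal attention distributions. Shapes and the KL-based loss are illustrative.
import torch
import torch.nn.functional as F


def target_aware_spatial_attention(patch_feats, question_emb):
    """Weight visual patch features by their similarity to the question embedding.

    patch_feats:  (B, T, P, D) per-frame patch features
    question_emb: (B, D)       pooled question representation
    returns:      (B, T, D)    question-attended frame features
    """
    q = question_emb[:, None, None, :]                                 # (B, 1, 1, D)
    scores = (patch_feats * q).sum(-1) / patch_feats.shape[-1] ** 0.5  # (B, T, P)
    attn = scores.softmax(dim=-1)                                      # attention over patches
    return (attn.unsqueeze(-1) * patch_feats).sum(dim=2)               # (B, T, D)


def cross_modal_synchrony_loss(audio_feats, video_feats, question_emb):
    """Encourage audio and video to attend to the same time steps for a question.

    audio_feats, video_feats: (B, T, D); question_emb: (B, D)
    Implemented here as a symmetric KL between the two question-conditioned
    temporal attention distributions (one possible instantiation of a CSL).
    """
    q = question_emb.unsqueeze(1)                                      # (B, 1, D)
    a_attn = F.log_softmax((audio_feats * q).sum(-1), dim=-1)          # (B, T)
    v_attn = F.log_softmax((video_feats * q).sum(-1), dim=-1)          # (B, T)
    kl_av = F.kl_div(a_attn, v_attn, log_target=True, reduction="batchmean")
    kl_va = F.kl_div(v_attn, a_attn, log_target=True, reduction="batchmean")
    return 0.5 * (kl_av + kl_va)


if __name__ == "__main__":
    B, T, P, D = 2, 10, 14 * 14, 512
    patches = torch.randn(B, T, P, D)
    audio = torch.randn(B, T, D)
    question = torch.randn(B, D)
    frames = target_aware_spatial_attention(patches, question)
    loss = cross_modal_synchrony_loss(audio, frames, question)
    print(frames.shape, loss.item())
```

A symmetric divergence is used here only to illustrate the idea of pulling the two modalities' temporal attention toward one another; the paper's CSL may be defined differently.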
Authors: Yuanyuan Jiang, Jianqin Yin