Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering (2404.12020v3)
Abstract: Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task that requires intelligent systems to accurately answer natural language questions based on paired audio-video input. However, prevalent AVQA approaches are prone to overlearning dataset biases, which results in poor robustness, and current datasets may not provide a precise diagnostic of these methods. To tackle these challenges, we first propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions in the test split of a public dataset (MUSIC-AVQA) and then introducing distribution shifts to the split questions. The former yields a large, diverse test space, while the latter enables a comprehensive robustness evaluation on rare, frequent, and overall questions. Second, we propose a robust architecture that employs a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, with a notable improvement of 9.32%. Extensive ablation experiments on both datasets analyze the effectiveness of each component of the debiasing strategy. We also highlight the limited robustness of existing multi-modal QA methods through evaluation on our dataset, and we combine various baselines with the proposed strategy on both datasets to verify its plug-and-play capability. Our dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.
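The rare/frequent evaluation hinges on an answer-distribution shift between head and tail questions. As a rough illustration of that idea only (not the paper's exact procedure), the sketch below partitions test questions into "frequent" and "rare" subsets by answer frequency within each question type; the field names (`question_type`, `answer`) and the 80% head-mass cut-off are assumptions for the example.

```python
from collections import Counter, defaultdict

def split_by_answer_frequency(samples, head_ratio=0.8):
    """Partition QA samples into 'frequent' (head) and 'rare' (tail) subsets
    per question type, based on answer frequency. `head_ratio` and the field
    names are illustrative assumptions, not the dataset's exact rule."""
    # Count how often each answer appears for each question type.
    counts = defaultdict(Counter)
    for s in samples:
        counts[s["question_type"]][s["answer"]] += 1

    # For each question type, mark the answers that cover the top
    # `head_ratio` of the probability mass as "frequent" answers.
    head_answers = {}
    for qtype, counter in counts.items():
        total = sum(counter.values())
        covered, head = 0, set()
        for ans, n in counter.most_common():
            if covered / total < head_ratio:
                head.add(ans)
            covered += n
        head_answers[qtype] = head

    frequent = [s for s in samples if s["answer"] in head_answers[s["question_type"]]]
    rare = [s for s in samples if s["answer"] not in head_answers[s["question_type"]]]
    return frequent, rare
```

Given a list of test samples, `frequent, rare = split_by_answer_frequency(test_samples)` yields the two evaluation subsets, and the "overall" setting is simply their union.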
Authors: Jie Ma, Min Hu, Pinghui Wang, Wangchun Sun, Lingyun Song, Hongbin Pei, Jun Liu, Youtian Du