Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering (2404.12020v3)

Published 18 Apr 2024 in cs.CV

Abstract: Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%. Extensive ablation experiments are conducted on the two datasets mentioned to analyze the component effectiveness within the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset. We also conduct experiments combining various baselines with our proposed strategy on two datasets to verify its plug-and-play capability. Our dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.
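The abstract describes the debiasing strategy only at a high level. As a rough illustration of how a uni-modal bias branch can be attached to a fused audio-visual-question model in a plug-and-play fashion, the sketch below uses a RUBi-style question-only gating scheme in PyTorch. All module names, dimensions, and the specific gating and loss choices are assumptions for illustration; this is not the paper's multifaceted cycle collaborative debiasing implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasAwareAVQAHead(nn.Module):
    """Hypothetical sketch: a fused audio-visual-question classifier paired with a
    question-only bias branch. The bias branch's sigmoid scores gate the fused
    logits during training, discouraging reliance on question-only shortcuts."""

    def __init__(self, fused_dim: int, question_dim: int, num_answers: int):
        super().__init__()
        self.fused_classifier = nn.Linear(fused_dim, num_answers)
        self.question_only_classifier = nn.Linear(question_dim, num_answers)

    def forward(self, fused_feat: torch.Tensor, question_feat: torch.Tensor):
        fused_logits = self.fused_classifier(fused_feat)
        # Detach so the bias branch does not update the main question encoder.
        bias_logits = self.question_only_classifier(question_feat.detach())
        # Gate fused predictions by question-only confidence, so examples that
        # are answerable from the question alone contribute smaller gradients.
        gated_logits = fused_logits * torch.sigmoid(bias_logits)
        return fused_logits, bias_logits, gated_logits


def debiasing_loss(gated_logits, bias_logits, answers):
    # Main task loss on the gated prediction, plus an auxiliary loss that lets
    # the bias branch capture question-only regularities on its own.
    return F.cross_entropy(gated_logits, answers) + F.cross_entropy(bias_logits, answers)
```

Because the head only consumes pooled fused and question features, a scheme like this can in principle be bolted onto different AVQA backbones, which is the sense in which the abstract's plug-and-play claim would apply; the actual strategy evaluated in the paper differs in its details.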

Authors (8)
  1. Jie Ma (205 papers)
  2. Min Hu (18 papers)
  3. Pinghui Wang (49 papers)
  4. Wangchun Sun (2 papers)
  5. Lingyun Song (2 papers)
  6. Hongbin Pei (8 papers)
  7. Jun Liu (606 papers)
  8. Youtian Du (5 papers)

