
Visual Grounding Methods for VQA are Working for the Wrong Reasons! (2004.05704v4)

Published 12 Apr 2020 in cs.CV, cs.AI, and cs.CL

Abstract: Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
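
The abstract does not spell out the proposed regularization scheme, so the PyTorch-style sketch below is purely illustrative: it shows one plausible form of annotation-free regularization in the spirit the abstract describes, pairing each question with a randomly shuffled (likely irrelevant) image and penalizing confident answers on the mismatched pair. All names here (`training_step`, `model`, `reg_weight`) are hypothetical placeholders, not the authors' published method or API.

```python
# Illustrative sketch only: the mismatched-image + uniform-distribution loss
# below is an assumption in the spirit of the abstract, not the authors'
# exact regularization scheme.
import torch
import torch.nn.functional as F

def training_step(model, images, questions, answers, reg_weight=1.0):
    """One VQA training step with a simple annotation-free regularizer.

    `model(images, questions)` is assumed to return answer logits;
    `answers` is a LongTensor of answer-class indices.
    """
    # Standard supervised VQA loss on matched image-question pairs.
    logits = model(images, questions)
    task_loss = F.cross_entropy(logits, answers)

    # Pair each question with a randomly shuffled image from the batch and
    # push the answer distribution toward uniform, discouraging the model
    # from answering on linguistic priors alone.
    perm = torch.randperm(images.size(0), device=images.device)
    mismatched_logits = model(images[perm], questions)
    log_probs = F.log_softmax(mismatched_logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    reg_loss = F.kl_div(log_probs, uniform, reduction="batchmean")

    return task_loss + reg_weight * reg_loss
```

Pushing the answer distribution on mismatched pairs toward uniform is one common way to implement such a regularizer; `reg_weight` would trade off in-distribution accuracy against robustness to linguistic priors.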

Authors (3)
  1. Robik Shrestha (14 papers)
  2. Kushal Kafle (22 papers)
  3. Christopher Kanan (72 papers)
Citations (36)
