Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering (2305.14882v2)
Abstract: Recent advances in multimodal LLMs have proven highly effective at visual question answering (VQA). However, the end-to-end design of these models prevents them from being interpretable to humans, undermining trust and applicability in critical domains. While post-hoc rationales offer some insight into model behavior, such explanations are not guaranteed to be faithful to the model. In this paper, we address these shortcomings by introducing an interpretable-by-design model that factors its decisions into intermediate, human-legible explanations, allowing people to easily understand why the model fails or succeeds. We propose the Dynamic Clue Bottleneck Model (DCLUB), a method designed toward an inherently interpretable VQA system. DCLUB provides an explainable intermediate space before the VQA decision and is faithful from the outset, while maintaining performance comparable to black-box systems. Given a question, DCLUB first returns a set of visual clues: natural-language statements of visually salient evidence from the image. It then generates the output based solely on these visual clues. To supervise and evaluate the generation of VQA explanations within DCLUB, we collect a dataset of 1.7k reasoning-focused questions annotated with visual clues. Evaluations show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions while preserving 99.43% of the performance on VQA-v2.
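The two-stage factorization described in the abstract can be illustrated with a minimal sketch. The interfaces below (`clue_model.generate`, `reasoner.generate`) are hypothetical placeholders rather than the authors' implementation; the point is only that the final answer is conditioned exclusively on the natural-language clues, which is what makes the intermediate explanation faithful by construction.

```python
# Minimal sketch of a DCLUB-style two-stage pipeline (hypothetical interfaces,
# not the authors' released code). Stage 1 produces human-legible visual clues;
# Stage 2 answers from the clues alone, so the clues act as the bottleneck.

from dataclasses import dataclass
from typing import List


@dataclass
class ClueBottleneckOutput:
    clues: List[str]   # natural-language statements of visually salient evidence
    answer: str        # final answer, conditioned only on the clues


def generate_clues(image, question: str, clue_model) -> List[str]:
    """Stage 1: a vision-language model maps (image, question) to visual clues.

    `clue_model` is a placeholder for any VLM fine-tuned to emit short
    declarative statements about the image that are relevant to the question.
    """
    prompt = f"List the visual evidence in the image relevant to: {question}"
    return clue_model.generate(image=image, prompt=prompt)


def answer_from_clues(question: str, clues: List[str], reasoner) -> str:
    """Stage 2: a text-only reasoner answers from the question and clues.

    The image is not passed here, so the clues are the sole interface between
    perception and the decision.
    """
    context = "\n".join(f"- {c}" for c in clues)
    prompt = f"Visual clues:\n{context}\nQuestion: {question}\nAnswer:"
    return reasoner.generate(prompt=prompt)


def dclub_vqa(image, question: str, clue_model, reasoner) -> ClueBottleneckOutput:
    clues = generate_clues(image, question, clue_model)
    answer = answer_from_clues(question, clues, reasoner)
    return ClueBottleneckOutput(clues=clues, answer=answer)
```

In this sketch, inspecting `ClueBottleneckOutput.clues` shows exactly the evidence the answer was based on, so a wrong answer can be traced to either a missing or incorrect clue (a perception error) or to faulty reasoning over correct clues.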
Authors: Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth