CommVQA: Situating Visual Question Answering in Communicative Contexts (2402.15002v2)
Abstract: Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask depend on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario and description. CommVQA, which contains 1,000 images and 8,949 question-answer pairs, poses a challenge for current models. Error analyses and a human-subjects study suggest that generated answers still contain high rates of hallucination, fail to appropriately address unanswerable questions, and do not suitably reflect contextual information. Overall, we show that access to contextual information is essential for solving CommVQA: conditioning on context yields the highest-performing VQA model and highlights the relevance of situating systems within communicative scenarios.
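Each CommVQA example pairs an image with a description, a communicative scenario, and scenario-conditioned question-answer pairs. The sketch below shows one way such a record might be represented; the field names and all values are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for a CommVQA example; field names are
# assumptions for illustration, not the dataset's actual schema.
@dataclass
class QAPair:
    question: str  # follow-up question conditioned on scenario + description
    answer: str    # answer grounded in the image (may be unanswerable)

@dataclass
class CommVQAExample:
    image_path: str
    description: str   # image description shown to question askers
    scenario: str      # communicative context, e.g., "travel website"
    qa_pairs: List[QAPair] = field(default_factory=list)

# Illustrative instance (values invented for the example)
example = CommVQAExample(
    image_path="images/0001.jpg",
    description="A cobblestone street lined with outdoor cafes at dusk.",
    scenario="travel website",
    qa_pairs=[
        QAPair(
            question="Are the cafes open in the evening?",
            answer="Yes, several cafes have lit signs and seated patrons.",
        )
    ],
)
```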