CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments (2403.03203v1)
Abstract: The integration of learning and reasoning is high on the research agenda in AI. Nevertheless, little attention has been paid to using existing background knowledge to reason about partially observed scenes and answer questions about them. Yet we as humans frequently use such knowledge to infer plausible answers to visual questions, by eliminating all inconsistent ones. Such knowledge often comes in the form of constraints about objects, and it tends to be highly domain- or environment-specific. We contribute a novel benchmark called CLEVR-POC for reasoning-intensive visual question answering (VQA) in partially observable environments under constraints. In CLEVR-POC, knowledge in the form of logical constraints must be leveraged to generate plausible answers to questions about a hidden object in a given partial scene. For instance, if one knows that all cups are colored either red, green, or blue and that there is only one green cup, it becomes possible to deduce the color of an occluded cup as either red or blue, provided that all other cups, including the green one, are observed. Through experiments, we observe that the low performance on CLEVR-POC of a pre-trained vision-language model like CLIP (~22%) and a large language model like GPT-4 (~46%) confirms the need for frameworks that can handle reasoning-intensive tasks in which environment-specific background knowledge is available and crucial. Furthermore, we demonstrate that a neuro-symbolic model, which integrates an LLM like GPT-4 with a visual perception network and a formal logical reasoner, performs exceptionally well on CLEVR-POC.
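The cup example in the abstract can be sketched as constraint elimination: start from all colors the constraints allow, then discard any color that would violate a constraint given the observed cups. The function name and constraint set below are illustrative assumptions, not part of the CLEVR-POC benchmark itself.

```python
# Hypothetical sketch of the abstract's cup example.
# Assumed constraints: every cup is red, green, or blue, and the scene
# contains exactly one green cup. All cups except one are observed.

ALLOWED_COLORS = {"red", "green", "blue"}  # constraint: permitted cup colors

def plausible_colors(observed_cup_colors):
    """Return the colors the occluded cup could still have, after
    eliminating any color inconsistent with the constraints."""
    candidates = set(ALLOWED_COLORS)
    if "green" in observed_cup_colors:
        # The single permitted green cup is already visible,
        # so the hidden cup cannot be green.
        candidates.discard("green")
    return candidates

# The green cup is among the observed ones, so only red and blue remain.
print(sorted(plausible_colors(["green", "red", "red"])))  # -> ['blue', 'red']
```

In the paper this role is played by a formal logical reasoner over declarative constraints rather than hand-written Python, but the principle is the same: the plausible answer set is whatever survives elimination.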
- Savitha Sam Abraham and Marjan Alirezaie. 2022. Compositional generalization and neuro-symbolic architectures. In Combining Learning and Reasoning: Programming Languages, Formalisms, and Representations.
- Vqa: Visual question answering. In IEEE international conference on computer vision, pages 2425–2433.
- Blender Online Community. 2018. Blender—a 3D modelling and rendering package. Blender Foundation.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Embodied question answering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–10.
- Visual dialog. In IEEE conference on computer vision and pattern recognition, pages 326–335.
- Guess what?! visual object discovery through multi-modal dialogue. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512.
- Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. arXiv preprint arXiv:1905.06088.
- Detectron.
- Yaoshiang Ho and Samuel Wookey. 2019. The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE access, 8:4806–4813.
- Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.
- Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE conference on computer vision and pattern recognition, pages 2901–2910.
- Incheol Kim. 2020. Visual experience-based question answering with complex multimodal environments. Mathematical Problems in Engineering, 2020.
- Clevr-dialog: A diagnostic dataset for multi-round reasoning in visual dialog. arXiv preprint arXiv:1903.03166.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
- Vladimir Lifschitz. 2008. What is answer set programming? In 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, page 1594–1597. AAAI Press.
- Hugh MacColl. 1897. Symbolic reasoning. Mind, 6(24):493–510.
- Dissociating language and thought in large language models: a cognitive perspective. arXiv preprint arXiv:2301.06627.
- Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems, 27.
- Deepproblog: Neural probabilistic logic programming. Advances in neural information processing systems, 31.
- The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584.
- From statistical relational to neural symbolic artificial intelligence: a survey. arXiv preprint arXiv:2108.11451.
- Unified questioner transformer for descriptive question generation in goal-oriented visual dialogue. In IEEE/CVF International Conference on Computer Vision, pages 1898–1907.
- Rationality in human nonmonotonic inference.
- OpenAI. 2023. Gpt-4 technical report.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Luciano Serafini and Artur d’Avila Garcez. 2016. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422.
- Navigation with large language models: Semantic guesswork as a heuristic for planning. In 7th Annual Conference on Robot Learning.
- Kvqa: Knowledge-aware visual question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8876–8884.
- Fvqa: Fact-based visual question answering. IEEE transactions on pattern analysis and machine intelligence, 40(10):2413–2427.
- Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40.
- A semantic loss function for deep learning with symbolic knowledge. In International conference on machine learning, pages 5502–5511. PMLR.
- Neurasp: Embracing neural networks into answer set programming. In 29th International Joint Conference on Artificial Intelligence (IJCAI 2020).
- React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Advances in neural information processing systems, 31.
- Navgpt: Explicit reasoning in vision-and-language navigation with large language models. arXiv preprint arXiv:2305.16986.
- Yeyun Zou and Qiyu Xie. 2020. A survey on vqa: Datasets and approaches. In 2020 2nd International Conference on Information Technology and Computer Application (ITCA), pages 289–297. IEEE.
- Savitha Sam Abraham (8 papers)
- Marjan Alirezaie (5 papers)
- Luc De Raedt (55 papers)