VCD: Knowledge Base Guided Visual Commonsense Discovery in Images (2402.17213v1)
Abstract: Visual commonsense comprises knowledge about object properties, relationships, and behaviors in visual data. Discovering visual commonsense can provide a more comprehensive and richer understanding of images and enhance the reasoning and decision-making capabilities of computer vision systems. However, the visual commonsense defined in existing visual commonsense discovery studies is coarse-grained and incomplete. In this work, we draw inspiration from ConceptNet, a commonsense knowledge base from natural language processing, and systematically define the types of visual commonsense. Based on this, we introduce a new task, Visual Commonsense Discovery (VCD), which aims to extract fine-grained commonsense of different types for the different objects in an image. We accordingly construct a dataset (VCDD) from Visual Genome and ConceptNet for VCD, featuring over 100,000 images and 14 million object-commonsense pairs. We furthermore propose a generative model (VCDM) that integrates a vision-language model with instruction tuning to tackle VCD. Automatic and human evaluations demonstrate VCDM's proficiency in VCD, particularly outperforming GPT-4V in implicit commonsense discovery. The value of VCD is further demonstrated by its application to two downstream tasks: visual commonsense evaluation and visual question answering. The data and code will be made available on GitHub.
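The abstract describes VCDD's object-commonsense pairs as ConceptNet-style triples tied to objects in images. A minimal sketch of that pairing in Python (the field names and example entries are illustrative assumptions, not the released dataset's actual schema):

```python
from dataclasses import dataclass

# Hypothetical representation of an object-commonsense pair; the real
# VCDD schema may differ — this only illustrates the triple structure
# (object, ConceptNet-style relation, commonsense statement).
@dataclass(frozen=True)
class CommonsensePair:
    obj: str       # object grounded in the image, e.g. "umbrella"
    relation: str  # ConceptNet relation type, e.g. "UsedFor"
    tail: str      # commonsense statement about the object

pairs = [
    CommonsensePair("umbrella", "UsedFor", "keeping dry in the rain"),
    CommonsensePair("umbrella", "AtLocation", "a closet"),
    CommonsensePair("dog", "CapableOf", "barking"),
]

def by_relation(pairs, relation):
    """Return the pairs whose relation matches the requested type."""
    return [p for p in pairs if p.relation == relation]

print([p.tail for p in by_relation(pairs, "UsedFor")])
# prints ['keeping dry in the rain']
```

Grouping pairs by relation type like this mirrors how the paper distinguishes fine-grained commonsense types (e.g. function vs. location vs. capability) for each object.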
- Is BERT blind? Exploring the effect of vision-and-language pretraining on visual language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6778–6788, 2023.
- METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
- VLMo: Unified vision-language pre-training with mixture-of-modality-experts. In Advances in Neural Information Processing Systems, pages 32897–32912. Curran Associates, Inc., 2022.
- Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images, 2023.
- COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy, 2019. Association for Computational Linguistics.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, pages 1877–1901. Curran Associates, Inc., 2020.
- Mining semantic affordances of visual object categories. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4259–4267, 2015.
- Shikra: Unleashing multimodal LLM's referential dialogue magic. CoRR, abs/2306.15195, 2023.
- NEIL: Extracting visual knowledge from web data. In 2013 IEEE International Conference on Computer Vision, pages 1409–1416, 2013.
- Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, pages 904–915. ACM, 2022.
- UNITER: Universal image-text representation learning. In Computer Vision – ECCV 2020, pages 104–120, Cham, 2020. Springer International Publishing.
- Acquiring common sense spatial knowledge through implicit spatial templates. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
- Grounding consistency: Distilling spatial common sense for precise visual relationship detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15911–15920, 2021.
- Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031, 2021.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- Mental imagery and the comprehension-monitoring performance of fourth- and fifth-grade poor readers. Reading Research Quarterly, 21(4):454–464, 1986.
- AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia, 2018. Association for Computational Linguistics.
- Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, page 25–30, New York, NY, USA, 2013. Association for Computing Machinery.
- Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- The abduction of sherlock holmes: A dataset for visual abductive reasoning. In Computer Vision – ECCV 2022, pages 558–575, Cham, 2022. Springer Nature Switzerland.
- spaCy: Industrial-strength Natural Language Processing in Python. 2020.
- Fast contextual scene graph generation with unbiased context augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6302–6311, 2023.
- Devil’s on the edges: Selective quad attention for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18664–18674, 2023.
- The reviewing of object files: Object-specific integration of information. Cognitive Psychology, 24(2):175–219, 1992.
- ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
- IS-GGT: Iterative scene graph generation with generative transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6292–6301, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision – ECCV 2020, pages 121–137, Cham, 2020. Springer International Publishing.
- Visual abductive reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15565–15575, 2022.
- Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, 2004. Association for Computational Linguistics.
- SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
- Things not written in text: Exploring spatial commonsense from visual signals. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2365–2376, Dublin, Ireland, 2022. Association for Computational Linguistics.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland, 2022. Association for Computational Linguistics.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023a.
- OpenAI. GPT-4V(ision) technical work and authors. 2023b.
- The World of an Octopus: How Reporting Bias Influences a Language Model’s Perception of Color. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 823–835, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, 2002. Association for Computational Linguistics.
- VisualCOMET: Reasoning about the dynamic context of a still image. In Computer Vision – ECCV 2020, pages 508–524, Cham, 2020. Springer International Publishing.
- Kosmos-2: Grounding multimodal large language models to the world. CoRR, abs/2306.14824, 2023.
- ATOMIC: An atlas of machine commonsense for if-then reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3027–3035, 2019.
- Structured query-based image retrieval using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020.
- Dense-ATOMIC: Towards densely-connected ATOMIC with high knowledge coverage and massive multi-hop paths. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13292–13305. Association for Computational Linguistics, 2023.
- Do neural language models overcome reporting bias? In Proceedings of the 28th International Conference on Computational Linguistics, pages 6863–6870, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics.
- ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4444–4451. AAAI Press, 2017.
- Elizabeth S. Spelke. Principles of object perception. Cognitive Science, 14(1):29–56, 1990.
- A new quantitative quality measure for machine translation systems. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics, 1992.
- MOSS: Training conversational language models from synthetic data. 2023.
- ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11888–11898, 2023.
- Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Intrinsic physical concepts discovery with object-centric predictive models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23252–23261, 2023.
- Some beginnings of word comprehension in 6-month-olds. Psychological Science, 10:172–175, 1999.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b.
- Learning common sense through visual abstraction. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2542–2550, 2015.
- A-CAP: Anticipation captioning with commonsense knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10824–10833, 2023.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Proceedings of the 39th International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
- The all-seeing project: Towards panoptic visual recognition and understanding of the open world. CoRR, abs/2308.01907, 2023a.
- VQA-GNN: Reasoning with multimodal knowledge via graph neural networks for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21582–21592, 2023b.
- Symbolic knowledge distillation: from general language models to commonsense models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4602–4625, Seattle, United States, 2022. Association for Computational Linguistics.
- ImageNetVC: Zero- and few-shot visual commonsense evaluation on 1000 ImageNet categories, 2023.
- Imagine, reason and write: Visual storytelling with graph knowledge and relational reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4):3022–3029, 2021.
- Automatic extraction of commonsense LocatedNear knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 96–101, Melbourne, Australia, 2018. Association for Computational Linguistics.
- MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445–11465, Toronto, Canada, 2023. Association for Computational Linguistics.
- Visual distant supervision for scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15816–15826, 2021.
- Visually grounded commonsense knowledge acquisition. Proceedings of the AAAI Conference on Artificial Intelligence, 37(5):6583–6592, 2023.
- Stating the obvious: Extracting visual common sense knowledge. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 193–198, San Diego, California, 2016. Association for Computational Linguistics.
- Improving commonsense in vision-language models via knowledge graph riddles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2634–2645, 2023.
- Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
- From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019a.
- PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2040–2050, Online, 2021. Association for Computational Linguistics.
- Visual commonsense in pretrained unimodal and multimodal models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5321–5335, Seattle, United States, 2022. Association for Computational Linguistics.
- VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588, 2021.
- GPT4RoI: Instruction tuning large language model on region-of-interest. CoRR, abs/2307.03601, 2023a.
- Toward multi-granularity decision-making: Explicit visual reasoning with hierarchical knowledge. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2573–2583, 2023b.
- ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning. CoRR, abs/2307.09474, 2023.
- RegionBLIP: A unified multi-modal pre-training framework for holistic and regional comprehension. CoRR, abs/2308.02299, 2023.
- Evaluating commonsense in pre-trained language models. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):9733–9740, 2020.
- Visualize before you write: Imagination-guided open-ended text generation. In Findings of the Association for Computational Linguistics: EACL 2023, pages 78–92, Dubrovnik, Croatia, 2023a. Association for Computational Linguistics.
- Personality-aware human-centric multimodal reasoning: A new task. arXiv preprint arXiv:2304.02313, 2023b.
Authors: Xiangqing Shen, Yurun Song, Siwei Wu, Rui Xia