ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
Abstract: Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded in the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that cannot be obtained from the image content alone. However, generating focused questions using textual constraints while enforcing high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method that uses a dual contrastive objective to discriminate questions generated from both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show a preference for ConVQG questions over non-contrastive baselines.
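To make the dual contrastive objective concrete, the sketch below shows one plausible reading of it as two InfoNCE-style terms: the embedding of the question generated from both modalities is pulled toward a reference question and pushed away from questions generated from the image alone and from the text alone. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the use of a reference-question embedding `q_ref` as the positive, the temperature `tau`, and the weights `alpha`/`beta` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """Standard InfoNCE: pull each anchor toward its positive and push it
    away from a set of negatives. Shapes: anchor/positive (B, D), negatives (K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True) / tau  # (B, 1)
    neg_sim = anchor @ negatives.T / tau                           # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1)                  # (B, 1 + K)
    # The positive always sits at column 0 of the logits.
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)

def dual_contrastive_loss(q_both, q_img_only, q_txt_only, q_ref,
                          tau=0.07, alpha=1.0, beta=1.0):
    """Hypothetical dual objective: the embedding of the question generated
    from BOTH modalities (q_both) is contrasted against questions generated
    from the image alone (q_img_only) and from the text alone (q_txt_only),
    with the reference-question embedding (q_ref) acting as the positive."""
    loss_img = info_nce(q_both, q_ref, q_img_only, tau)  # discriminate from image-only
    loss_txt = info_nce(q_both, q_ref, q_txt_only, tau)  # discriminate from text-only
    return alpha * loss_img + beta * loss_txt
```

Minimizing both terms simultaneously rewards questions whose content is explained only by the combination of image and text, which matches the stated goal of enforcing both forms of grounding at once.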