Object Attribute Matters in Visual Question Answering (2401.09442v1)
Abstract: Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, yet this has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problems such as counting questions. The improved object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill this knowledge into attributes through a contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method.
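To make the two modules more concrete, the sketch below gives a minimal PyTorch rendering of (i) attribute-visual fusion via message passing over a graph of detected objects and (ii) an InfoNCE-style contrastive loss that distills knowledge embeddings into attribute representations. All layer names, dimensions, the GRU-based node update, and the InfoNCE formulation are illustrative assumptions based on the abstract, not the paper's exact architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeFusion(nn.Module):
    """Sketch: fuse object attribute embeddings into region visual features
    with one round of message passing over a soft object graph."""
    def __init__(self, vis_dim=2048, attr_dim=300, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)     # project region features
        self.attr_proj = nn.Linear(attr_dim, hid_dim)   # project attribute embeddings
        self.msg = nn.Linear(2 * hid_dim, hid_dim)      # message from node j to node i
        self.update = nn.GRUCell(hid_dim, hid_dim)      # node update after aggregation

    def forward(self, vis_feats, attr_embs, adj):
        # vis_feats: (N, vis_dim) detected-object features
        # attr_embs: (N, attr_dim) attribute embeddings for the same objects
        # adj:       (N, N) soft adjacency weights over object pairs
        nodes = self.vis_proj(vis_feats) + self.attr_proj(attr_embs)
        N = nodes.size(0)
        # Pairwise messages, weighted and summed with the adjacency matrix.
        pairs = torch.cat([nodes.unsqueeze(1).expand(N, N, -1),
                           nodes.unsqueeze(0).expand(N, N, -1)], dim=-1)
        messages = torch.tanh(self.msg(pairs))           # (N, N, hid_dim)
        agg = (adj.unsqueeze(-1) * messages).sum(dim=1)  # (N, hid_dim)
        return self.update(agg, nodes)                   # attribute-enhanced features


def contrastive_distill_loss(attr_repr, knowledge_repr, temperature=0.07):
    """Sketch: pull each attribute representation toward its paired knowledge
    embedding and push it away from the other pairs in the batch."""
    a = F.normalize(attr_repr, dim=-1)
    k = F.normalize(knowledge_repr, dim=-1)
    logits = a @ k.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```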