
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA (2312.13594v1)

Published 21 Dec 2023 in cs.CL, cs.AI, and cs.CV

Abstract: Natural language explanation in visual question answering (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences, increasing users' trust in black-box systems. Existing post-hoc methods have made significant progress toward producing plausible explanations. However, such post-hoc explanations are not always aligned with human logical inference, and suffer from three issues: 1) deductive unsatisfiability, where the generated explanations do not logically lead to the answer; 2) factual inconsistency, where the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) semantic perturbation insensitivity, where the model cannot recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of model-generated explanations. To address these issues, we propose a novel self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces of explanations with those of the visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and a case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks.
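The abstract describes the mechanism but shows no implementation, so the following is a minimal, hypothetical PyTorch sketch of one such contrastive objective: an InfoNCE-style loss that pulls a factual explanation embedding toward the joint visual-question-answer embedding and pushes counterfactual embeddings away. All function names, tensor shapes, and the temperature value are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss (illustrative sketch, not the authors' implementation).

    anchor:    (B, D)    joint visual-question-answer embedding
    positive:  (B, D)    embedding of the factual explanation
    negatives: (B, K, D) embeddings of counterfactual / perturbed samples
    """
    # Cosine similarity via L2-normalized dot products
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    # The factual positive sits at index 0 of each row of logits
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

# Assumed usage: one such term per level (semantic, image, instance),
# added to the explanation-generation loss during training.
```

Under this reading, the three levels differ only in how the positive and negative samples are constructed; the loss form itself can stay the same across levels.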

Authors (6)
  1. Chengen Lai
  2. Shengli Song
  3. Shiqi Meng
  4. Jingyang Li
  5. Sitong Yan
  6. Guangneng Hu