KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning (2401.12863v1)
Abstract: LLMs have demonstrated impressive performance on natural language processing tasks by leveraging chain-of-thought (CoT) prompting, which enables step-by-step reasoning. Extending LLMs with multimodal capabilities is an active area of research, but it incurs computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT, a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding, reducing hallucinations and enhancing answer quality. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show that KAM-CoT outperforms state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18 percentage points and GPT-4 (83.99%) by 10 percentage points. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness.
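The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: stage 1 generates a rationale grounded in the fused question, image, and KG evidence; stage 2 produces the answer conditioned on that rationale. All function names and the string-based "features" are hypothetical placeholders standing in for the actual encoders and the fine-tuned language model.

```python
# Hedged sketch of KAM-CoT's two-stage inference, assuming a simplified
# string-based fusion; real models would use learned encoders instead.

def encode_inputs(question, image_caption, kg_triples):
    """Fuse text, vision, and KG evidence into one context (placeholder)."""
    kg_context = "; ".join(f"{h} {r} {t}" for h, r, t in kg_triples)
    return f"Q: {question} | Image: {image_caption} | KG: {kg_context}"

def generate_rationale(context):
    """Stage 1: produce a step-by-step rationale (stub for the LM call)."""
    return f"Step-by-step reasoning over [{context}]."

def generate_answer(context, rationale):
    """Stage 2: answer conditioned on context and rationale (stub)."""
    return f"Answer derived from: {rationale}"

def kam_cot_infer(question, image_caption, kg_triples):
    context = encode_inputs(question, image_caption, kg_triples)
    rationale = generate_rationale(context)      # stage 1
    return generate_answer(context, rationale)   # stage 2
```

The key design point this mirrors is that the KG context enters *before* rationale generation, so external knowledge shapes the reasoning chain rather than only the final answer.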
Authors: Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao