Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

(arXiv: 2312.17661)
Published Dec 29, 2023 in cs.CL, cs.AI, and cs.CV

Abstract

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly impacted both academic and industrial realms. These models enhance Large Language Models (LLMs) with advanced visual understanding capabilities, facilitating their application in a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks. However, this assessment, based on a limited dataset (i.e., HellaSWAG), does not fully capture Gemini's authentic commonsense reasoning potential. To address this gap, our study undertakes a thorough evaluation of Gemini's performance in complex reasoning tasks that necessitate the integration of commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks. This includes 11 datasets focused solely on language, as well as one that incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. Additionally, we identify common challenges faced by current LLMs and MLLMs in addressing commonsense problems, underscoring the need for further advancements in enhancing the commonsense reasoning abilities of these models.
In short: GPT-4 Turbo leads in commonsense reasoning, while Gemini Pro outperforms GPT-3.5 Turbo except on social ethics tasks.

Overview

  • Study evaluates Gemini, a multimodal large language model, focusing on its commonsense reasoning abilities.

  • Commonsense reasoning tested includes general, contextual, temporal, physical, numerical, social, and moral domains.

  • Gemini was compared against LLMs and MLLMs using various datasets and tasks, employing different prompting techniques.

  • Results show Gemini's competitive performance, but with difficulties in temporal and social reasoning and in discerning emotions in images.

  • The paper highlights the progress and current limitations of AI commonsense reasoning, with insights for future AI improvements.

Introduction

The study presents an evaluation of a cutting-edge multimodal large language model (MLLM) known as Gemini, focusing in particular on its commonsense reasoning abilities. Commonsense reasoning is a core cognitive skill that humans use daily to make sense of both ordinary situations and complex tasks, and it has proven difficult to replicate in NLP systems. The central aim of the research is to evaluate Gemini's performance extensively and to highlight the common challenges that current LLMs and MLLMs face in commonsense tasks.

Commonsense Overview

Commonsense reasoning spans an array of domains, including general, contextual, temporal, physical, and numerical understanding, as well as social interactions and moral judgments. These aspects cover intuitive human understanding, predicting scenarios based on cause and effect, recognizing social cues, reasoning about ethics, and interpreting visual information. AI systems must navigate these domains effectively to mirror human understanding and interaction.
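To make these domains concrete, the sketch below pairs each one with a short invented illustration; these items are hypothetical examples, not drawn from the paper's evaluation datasets.

```python
# Invented one-line illustrations of each commonsense domain discussed above;
# these are hypothetical examples, not items from the paper's datasets.
COMMONSENSE_DOMAINS = {
    "general":    "Leftover soup is usually stored in a refrigerator.",
    "contextual": "Someone grabbing an umbrella likely expects rain.",
    "temporal":   "Breakfast is eaten before lunch, not after.",
    "physical":   "A glass dropped on concrete will probably shatter.",
    "numerical":  "A bicycle has two wheels.",
    "social":     "Interrupting a speaker mid-sentence is usually seen as rude.",
    "moral":      "Returning a lost wallet is widely judged the right thing to do.",
    "visual":     "A person on a ladder beside a tree is probably picking fruit.",
}
```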

Experimental Setup

The empirical study evaluates Gemini against twelve datasets. Four popular LLMs are assessed on language-based tasks, while two MLLMs are examined on multimodal tasks. The tasks span domains of commonsense reasoning such as general, specialized, social, ethical, and visual understanding, with accuracy used as the performance metric across all datasets. Different setups, such as zero-shot standard prompting and few-shot chain-of-thought prompting, are employed to probe both the inherent and the enhanced commonsense capabilities of the models.
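As a concrete illustration of these two setups, the minimal sketch below shows how a zero-shot standard prompt and a few-shot chain-of-thought prompt might be assembled for a multiple-choice commonsense item, and how accuracy is computed. The prompt wording and function names are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of the two prompting setups described above; the prompt
# wording and helper names are illustrative, not the paper's templates.

def zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Zero-shot standard prompting: the question and options alone."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Question: {question}\n{options}\n"
        "Answer with the letter of the best option."
    )

def few_shot_cot_prompt(examples: list[tuple[str, str, str]],
                        question: str, choices: list[str]) -> str:
    """Few-shot chain-of-thought: worked rationales precede the test item,
    eliciting step-by-step reasoning before the final answer."""
    shots = "\n\n".join(
        f"Question: {q}\nReasoning: {r}\nAnswer: {a}" for q, r, a in examples
    )
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"{shots}\n\nQuestion: {question}\n{options}\n"
        "Reasoning: Let's think step by step."
    )

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """The single metric used across all twelve datasets."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

In practice, each prompt would be sent to the model under evaluation (e.g., Gemini Pro or GPT-4 Turbo), and the predicted option letter compared against the gold answer to compute accuracy.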

Results and Limitations

Findings reveal that Gemini Pro's performance is comparable to that of GPT-3.5 Turbo, outperforming it marginally on language-based commonsense reasoning tasks while lagging behind GPT-4 Turbo. Although it demonstrates a solid understanding of most tested domains, Gemini struggles with temporal and social reasoning, as well as with discerning emotion in images. Despite notable logical reasoning ability, it often misreads context. The study acknowledges limitations: the language and dataset scope may not cover every facet of commonsense, and the findings are tied to model capabilities that continue to evolve.

Discussion

This comprehensive assessment indicates significant progress in AI's ability to reason with commonsense knowledge, yet it also shows that the nuanced, context-dependent nature of human reasoning remains a tough challenge. Multimodal reasoning, which combines visual cues with language understanding, is still notably difficult. The detailed examination of performance across diverse datasets provides valuable insight into the strengths and weaknesses of current LLMs and MLLMs, suggesting a path forward toward more natural and robust AI comprehension and interaction.

