Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

(arXiv: 2312.17661)
Published Dec 29, 2023 in cs.CL, cs.AI, and cs.CV

Abstract

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly impacted both academic and industrial realms. These models enhance Large Language Models (LLMs) with advanced visual understanding capabilities, facilitating their application in a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks. However, this assessment, based on a limited dataset (i.e., HellaSWAG), does not fully capture Gemini's authentic commonsense reasoning potential. To address this gap, our study undertakes a thorough evaluation of Gemini's performance in complex reasoning tasks that necessitate the integration of commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks. This includes 11 datasets focused solely on language, as well as one that incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. Additionally, we identify common challenges faced by current LLMs and MLLMs in addressing commonsense problems, underscoring the need for further advancements in enhancing the commonsense reasoning abilities of these models.
In short: GPT-4 Turbo leads in commonsense reasoning, while Gemini Pro outperforms GPT-3.5 Turbo except on social ethics tasks.

Overview

  • Study evaluates Gemini, a multimodal large language model, focusing on its commonsense reasoning abilities.

  • Commonsense reasoning tested includes general, contextual, temporal, physical, numerical, social, and moral domains.

  • Gemini was compared against LLMs and MLLMs using various datasets and tasks, employing different prompting techniques.

  • Results show Gemini's competitive performance, but with difficulties in temporal and social reasoning and in discerning emotions in images.

  • The paper highlights the progress and current limitations of AI commonsense reasoning, with insights for future AI improvements.

Introduction

The study presents an evaluation of a cutting-edge multimodal large language model (MLLM) known as Gemini, focusing in particular on its commonsense reasoning abilities. Commonsense reasoning is a core cognitive skill that humans use daily to make sense of both ordinary situations and complex tasks, and it has proven difficult to replicate in NLP systems. The central aim of the research is to evaluate Gemini's performance extensively and to highlight the common challenges that current LLMs and MLLMs face in commonsense tasks.

Commonsense Overview

Commonsense reasoning spans an array of domains, including general, contextual, temporal, physical, and numerical understanding, as well as social interactions and moral judgments. These aspects cover intuitive human understanding, predicting scenarios based on cause and effect, recognizing social cues, reasoning about ethics, and interpreting visual information. AI systems must navigate these domains effectively to mirror human understanding and interaction.
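To make these domains concrete, the sketch below pairs each one with a short invented illustration; these items are hypothetical examples, not drawn from the paper's evaluation datasets.

```python
# Invented one-line illustrations of each commonsense domain discussed above;
# these are hypothetical examples, not items from the paper's datasets.
COMMONSENSE_DOMAINS = {
    "general":    "Leftover soup is usually stored in a refrigerator.",
    "contextual": "Someone grabbing an umbrella likely expects rain.",
    "temporal":   "Breakfast is eaten before lunch, not after.",
    "physical":   "A glass dropped on concrete will probably shatter.",
    "numerical":  "A bicycle has two wheels.",
    "social":     "Interrupting a speaker mid-sentence is usually seen as rude.",
    "moral":      "Returning a lost wallet is widely judged the right thing to do.",
    "visual":     "A person on a ladder beside a tree is probably picking fruit.",
}
```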

Experimental Setup

The empirical study evaluates Gemini against twelve datasets. Four popular LLMs are assessed on language-based tasks, while two MLLMs are examined on multimodal tasks. The tasks span domains of commonsense reasoning such as general, specialized, social, ethical, and visual understanding, with accuracy used as the performance metric across all datasets. Different setups, such as zero-shot standard prompting and few-shot chain-of-thought prompting, are employed to probe both the inherent and the enhanced commonsense capabilities of the models.
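As a concrete illustration of these two setups, the minimal sketch below shows how a zero-shot standard prompt and a few-shot chain-of-thought prompt might be assembled for a multiple-choice commonsense item, and how accuracy is computed. The prompt wording and function names are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of the two prompting setups described above; the prompt
# wording and helper names are illustrative, not the paper's templates.

def zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Zero-shot standard prompting: the question and options alone."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Question: {question}\n{options}\n"
        "Answer with the letter of the best option."
    )

def few_shot_cot_prompt(examples: list[tuple[str, str, str]],
                        question: str, choices: list[str]) -> str:
    """Few-shot chain-of-thought: worked rationales precede the test item,
    eliciting step-by-step reasoning before the final answer."""
    shots = "\n\n".join(
        f"Question: {q}\nReasoning: {r}\nAnswer: {a}" for q, r, a in examples
    )
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"{shots}\n\nQuestion: {question}\n{options}\n"
        "Reasoning: Let's think step by step."
    )

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """The single metric used across all twelve datasets."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

In practice, each prompt would be sent to the model under evaluation (e.g., Gemini Pro or GPT-4 Turbo), and the predicted option letter compared against the gold answer to compute accuracy.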

Results and Limitations

Findings reveal that Gemini Pro's performance is comparable to that of GPT-3.5 Turbo, outperforming it marginally on language-based commonsense reasoning tasks while lagging behind GPT-4 Turbo. Although it demonstrates a solid understanding of most tested domains, Gemini struggles with temporal and social reasoning, as well as with discerning emotion in images. Despite notable logical reasoning ability, it often misreads context. The study acknowledges limitations: the language and dataset scope may not cover every facet of commonsense, and the findings are tied to model capabilities that continue to evolve.

Discussion

This comprehensive assessment indicates significant progress in AI's ability to reason with commonsense knowledge, yet it also shows that the nuanced, context-dependent nature of human reasoning remains a tough challenge. Multimodal reasoning, which combines visual cues with language understanding, is still notably difficult. The detailed examination of performance across diverse datasets provides valuable insight into the strengths and weaknesses of current LLMs and MLLMs, suggesting a path forward toward more natural and robust AI comprehension and interaction.

