
An In-depth Look at Gemini's Language Abilities (2312.11444v2)

Published 18 Dec 2023 in cs.CL and cs.AI

Abstract: The recently released Google Gemini class of models are the first to comprehensively report results that rival the OpenAI GPT series across a wide variety of tasks. In this paper, we do an in-depth exploration of Gemini's language abilities, making two contributions. First, we provide a third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results. Second, we take a closer look at the results, identifying areas where one of the two model classes excels. We perform this analysis over 10 datasets testing a variety of language abilities, including reasoning, answering knowledge-based questions, solving math problems, translating between languages, generating code, and acting as instruction-following agents. From this analysis, we find that Gemini Pro achieves accuracy that is close but slightly inferior to the corresponding GPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations for some of this under-performance, including failures in mathematical reasoning with many digits, sensitivity to multiple-choice answer ordering, aggressive content filtering, and others. We also identify areas where Gemini demonstrates comparably high performance, including generation into non-English languages, and handling longer and more complex reasoning chains. Code and data for reproduction can be found at https://github.com/neulab/gemini-benchmark

Introduction

Artificial intelligence researchers continually evaluate and compare LLMs to better understand their capabilities. The Google Gemini class of models has gained attention for reporting performance that rivals the existing OpenAI GPT series across a wide range of tasks. However, a comprehensive third-party analysis is necessary to understand these capabilities in detail. This paper provides an objective, third-party comparison of Google Gemini models with OpenAI's GPT models, while also contrasting the findings with a prominent open-source model, Mixtral.

Experimental Setup

In this paper, four models are evaluated: Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral. These models are analyzed across 10 datasets covering knowledge-based question answering, reasoning, mathematics, code generation, machine translation, and web agent interactions. The analysis is backed by a reproducible codebase and fully transparent results for a fair and objective comparison. All models were queried through a unified interface, and the evaluation used standardized prompts to ensure consistent conditions.
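
Querying every system through one interface with shared prompts and decoding parameters is what makes such a comparison reproducible. A minimal sketch of that setup, assuming a LiteLLM-style client; the model identifiers and prompt below are illustrative and not necessarily the paper's exact configuration:

```python
# Query several chat models through one interface so the prompt and decoding
# parameters stay identical across providers. Model identifiers are examples,
# not the paper's exact configuration.
from litellm import completion

MODELS = [
    "gpt-3.5-turbo",
    "gpt-4-turbo",
    "gemini/gemini-pro",
    "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1",
]

def ask(model: str, question: str) -> str:
    """Send the same standardized prompt to any backend and return its text reply."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.0,   # deterministic decoding for comparability
        max_tokens=512,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    question = "What is 17 * 23? Think step by step, then state the final answer."
    for model in MODELS:
        print(model, "->", ask(model, question))
```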

Results and Analysis

Across multiple tasks, Gemini Pro shows comparable but slightly lower accuracy than GPT 3.5 Turbo. It fell behind significantly when compared to GPT 4 Turbo, particularly on complex reasoning and on mathematical reasoning involving numbers with many digits. However, Gemini Pro demonstrated stronger performance in certain areas, such as multilingual translation for non-blocked responses and handling longer, more complex reasoning chains.
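
Accuracy on math and reasoning benchmarks in this kind of study is typically computed by extracting the final answer from a chain-of-thought response and comparing it against the reference. A simplified sketch of that scoring step; this is a common heuristic, not necessarily the paper's exact extraction logic:

```python
import re

def extract_final_number(response: str) -> str | None:
    """Return the last number mentioned in a chain-of-thought response,
    a common heuristic for math word problem benchmarks."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", response)
    if not numbers:
        return None
    return numbers[-1].replace(",", "").rstrip(".")

def exact_match_accuracy(responses: list[str], references: list[str]) -> float:
    """Exact-match accuracy between extracted answers and gold answers."""
    correct = sum(
        extract_final_number(resp) == ref.strip()
        for resp, ref in zip(responses, references)
    )
    return correct / len(references)

# A response whose reasoning chain ends at the right number counts as correct.
print(exact_match_accuracy(["17 * 23 = 391. The answer is 391."], ["391"]))  # 1.0
```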

Moreover, content filtering by Gemini Pro led to several failed responses, particularly for inputs deemed potentially sensitive or illegal. The paper also examines specific sub-tasks, analyzing the model's bias toward particular multiple-choice answer positions and its performance across varying levels of question complexity and chain-of-thought depth.
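
Sensitivity to answer ordering can be probed by re-asking the same question with the options rotated and checking whether the model's pick tracks the option content or its position. A hedged sketch of such a probe; `query_model` is a hypothetical stand-in for whichever API client is used:

```python
from collections import Counter

LETTERS = "ABCD"

def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with lettered options."""
    lines = [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    return f"{question}\n" + "\n".join(lines) + "\nAnswer with a single letter."

def ordering_sensitivity(question: str, options: list[str], query_model) -> Counter:
    """Rotate the answer options and count which option *content* gets picked.
    A position-insensitive model picks the same content under every rotation."""
    picks = Counter()
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        reply = query_model(build_prompt(question, rotated)).strip()  # hypothetical client
        if not reply:
            continue  # e.g. a blocked response
        letter = reply.strip("()")[0].upper()  # simplified parsing of replies like "B" or "(B)"
        if letter in LETTERS[: len(rotated)]:
            picks[rotated[LETTERS.index(letter)]] += 1
    return picks
```

If the counts concentrate on whichever option occupies a particular position rather than on one option's content, the model is exhibiting the ordering bias discussed above.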

Machine Translation and Web Agents

In machine translation, Google Gemini showed competitive results but did not outperform dedicated machine translation systems, especially when translating into English. Nonetheless, it provided strong translation quality for the language pairs where its responses were not blocked by content filtering, consistent with its comparatively strong generation into non-English languages.
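
Translation quality in such comparisons is typically reported with corpus-level metrics such as chrF and BLEU, computed only over the examples for which the model actually returned a translation. A minimal sketch using the sacrebleu package; treating empty outputs as blocked responses is an assumption here, not necessarily the paper's exact pipeline:

```python
import sacrebleu

def score_translations(hypotheses: list[str], references: list[str]) -> dict:
    """Score corpus-level chrF and BLEU, skipping blocked (empty) responses."""
    kept = [(h, r) for h, r in zip(hypotheses, references) if h.strip()]
    if not kept:
        return {"chrf": 0.0, "bleu": 0.0, "coverage": 0.0}
    hyps, refs = zip(*kept)
    return {
        "chrf": sacrebleu.corpus_chrf(list(hyps), [list(refs)]).score,
        "bleu": sacrebleu.corpus_bleu(list(hyps), [list(refs)]).score,
        # Fraction of examples that were not treated as blocked.
        "coverage": len(kept) / len(hypotheses),
    }
```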

For web navigation agents, Gemini Pro's success rate was comparable to GPT 3.5 Turbo's but lower than GPT 4 Turbo's. The model's performance varied across websites and tasks; it produced shorter responses and took fewer steps before reaching a conclusion, suggesting a tendency toward more direct action prediction.

Conclusion

The evaluation highlights that while Gemini Pro lags behind the most advanced of OpenAI's GPT models, it remains competitive and performs robustly on certain complex tasks. These findings suggest that Gemini Pro is a viable option for LLM applications. However, researchers and practitioners should keep in mind the limitations and variability of these findings, including dependence on specific prompts and generation parameters, and potential changes in the models and APIs over time.

Authors (9)
  1. Syeda Nahida Akter
  2. Zichun Yu
  3. Aashiq Muhamed
  4. Tianyue Ou
  5. Alex Bäuerle
  6. Ángel Alexander Cabrera
  7. Krish Dholakia
  8. Chenyan Xiong
  9. Graham Neubig