Introduction
Artificial intelligence researchers constantly evaluate and compare large language models (LLMs) to better understand their capabilities. Google's Gemini class of models has gained attention for delivering performance that rivals OpenAI's GPT series across a wide range of tasks. However, a comprehensive third-party analysis is necessary to understand these capabilities in detail. This paper provides an objective, third-party comparison of Google's Gemini models with OpenAI's GPT models, while also contrasting the findings with a prominent open-source model, Mixtral.
Experimental Setup
In this paper, four models are evaluated: Gemini Pro, GPT-3.5 Turbo, GPT-4 Turbo, and Mixtral. These models are analyzed across 10 datasets covering knowledge-based question answering, reasoning, mathematics, code generation, machine translation, and web agent interactions. The analysis is backed by a reproducible codebase and transparent results for a fair and objective comparison. All models were queried through a unified interface with standardized prompts, ensuring consistent conditions.
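As a concrete illustration, the querying loop might look like the minimal sketch below. It assumes a LiteLLM-style unified interface; the model identifiers, prompt, and generation parameters are illustrative rather than the paper's exact configuration.

```python
# Minimal sketch of a unified querying loop (illustrative, not the paper's
# exact harness). LiteLLM exposes multiple providers behind one
# OpenAI-style completion() call.
from litellm import completion

MODELS = ["gpt-3.5-turbo", "gpt-4-turbo-preview", "gemini/gemini-pro"]

def query(model: str, prompt: str) -> str:
    """Send one standardized prompt and return the model's reply text."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # fixed generation parameters for comparability
    )
    return response.choices[0].message.content

for model in MODELS:
    print(model, "->", query(model, "Name the capital of France."))
```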
Results and Analysis
Across multiple tasks, Gemini Pro shows comparable but slightly lower accuracy than GPT-3.5 Turbo. It fell significantly behind GPT-4 Turbo, particularly on complex reasoning and on mathematics problems involving numbers with many digits. However, Gemini Pro demonstrated stronger performance in certain areas, such as multilingual translation for responses that were not blocked and questions requiring longer chains of reasoning.
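To make the many-digit point concrete, one way to probe it is to generate arithmetic problems of increasing digit length and score exact-match accuracy. This is a sketch under assumptions; `query` is the hypothetical helper from the setup sketch above.

```python
import random
import re

def digit_accuracy(model: str, n_digits: int, trials: int = 20) -> float:
    """Exact-match accuracy on random n-digit multiplication problems."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        reply = query(model, f"Compute {a} * {b}. Reply with only the number.")
        digits = re.sub(r"\D", "", reply)  # strip commas and other formatting
        correct += digits == str(a * b)
    return correct / trials
```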
Moreover, content filtering by Gemini Pro led to several failed responses, particularly for inputs deemed potentially sensitive or illegal. The paper dives into specific sub-tasks, analyzing the model's bias toward the order of multiple-choice answer options and its performance as question complexity and chain-of-thought depth vary.
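As an illustration of how order bias can be probed (a hypothetical sketch, not the paper's actual harness), one can present the same question with its answer options rotated and check whether the model's choice tracks the position rather than the content:

```python
LABELS = "ABCD"

def order_bias_probe(model: str, question: str, options: list[str]) -> list[str]:
    """Ask the same question with options rotated; return the letter picked
    each time. Identical letters across rotations suggest position bias."""
    picks = []
    for k in range(len(options)):
        rotated = options[k:] + options[:k]
        body = "\n".join(f"{LABELS[i]}. {o}" for i, o in enumerate(rotated))
        prompt = f"{question}\n{body}\nAnswer with a single letter."
        picks.append(query(model, prompt).strip()[:1].upper())
    return picks
```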
Machine Translation and Web Agents
In machine translation, Google Gemini showed competitive results but did not outperform dedicated machine translation systems, especially when translating into English. Nonetheless, it provided strong translations when it committed to a response, a significant point for language pairs whose outputs were not blocked by content filtering.
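A minimal way to score translations while respecting the filtering caveat is to compute a metric only over pairs where the model actually responded. The sketch below assumes the sacrebleu package and marks blocked responses as None (an assumed convention, not the paper's):

```python
import sacrebleu

def chrf_on_unblocked(hypotheses: list[str | None], references: list[str]) -> float:
    """Compute chrF only over pairs where the model actually answered,
    skipping responses blocked by content filtering (marked here as None)."""
    pairs = [(h, r) for h, r in zip(hypotheses, references) if h is not None]
    hyps, refs = zip(*pairs)
    return sacrebleu.corpus_chrf(list(hyps), [list(refs)]).score
```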
For web navigation agents, Gemini Pro's success rate was comparable to GPT-3.5 Turbo's but lower than GPT-4 Turbo's. The model's performance varied across websites and tasks; it produced shorter responses and took fewer steps before reaching a conclusion, indicating a tendency toward more direct action prediction.
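The shorter-trajectory observation can be quantified from logged agent runs. Below is a hedged sketch that assumes each trajectory record carries hypothetical `success` and `steps` fields; the actual log schema would depend on the agent framework:

```python
from statistics import mean

def summarize_agent_runs(trajectories: list[dict]) -> dict:
    """Aggregate success rate and trajectory length from agent logs.
    Assumes each record has 'success' (bool) and 'steps' (list of actions)."""
    return {
        "success_rate": mean(t["success"] for t in trajectories),
        "mean_steps": mean(len(t["steps"]) for t in trajectories),
    }
```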
Conclusion
The evaluation highlights that while Gemini Pro lags behind the most advanced of OpenAI's GPT models, it remains competitive, performing robustly on certain complex tasks. These findings suggest that Google's Gemini Pro is a viable option for LLM applications. However, researchers and practitioners should keep in mind the limitations and variability of these findings, including their dependence on specific prompts and generation parameters, as well as potential changes in the models and APIs over time.