
An In-depth Look at Gemini's Language Abilities (2312.11444v2)

Published 18 Dec 2023 in cs.CL and cs.AI

Abstract: The recently released Google Gemini class of models are the first to comprehensively report results that rival the OpenAI GPT series across a wide variety of tasks. In this paper, we do an in-depth exploration of Gemini's language abilities, making two contributions. First, we provide a third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results. Second, we take a closer look at the results, identifying areas where one of the two model classes excels. We perform this analysis over 10 datasets testing a variety of language abilities, including reasoning, answering knowledge-based questions, solving math problems, translating between languages, generating code, and acting as instruction-following agents. From this analysis, we find that Gemini Pro achieves accuracy that is close but slightly inferior to the corresponding GPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations for some of this under-performance, including failures in mathematical reasoning with many digits, sensitivity to multiple-choice answer ordering, aggressive content filtering, and others. We also identify areas where Gemini demonstrates comparably high performance, including generation into non-English languages, and handling longer and more complex reasoning chains. Code and data for reproduction can be found at https://github.com/neulab/gemini-benchmark

Introduction

Artificial intelligence researchers continually evaluate and compare LLMs to better understand their capabilities. The Google Gemini class of models has gained attention for reporting performance that rivals the existing OpenAI GPT series across a wide range of tasks. However, a comprehensive third-party analysis is necessary to understand these capabilities in detail. This paper provides an objective, third-party comparison of Google Gemini models with OpenAI's GPT models, while also contrasting the findings with a prominent open-source model, Mixtral.

Experimental Setup

In this paper, four models are evaluated: Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral. These models are analyzed across 10 datasets covering knowledge-based question answering, reasoning, mathematics, code generation, machine translation, and web agent interactions. The analysis is backed by a reproducible codebase and fully transparent results for a fair and objective comparison. All models were queried through a unified interface, and the evaluation used standardized prompts to ensure consistent conditions.
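
Querying every system through one interface with shared prompts and decoding parameters is what makes such a comparison reproducible. A minimal sketch of that setup, assuming a LiteLLM-style client; the model identifiers and prompt below are illustrative and not necessarily the paper's exact configuration:

```python
# Query several chat models through one interface so the prompt and decoding
# parameters stay identical across providers. Model identifiers are examples,
# not the paper's exact configuration.
from litellm import completion

MODELS = [
    "gpt-3.5-turbo",
    "gpt-4-turbo",
    "gemini/gemini-pro",
    "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1",
]

def ask(model: str, question: str) -> str:
    """Send the same standardized prompt to any backend and return its text reply."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.0,   # deterministic decoding for comparability
        max_tokens=512,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    question = "What is 17 * 23? Think step by step, then state the final answer."
    for model in MODELS:
        print(model, "->", ask(model, question))
```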

Results and Analysis

Across multiple tasks, Gemini Pro shows comparable but slightly lower accuracy than GPT 3.5 Turbo. It fell behind significantly when compared to GPT 4 Turbo, particularly on complex reasoning and on mathematical reasoning involving numbers with many digits. However, Gemini Pro demonstrated stronger performance in certain areas, such as multilingual translation for non-blocked responses and handling longer, more complex reasoning chains.
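
Accuracy on math and reasoning benchmarks in this kind of study is typically computed by extracting the final answer from a chain-of-thought response and comparing it against the reference. A simplified sketch of that scoring step; this is a common heuristic, not necessarily the paper's exact extraction logic:

```python
import re

def extract_final_number(response: str) -> str | None:
    """Return the last number mentioned in a chain-of-thought response,
    a common heuristic for math word problem benchmarks."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", response)
    if not numbers:
        return None
    return numbers[-1].replace(",", "").rstrip(".")

def exact_match_accuracy(responses: list[str], references: list[str]) -> float:
    """Exact-match accuracy between extracted answers and gold answers."""
    correct = sum(
        extract_final_number(resp) == ref.strip()
        for resp, ref in zip(responses, references)
    )
    return correct / len(references)

# A response whose reasoning chain ends at the right number counts as correct.
print(exact_match_accuracy(["17 * 23 = 391. The answer is 391."], ["391"]))  # 1.0
```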

Moreover, content filtering by Gemini Pro led to several failed responses, particularly for inputs deemed potentially sensitive or illegal. The paper also examines specific sub-tasks, analyzing the model's bias toward particular multiple-choice answer positions and its performance across varying levels of question complexity and chain-of-thought depth.
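
Sensitivity to answer ordering can be probed by re-asking the same question with the options rotated and checking whether the model's pick tracks the option content or its position. A hedged sketch of such a probe; `query_model` is a hypothetical stand-in for whichever API client is used:

```python
from collections import Counter

LETTERS = "ABCD"

def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with lettered options."""
    lines = [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    return f"{question}\n" + "\n".join(lines) + "\nAnswer with a single letter."

def ordering_sensitivity(question: str, options: list[str], query_model) -> Counter:
    """Rotate the answer options and count which option *content* gets picked.
    A position-insensitive model picks the same content under every rotation."""
    picks = Counter()
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        reply = query_model(build_prompt(question, rotated)).strip()  # hypothetical client
        if not reply:
            continue  # e.g. a blocked response
        letter = reply.strip("()")[0].upper()  # simplified parsing of replies like "B" or "(B)"
        if letter in LETTERS[: len(rotated)]:
            picks[rotated[LETTERS.index(letter)]] += 1
    return picks
```

If the counts concentrate on whichever option occupies a particular position rather than on one option's content, the model is exhibiting the ordering bias discussed above.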

Machine Translation and Web Agents

In machine translation, Google Gemini showed competitive results but did not outperform dedicated machine translation systems, especially when translating into English. Nonetheless, it provided strong translation quality for the language pairs where its responses were not blocked by content filtering, consistent with its comparatively strong generation into non-English languages.
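
Translation quality in such comparisons is typically reported with corpus-level metrics such as chrF and BLEU, computed only over the examples for which the model actually returned a translation. A minimal sketch using the sacrebleu package; treating empty outputs as blocked responses is an assumption here, not necessarily the paper's exact pipeline:

```python
import sacrebleu

def score_translations(hypotheses: list[str], references: list[str]) -> dict:
    """Score corpus-level chrF and BLEU, skipping blocked (empty) responses."""
    kept = [(h, r) for h, r in zip(hypotheses, references) if h.strip()]
    if not kept:
        return {"chrf": 0.0, "bleu": 0.0, "coverage": 0.0}
    hyps, refs = zip(*kept)
    return {
        "chrf": sacrebleu.corpus_chrf(list(hyps), [list(refs)]).score,
        "bleu": sacrebleu.corpus_bleu(list(hyps), [list(refs)]).score,
        # Fraction of examples that were not treated as blocked.
        "coverage": len(kept) / len(hypotheses),
    }
```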

For web navigation agents, Gemini Pro's success rate was comparable to GPT 3.5 Turbo's but lower than GPT 4 Turbo's. The model's performance varied across websites and tasks; it produced shorter responses and took fewer steps before reaching a conclusion, suggesting a tendency toward more direct action prediction.

Conclusion

The evaluation highlights that while Gemini Pro lags behind the most advanced of OpenAI's GPT models, it remains competitive and performs robustly on certain complex tasks. These findings suggest that Gemini Pro is a viable option for LLM applications. However, researchers and practitioners should keep in mind the limitations and variability of these findings, including dependence on specific prompts and generation parameters, and potential changes in the models and APIs over time.

Authors (9)
  1. Syeda Nahida Akter
  2. Zichun Yu
  3. Aashiq Muhamed
  4. Tianyue Ou
  5. Alex Bäuerle
  6. Ángel Alexander Cabrera
  7. Krish Dholakia
  8. Chenyan Xiong
  9. Graham Neubig