
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (2402.14008v2)

Published 21 Feb 2024 in cs.CL

Abstract: Recent advancements have seen LLMs and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark's rigor and the intricacy of physical reasoning. Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for future AGI research endeavors. The data and evaluation code are available at https://github.com/OpenBMB/OlympiadBench

OlympiadBench: Elevating Benchmark Challenges in AI with Olympiad-Level Bilingual Multimodal Scientific Problems

Introduction of OlympiadBench

The rapid advancements in LLMs and Large Multimodal Models (LMMs) have necessitated the development of more rigorous assessment tools. OlympiadBench addresses this need with a benchmark of 8,476 problems drawn from Olympiad-level mathematics and physics competitions and the Chinese college entrance exam. The benchmark is distinct in its bilingual (English and Chinese) and multimodal design, with each problem accompanied by expert annotations for step-by-step reasoning. Notably, the best-performing model, GPT-4V, achieves an average score of only 17.97% on the benchmark, underscoring the difficulty OlympiadBench poses for physical reasoning and problem-solving.
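
For readers who want to explore the released data, the sketch below shows one way to load and inspect a local copy of the problems. The file name and field names used here (`question`, `solution`, `subject`, `language`) are assumptions for illustration only; consult the repository at https://github.com/OpenBMB/OlympiadBench for the actual data layout.

```python
import json
from collections import Counter

# Hypothetical local export of the benchmark; the real repository may ship
# the data split across several files with different field names.
with open("olympiadbench_problems.json", encoding="utf-8") as f:
    problems = json.load(f)

print(f"Loaded {len(problems)} problems")

# Tally problems by (assumed) subject and language fields.
by_subject = Counter(p.get("subject", "unknown") for p in problems)
by_language = Counter(p.get("language", "unknown") for p in problems)
print("By subject:", dict(by_subject))
print("By language:", dict(by_language))

# Inspect one problem together with its expert step-by-step solution.
example = problems[0]
print(example.get("question", "")[:300])
print(example.get("solution", "")[:300])
```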

Key Features of OlympiadBench

  • Comprehensive Problem Set: Comprising a vast collection of problems sourced from prestigious Olympiads and Chinese college entrance exams, OlympiadBench presents a diverse range of challenges designed to test the limits of current AI capabilities.
  • Expert Annotations: Every problem includes detailed annotations from experts, providing valuable insights into the reasoning processes required to solve complex scientific issues.
  • Bilingual and Multimodal Approach: By offering problems in both English and Chinese and incorporating multimodal data, OlympiadBench emphasizes the importance of versatility in language and medium for AI research.
  • Robust Evaluation Methodology: Utilizing a thorough assessment methodology, this benchmark accurately evaluates AI responses, highlighting prevalent issues such as hallucinations, knowledge omissions, and logical fallacies in AI-generated solutions (a sketch of the kind of answer matching involved follows this list).
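
As a rough illustration of the rule-based answer matching such an evaluation pipeline relies on, the sketch below compares a model's final answer against the reference answer, first numerically within a relative tolerance and then symbolically via SymPy. This is an illustrative approximation, not the benchmark's official scoring code.

```python
import sympy as sp

def answers_match(predicted: str, reference: str, rel_tol: float = 1e-3) -> bool:
    """Return True if a model's final answer agrees with the reference answer."""
    # First try a purely numeric comparison within a relative tolerance.
    try:
        p, r = float(predicted), float(reference)
        if r == 0:
            return abs(p) < rel_tol
        return abs(p - r) / abs(r) < rel_tol
    except ValueError:
        pass
    # Otherwise fall back to symbolic equivalence (e.g. "2*sqrt(2)" vs "sqrt(8)").
    try:
        diff = sp.simplify(sp.sympify(predicted) - sp.sympify(reference))
        return diff == 0
    except (sp.SympifyError, TypeError):
        # Last resort: compare normalized strings.
        return predicted.strip() == reference.strip()

print(answers_match("2*sqrt(2)", "sqrt(8)"))   # True (symbolic equivalence)
print(answers_match("3.1416", "3.14159"))      # True (numeric, within tolerance)
print(answers_match("42", "41"))               # False
```

A real scoring pipeline would also need to extract the final answer from the model's free-form solution before matching, which is where much of the practical difficulty lies.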

Challenges Highlighted by OlympiadBench

The findings from OlympiadBench highlight several pivotal challenges for AI models, particularly in solving physics problems and generating error-free reasoning. The benchmark's complexity is illustrated by the relatively low problem-solving success rates, which reveal significant gaps in AI capabilities compared to human experts. These challenges serve as a critical reminder of the considerable room for improvement and growth in the field of AI and AGI research.

Implications for Future Research

OlympiadBench sets a new precedent for the complexity and rigor of benchmarks in AI research. The benchmark not only challenges the AI research community to develop models that can tackle higher levels of scientific reasoning but also provides a novel dataset for training and testing next-generation AI systems. The bilingual and multimodal nature of the problems in OlympiadBench opens the door for exploration into new realms of AI capabilities, encouraging advancements in understanding and processing complex scientific texts and visuals in multiple languages.

Conclusion

OlympiadBench stands as a significant contribution to the field of AI, pushing the boundaries of what is considered a challenging benchmark. By placing a strong emphasis on Olympiad-level problems, this benchmark underscores the necessity for continuous innovation and development within AI research to reach and surpass human expert levels of problem-solving and reasoning. As the AI community strives towards the goal of achieving AGI, resources like OlympiadBench will be instrumental in benchmarking progress and guiding research efforts towards addressing the most daunting challenges in AI.

Authors (14)
  1. Chaoqun He
  2. Renjie Luo
  3. Yuzhuo Bai
  4. Shengding Hu
  5. Zhen Leng Thai
  6. Junhao Shen
  7. Jinyi Hu
  8. Xu Han
  9. Yujie Huang
  10. Yuxiang Zhang
  11. Jie Liu
  12. Lei Qi
  13. Zhiyuan Liu
  14. Maosong Sun