MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs (2409.02257v3)

Published 3 Sep 2024 in cs.CL and cs.LG

Abstract: Existing benchmarks for LLMs increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of six state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation codes at https://github.com/asgsaeid/mmlu-pro-plus.
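
To make the two metrics named in the abstract concrete, here is a minimal Python sketch of how they could be computed. The record fields (`chosen`, `single_correct`, `pair_option`) and the exact formulas are assumptions for illustration, not taken from the paper or its released evaluation code; the general idea is that each question has two individually correct options plus a combined option asserting both are correct, and a "shortcut" is picking one of the individually correct options instead of the combined one.

```python
# Hypothetical sketch of the shortcut selection ratio and correct pair
# identification ratio. Field names and definitions are illustrative and
# may differ from the official MMLU-Pro+ evaluation code.

from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    chosen: str                # option letter the model selected, e.g. "B"
    single_correct: List[str]  # the two options that are each correct on their own
    pair_option: str           # the option asserting that both answers are correct


def shortcut_selection_ratio(records: List[Record]) -> float:
    """Fraction of questions where the model picked one individually correct
    option (a shortcut) instead of the option covering both correct answers."""
    shortcuts = sum(1 for r in records if r.chosen in r.single_correct)
    return shortcuts / len(records) if records else 0.0


def correct_pair_identification_ratio(records: List[Record]) -> float:
    """Fraction of questions where the model selected the combined option."""
    hits = sum(1 for r in records if r.chosen == r.pair_option)
    return hits / len(records) if records else 0.0


if __name__ == "__main__":
    demo = [
        Record(chosen="B", single_correct=["A", "B"], pair_option="K"),  # shortcut taken
        Record(chosen="K", single_correct=["C", "F"], pair_option="K"),  # pair identified
    ]
    print(shortcut_selection_ratio(demo))            # 0.5
    print(correct_pair_identification_ratio(demo))   # 0.5
```

Under these assumed definitions, a high shortcut selection ratio with a low correct pair identification ratio would indicate a model that latches onto a single plausible answer rather than verifying all options, which is the behavior the benchmark is designed to expose.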
