MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs (2409.02257v3)
Abstract: Existing benchmarks for LLMs increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct-answer scenarios. We introduce novel metrics such as the shortcut selection ratio and the correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of six state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning ability and bias susceptibility. We release the dataset and evaluation code at \url{https://github.com/asgsaeid/mmlu-pro-plus}.
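The exact metric definitions live in the released evaluation code; as a rough illustration of the idea, the sketch below assumes that the correct pair identification ratio counts multi-answer questions where the model selects exactly the full correct set, and that the shortcut selection ratio counts those where it anchors on only the single answer carried over from MMLU-Pro. The field names (`correct`, `original`, `predicted`) and the formulas are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch only: the authoritative definitions are in the repo at
# https://github.com/asgsaeid/mmlu-pro-plus. Field names and formulas here are
# assumptions made for illustration.

from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Example:
    correct: FrozenSet[str]    # full set of correct option letters, e.g. {"B", "F"}
    original: str              # the single answer inherited from MMLU-Pro
    predicted: FrozenSet[str]  # option letters the model selected


def correct_pair_identification_ratio(examples: List[Example]) -> float:
    """Fraction of multi-answer questions where the model picks exactly the correct set."""
    multi = [ex for ex in examples if len(ex.correct) > 1]
    hits = sum(ex.predicted == ex.correct for ex in multi)
    return hits / len(multi) if multi else 0.0


def shortcut_selection_ratio(examples: List[Example]) -> float:
    """Fraction of multi-answer questions where the model falls back on only the
    original MMLU-Pro answer instead of the full correct set (an anchoring shortcut)."""
    multi = [ex for ex in examples if len(ex.correct) > 1]
    shortcuts = sum(ex.predicted == frozenset({ex.original}) for ex in multi)
    return shortcuts / len(multi) if multi else 0.0


if __name__ == "__main__":
    data = [
        Example(frozenset({"B", "F"}), "B", frozenset({"B", "F"})),  # correct pair found
        Example(frozenset({"A", "D"}), "A", frozenset({"A"})),       # anchored on original answer
    ]
    print(correct_pair_identification_ratio(data))  # 0.5
    print(shortcut_selection_ratio(data))           # 0.5
```

Under these assumptions, a model with a high shortcut selection ratio is exploiting the single-answer prior from MMLU-Pro rather than reasoning over all options, which is the behavior the benchmark is designed to expose.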