Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap (2402.19450v1)
Abstract: We propose a framework for robust evaluation of the reasoning capabilities of LLMs, using functional variants of benchmarks. A model that has truly solved a reasoning test should show no difference in performance between the static version of a problem and a snapshot of its functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant MATH(), with functionalization of other benchmarks to follow. When evaluating current state-of-the-art models over snapshots of MATH(), we find a reasoning gap -- the percentage difference between the static and functional accuracies. We find reasoning gaps from 58.35% to 80.31% among state-of-the-art closed and open-weight models that perform well on static benchmarks, with the caveat that the gaps are likely to be smaller under more sophisticated prompting strategies. We also show that models which anecdotally exhibit good reasoning performance on real-world tasks have quantifiably lower gaps, motivating the open problem of building "gap 0" models. The evaluation code and the new evaluation datasets (three MATH() snapshots) are publicly available at https://github.com/consequentai/fneval/.
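To make the metric concrete, here is a minimal sketch of how the reasoning gap might be computed, assuming it is the relative drop from static accuracy to the mean accuracy over functional snapshots. The function names, data layout, and numbers below are illustrative assumptions, not the fneval repository's actual API.

```python
# Minimal sketch of the reasoning-gap computation described in the abstract.
# Assumption: gap = 100 * (static_acc - functional_acc) / static_acc, where
# functional_acc averages accuracy over MATH() snapshots. All names and
# numbers are hypothetical, not taken from the fneval codebase.

from statistics import mean


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of problems answered correctly (exact-match grading)."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def reasoning_gap(static_acc: float, snapshot_accs: list[float]) -> float:
    """Percentage drop from static accuracy to mean functional accuracy."""
    functional_acc = mean(snapshot_accs)
    return 100.0 * (static_acc - functional_acc) / static_acc


if __name__ == "__main__":
    # Hypothetical example: a model scoring 0.72 on static MATH but averaging
    # 0.21 over three MATH() snapshots would have a gap of roughly 70.8%.
    print(f"gap = {reasoning_gap(0.72, [0.20, 0.22, 0.21]):.2f}%")
```

Under this reading, a "gap 0" model is simply one whose accuracy does not drop when the static problems are replaced by freshly instantiated functional variants.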