NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes (2312.14890v4)

Published 22 Dec 2023 in cs.AI, cs.CC, cs.CL, and cs.LG

Abstract: Complex reasoning ability is one of the most important features of current LLMs, and it plays an integral role in complex decision-making tasks. The investigation into the reasoning capabilities of LLMs is therefore critical, and numerous benchmarks have been established to assess them. However, current benchmarks are inadequate in offering a rigorous evaluation of the full extent of reasoning abilities that LLMs are capable of achieving. They are also prone to the risk of overfitting, as these benchmarks, being publicly accessible and static, allow models to potentially tailor their responses to specific benchmark metrics, thereby inflating their performance. Addressing these limitations, our research introduces a new benchmark, named NPHardEval. This benchmark is designed to evaluate the reasoning abilities of LLMs across a broad spectrum of 900 algorithmic questions, extending up to the NP-hard complexity class. These questions are meticulously chosen to represent a wide range of complexity classes below NP-hard, offering a rigorous measure of the reasoning ability of LLMs. Through this study, we shed light on the current state of reasoning in LLMs, providing an objective and rigorous perspective through the comparison of LLMs' performance across complexity classes. Moreover, this benchmark is designed with a dynamic update mechanism, in which the datapoints are refreshed on a monthly basis. Such regular updates play a crucial role in mitigating the risk of LLMs overfitting to the benchmark, promoting a more accurate and reliable assessment of their reasoning capabilities. The benchmark dataset and code of NPHardEval are available at https://github.com/casmlab/NPHardEval.

Introduction to the Evaluation Benchmark

In the landscape of AI, and particularly of LLMs, reasoning ability stands as a critical attribute, especially as these models are increasingly employed in complex problem-solving domains. A new benchmark named NPHardEval has been introduced to evaluate reasoning abilities through 900 algorithmic questions extending up to the NP-hard complexity class. This dynamic benchmark is designed to circumvent the overfitting issues that affect static benchmarks by refreshing its questions on a monthly basis.
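
To make the dynamic update mechanism concrete, below is a minimal sketch, assuming a generator seeded by the release month; the task names, the `generate_instance` helper, and the ten-instances-per-cell count are illustrative assumptions rather than the benchmark's actual code.

```python
import random
from datetime import date

# Hypothetical task names; the real benchmark spans nine tasks across
# the P, NP-complete, and NP-hard classes.
TASKS = ["sorted_array_search", "graph_coloring", "traveling_salesman"]

def monthly_seed(today: date) -> int:
    """Derive a deterministic seed from the release month, so each monthly
    refresh yields new yet reproducible instances."""
    return today.year * 100 + today.month

def generate_instance(task: str, difficulty: int, rng: random.Random) -> dict:
    """Placeholder generator: emit a random instance whose size grows
    with the difficulty level (1-10)."""
    size = 5 * difficulty
    return {"task": task, "difficulty": difficulty,
            "data": [rng.randint(0, 100) for _ in range(size)]}

def refresh_benchmark(today: date, per_cell: int = 10) -> list:
    """Regenerate every (task, difficulty) cell for the current month."""
    rng = random.Random(monthly_seed(today))
    return [generate_instance(task, level, rng)
            for task in TASKS
            for level in range(1, 11)
            for _ in range(per_cell)]

if __name__ == "__main__":
    print(len(refresh_benchmark(date.today())), "instances in this refresh")
```

Because the seed changes every month, models cannot memorize a fixed test set, yet any given release can still be reproduced exactly.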

Task Design and Model Assessment

NPHardEval organizes nine tasks into three complexity classes (P, NP-complete, and NP-hard), with each task subdivided into ten difficulty levels. This graded system of tasks not only captures the reasoning capacity of LLMs but also reflects the kinds of challenges encountered in real-world problem solving across various industries. Moreover, the benchmark stands out for its automated generation and evaluation mechanisms, which improve the reliability and accuracy of assessments. The tasks purposefully omit math-intensive problems, homing in on pure logical reasoning challenges.
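
As a hedged illustration of the automated-evaluation idea for one of the graph tasks, the sketch below checks a proposed graph coloring in polynomial time; the function signature and instance format are assumptions, not the benchmark's actual schema.

```python
from typing import Mapping, Sequence, Tuple

Edge = Tuple[int, int]

def check_graph_coloring(edges: Sequence[Edge],
                         num_colors: int,
                         coloring: Mapping[int, int]) -> bool:
    """Verify a proposed coloring: every vertex receives an allowed color
    and no edge joins two vertices of the same color. Checking a candidate
    answer is polynomial even though finding one is NP-complete, which is
    what makes fully automated scoring feasible."""
    vertices = {v for edge in edges for v in edge}
    if any(v not in coloring for v in vertices):
        return False  # some vertex was left uncolored
    if any(not 0 <= coloring[v] < num_colors for v in vertices):
        return False  # color index out of range
    return all(coloring[u] != coloring[v] for u, v in edges)

# Example: a triangle needs three colors, so this assignment is accepted.
print(check_graph_coloring([(0, 1), (1, 2), (0, 2)], 3, {0: 0, 1: 1, 2: 2}))
```

Scoring a model then reduces to parsing its answer into this structured form and running the checker, with no human or model-based judging in the loop.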

Insights from Initial Findings

Comparing several LLMs on the NPHardEval benchmark revealed distinct patterns. Closed-source models typically outperformed their open-source counterparts across all complexity classes, with a conspicuous trend of diminishing accuracy and increasing failure rates as task difficulty escalated. Notably, GPT-4 performed consistently well, suggesting robustness on complex tasks.

In-context Learning and Future Directions

Evaluating the models' ability to generalize from provided examples revealed a mixed picture. Closed-source models exhibited the potential to genuinely learn and apply algorithmic skills, as indicated by consistent performance across varying example difficulties. Open-source models, on the other hand, often struggled, particularly when the examples were simpler than the test questions. These results underline not only the raw reasoning capabilities of LLMs but also their ability, or lack thereof, to learn in a broader sense.
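
The in-context setup described above can be pictured with a small sketch that pairs a test question with worked examples drawn from a chosen difficulty level; the `Question` fields and the prompt template are illustrative assumptions rather than the benchmark's exact format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Question:
    task: str
    difficulty: int   # 1 (easiest) through 10 (hardest)
    prompt: str
    solution: str

def build_fewshot_prompt(test_q: Question,
                         pool: List[Question],
                         example_difficulty: int,
                         k: int = 3) -> str:
    """Assemble an in-context prompt whose k worked examples all come from
    a single difficulty level, so generalization from easier (or harder)
    examples to the test question can be measured directly."""
    examples = [q for q in pool
                if q.task == test_q.task and q.difficulty == example_difficulty][:k]
    parts = []
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i} (difficulty {ex.difficulty}):\n"
                     f"{ex.prompt}\nAnswer: {ex.solution}\n")
    parts.append(f"Now solve (difficulty {test_q.difficulty}):\n{test_q.prompt}\nAnswer:")
    return "\n".join(parts)
```

Holding the test question fixed while varying `example_difficulty` is what separates models that genuinely transfer an algorithmic skill from those that merely pattern-match examples of comparable hardness.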

Looking ahead, NPHardEval will deploy updates to maintain relevance in the fast-evolving LLM arena. The focus will be on enhancing the evaluation framework, for example, to better represent complexity or to integrate multi-model interactions. These enhancements will pave the way for more realistic assessments of LLM capabilities, providing invaluable insights for their advancement and application in demanding cognitive tasks.

Authors (5)
  1. Lizhou Fan
  2. Wenyue Hua
  3. Lingyao Li
  4. Haoyang Ling
  5. Yongfeng Zhang