- The paper presents HARP, a novel dataset of 5,409 math problems curated from US competitions spanning six difficulty levels.
- The evaluation shows that frontier LLMs suffer significant drops in accuracy on the most challenging problems; o1-mini, for example, scores only 41.1% on the hardest difficulty level.
- The dataset’s dual-format design and presence of multiple human-written solutions enable robust error analysis and exploration of diverse reasoning strategies.
An Analysis of HARP: A Math Reasoning Benchmark
The paper "HARP: A challenging human-annotated math reasoning benchmark" introduces the Human Annotated Reasoning Problems (HARP) dataset, a significant addition to the landscape of math reasoning benchmarks for LLMs. This work presents a dataset of 5,409 math problems sourced from prestigious US national math competitions, marking a deliberate response to the saturation observed in existing benchmarks such as MATH and GSM8k. The authors provide both a quantitative assessment of frontier models' performance on their dataset and a robust qualitative framework through which future research can expand upon the foundational work laid by HARP.
Dataset Composition and Construction
HARP spans six difficulty levels and includes problems from the A(J)HSME, AMC, AIME, and USA(J)MO contests. Notable features include 4,780 problems with short answers suitable for automatic checking using libraries like SymPy, of which 4,110 are also available as multiple-choice questions. Moreover, the dataset provides an average of two human-written ground-truth solutions per problem, opening avenues for exploring solution diversity and for error analysis. The problems are drawn directly from the publicly available AoPS Wiki, and overlap with prior datasets such as OmniMATH and MATH is kept minimal, lending the benchmark both integrity and novelty.
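To make the automatic answer checking concrete, below is a minimal sketch of the kind of SymPy-based equivalence test such graders typically rely on; it illustrates the general technique rather than the paper's actual evaluation code, and the `answers_match` helper is purely illustrative.

```python
# Minimal sketch of SymPy-based answer checking; not the paper's actual grader.
# Assumes both answers are short closed-form expressions such as "3/4" or "sqrt(2)".
from sympy import SympifyError, simplify, sympify


def answers_match(predicted: str, reference: str) -> bool:
    """Return True if the two answer strings are symbolically equivalent."""
    try:
        # Equivalent answers simplify to a zero difference, e.g. "1/2" vs "0.5".
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except (SympifyError, TypeError):
        # Fall back to plain string comparison if parsing fails.
        return predicted.strip() == reference.strip()


print(answers_match("1/2", "0.5"))      # True
print(answers_match("sqrt(4)", "2"))    # True
print(answers_match("x + 1", "x + 2"))  # False
```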
Evaluation of Frontier Models
The paper measures the performance of several frontier models, including Gemini, Claude, Llama, and o1, on HARP, with particular attention to how accuracy varies across difficulty levels. Although these models near or reach saturation on other benchmarks (e.g., around 90% on MATH), they achieve markedly lower accuracies on HARP's hardest problems, with o1-mini scoring 41.1%. The analyses reveal a pronounced drop in performance as problem difficulty increases, underscoring the dataset's challenge. Furthermore, models intrinsically scale their reasoning length with problem difficulty, a pattern also mirrored in the human-written solutions, which further highlights the dataset's potential for dissecting sophisticated problem-solving strategies.
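The per-difficulty breakdown described above amounts to straightforward aggregation; the sketch below assumes a hypothetical results schema (with `level`, `correct`, and `response_tokens` fields) rather than HARP's actual release format.

```python
# Hypothetical per-difficulty aggregation; field names are illustrative, not HARP's schema.
from collections import defaultdict
from statistics import mean


def summarize_by_difficulty(results):
    """Group evaluation records by difficulty level and report accuracy and response length.

    results: iterable of dicts with keys "level", "correct", and "response_tokens".
    """
    buckets = defaultdict(list)
    for record in results:
        buckets[record["level"]].append(record)

    return {
        level: {
            "n": len(rows),
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
            "mean_response_tokens": mean(r["response_tokens"] for r in rows),
        }
        for level, rows in sorted(buckets.items())
    }


# Example usage with toy records:
# summarize_by_difficulty([
#     {"level": 1, "correct": True,  "response_tokens": 180},
#     {"level": 6, "correct": False, "response_tokens": 950},
# ])
```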
Implications and Future Work
The implications of this research extend to both theoretical insights and practical applications in AI. From a theoretical perspective, HARP establishes itself as a pertinent benchmark for assessing genuine problem-solving capabilities beyond rote learning. Its design paves the way for future explorations of model behaviors such as reasoning-length scaling, as well as for methods that exploit the diverse solution paths and the multiple-choice format.
Practically, open-sourcing both the dataset and the evaluation code fosters transparency and ease of access, enabling broader participation in advancing LLM capabilities. Furthermore, HARP's structure supports several research directions, such as investigating inference-time compute scaling, comparing human-written and model-generated solutions, and studying adaptive problem-solving strategies, all of which become more important as the field increasingly targets complex problem-solving domains.
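As one concrete illustration of the inference-time compute direction, the sketch below applies generic self-consistency-style majority voting over sampled final answers; the `sample_answer` callable is a hypothetical model interface, and nothing here is prescribed by the paper itself.

```python
# Generic majority voting over sampled answers (self-consistency); not a method from the paper.
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[str], str], problem: str, n_samples: int = 8) -> str:
    """Sample several candidate final answers and return the most common one.

    sample_answer: hypothetical callable that queries a model once and returns its final answer.
    """
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```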
Conclusion
In conclusion, HARP stands as a well-calibrated benchmark that challenges the frontier of mathematical reasoning in LLMs. This paper effectively positions HARP not only as a needed tool for current models but also as a catalyst for future innovation in AI math reasoning. By addressing gaps and pushing boundaries in the evaluation landscape, it holds the potential to significantly influence both the development and assessment of future LLM capabilities.