- The paper presents HARP, a novel dataset of 5,409 math problems curated from US competitions spanning six difficulty levels.
- The evaluation shows that frontier LLMs suffer significant drops in accuracy on the most challenging problems; o1-mini, for example, scores only 41.1% on the hardest difficulty level.
- The dataset’s dual-format design and presence of multiple human-written solutions enable robust error analysis and exploration of diverse reasoning strategies.
An Analysis of HARP: A Math Reasoning Benchmark
The paper "HARP: A challenging human-annotated math reasoning benchmark" introduces the Human Annotated Reasoning Problems (HARP) dataset, a significant addition to the landscape of math reasoning benchmarks for LLMs. This work presents a dataset of 5,409 math problems sourced from prestigious US national math competitions, marking a deliberate response to the saturation observed in existing benchmarks such as MATH and GSM8k. The authors provide both a quantitative assessment of frontier models' performance on their dataset and a robust qualitative framework through which future research can expand upon the foundational work laid by HARP.
Dataset Composition and Construction
HARP spans six difficulty levels and includes problems from the A(J)HSME, AMC, AIME, and USA(J)MO contests. Notable features include 4,780 problems with short answers suitable for automatic checking using libraries like SymPy, of which 4,110 are also available as multiple-choice questions. Moreover, the dataset provides an average of two human-written ground-truth solutions per problem, opening avenues for exploring solution diversity and for error analysis. The problems are drawn directly from the publicly available AoPS Wiki, and overlap with prior datasets such as OmniMATH and MATH is kept minimal, lending the benchmark both integrity and novelty.
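To make the automatic answer checking concrete, below is a minimal sketch of the kind of SymPy-based equivalence test such graders typically rely on; it illustrates the general technique rather than the paper's actual evaluation code, and the `answers_match` helper is purely illustrative.

```python
# Minimal sketch of SymPy-based answer checking; not the paper's actual grader.
# Assumes both answers are short closed-form expressions such as "3/4" or "sqrt(2)".
from sympy import SympifyError, simplify, sympify


def answers_match(predicted: str, reference: str) -> bool:
    """Return True if the two answer strings are symbolically equivalent."""
    try:
        # Equivalent answers simplify to a zero difference, e.g. "1/2" vs "0.5".
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except (SympifyError, TypeError):
        # Fall back to plain string comparison if parsing fails.
        return predicted.strip() == reference.strip()


print(answers_match("1/2", "0.5"))      # True
print(answers_match("sqrt(4)", "2"))    # True
print(answers_match("x + 1", "x + 2"))  # False
```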
Evaluation of Frontier Models
The paper measures the performance of several frontier models, including Gemini, Claude, Llama, and o1, on HARP, with particular attention to how accuracy varies across difficulty levels. Although these models near or reach saturation on other benchmarks (e.g., around 90% on MATH), they achieve markedly lower accuracies on HARP's hardest problems, with o1-mini scoring 41.1%. The analyses reveal a pronounced drop in performance as problem difficulty increases, underscoring the dataset's challenge. Furthermore, models intrinsically scale their reasoning length with problem difficulty, a pattern also mirrored in the human-written solutions, which further highlights the dataset's potential for dissecting sophisticated problem-solving strategies.
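The per-difficulty breakdown described above amounts to straightforward aggregation; the sketch below assumes a hypothetical results schema (with `level`, `correct`, and `response_tokens` fields) rather than HARP's actual release format.

```python
# Hypothetical per-difficulty aggregation; field names are illustrative, not HARP's schema.
from collections import defaultdict
from statistics import mean


def summarize_by_difficulty(results):
    """Group evaluation records by difficulty level and report accuracy and response length.

    results: iterable of dicts with keys "level", "correct", and "response_tokens".
    """
    buckets = defaultdict(list)
    for record in results:
        buckets[record["level"]].append(record)

    return {
        level: {
            "n": len(rows),
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
            "mean_response_tokens": mean(r["response_tokens"] for r in rows),
        }
        for level, rows in sorted(buckets.items())
    }


# Example usage with toy records:
# summarize_by_difficulty([
#     {"level": 1, "correct": True,  "response_tokens": 180},
#     {"level": 6, "correct": False, "response_tokens": 950},
# ])
```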
Implications and Future Work
The implications of this research extend to both theoretical insights and practical applications in AI. From a theoretical perspective, HARP establishes itself as a pertinent benchmark for assessing genuine problem-solving capabilities beyond rote learning. Its design paves the way for future explorations of model behaviors such as reasoning-length scaling, as well as for methods that exploit the diverse solution paths and the multiple-choice format.
Practically, open-sourcing both the dataset and the evaluation code fosters transparency and ease of access, enabling broader participation in advancing LLM capabilities. Furthermore, HARP's structure supports several research directions, such as investigating inference-time compute scaling, comparing human-written and model-generated solutions, and studying adaptive problem-solving strategies, all of which become more important as the field increasingly targets complex problem-solving domains.
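As one concrete illustration of the inference-time compute direction, the sketch below applies generic self-consistency-style majority voting over sampled final answers; the `sample_answer` callable is a hypothetical model interface, and nothing here is prescribed by the paper itself.

```python
# Generic majority voting over sampled answers (self-consistency); not a method from the paper.
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[str], str], problem: str, n_samples: int = 8) -> str:
    """Sample several candidate final answers and return the most common one.

    sample_answer: hypothetical callable that queries a model once and returns its final answer.
    """
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```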
Conclusion
In conclusion, HARP stands as a well-calibrated benchmark that challenges the frontier of mathematical reasoning in LLMs. This paper effectively positions HARP not only as a needed tool for current models but also as a catalyst for future innovation in AI math reasoning. By addressing gaps and pushing boundaries in the evaluation landscape, it holds the potential to significantly influence both the development and assessment of future LLM capabilities.