
IneqMath Dataset

Updated 1 July 2025
  • IneqMath is an expert-curated dataset of Olympiad-level inequality problems designed to evaluate the mathematical reasoning and proof generation capabilities of large language models (LLMs).
  • The dataset features 1,252 training problems with stepwise human-annotated solutions, including theorem references, focusing on algebraic, analytic, and geometric inequalities.
  • It introduces a robust LLM-as-judge evaluation framework that assesses both the final answer and the logical soundness of each step in a solution, revealing significant gaps in LLM reasoning rigor.

The IneqMath Dataset is a comprehensive, expert-curated benchmark designed to rigorously evaluate mathematical reasoning—particularly inequality proving—by LLMs. Developed in response to the scarcity and shortcomings of prior datasets, IneqMath targets Olympiad-level inequalities and incorporates stepwise solution annotations, theorem references, and a technically robust evaluation framework. Its primary aim is to probe not only answer retrieval, but also the construction of verifiable, logically sound proofs.

1. Dataset Structure and Coverage

IneqMath encompasses a broad spectrum of advanced inequality problems, focusing on domains that challenge both experienced mathematicians and state-of-the-art AI systems:

  • Problem Types: Algebraic, analytic, and geometric inequalities, each formulated at Olympiad or higher levels.
  • Subtasks:
    • Bound Estimation: Given f(\mathbf{x}) \geq C\,g(\mathbf{x}) (or f(\mathbf{x}) \leq C\,g(\mathbf{x})), determine the maximal (or minimal) constant C^\star ensuring the inequality holds over a specified domain \mathcal{D}:

      C^\star = \sup\{C \in \mathbb{R} : f(\mathbf{x}) \geq C\,g(\mathbf{x}),\ \forall \mathbf{x} \in \mathcal{D}\}

    • Relation Prediction: Select the correct universal relation (>, \geq, =, \leq, <, or "None of the above") between f(\mathbf{x}) and g(\mathbf{x}) over \mathcal{D}.

  • Scale and Composition:

    • Test Set: 200 expert-constructed problems for leaderboard benchmarking.
    • Development Set: 100 problems with detailed ground truth and fine-grained annotations.
    • Training Set: 1,252 problems—each supplied with up to four stepwise, human-reviewed solution paths.
    • Annotations: 962 training problems include explicit references to 83 named theorems across 29 categories, such as AM-GM, Cauchy-Schwarz, Jensen’s, and Minkowski’s inequalities.
  • Representation: Problems and solutions are articulated in informal (yet precise) mathematical language, utilizing LaTeX notation for clarity and accessibility.

Examples:

  • Bound Estimation: “Find the maximal C so that for all real a, b, c,

\sqrt{a^2+(1-b)^2} + \sqrt{b^2+(1-c)^2} + \sqrt{c^2+(1-a)^2} \geq C.”

  • Relation Prediction: “Let a, b, c > 0 with abc = 1.

\frac{b+c}{\sqrt{a}} + \frac{c+a}{\sqrt{b}} + \frac{a+b}{\sqrt{c}} \quad (\quad) \quad \sqrt{a}+\sqrt{b}+\sqrt{c}+3”
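
To make the two example tasks concrete, here is a small numerical sanity check of both problems; it is an illustrative sketch only (the helper names are assumptions, and it is not part of the IneqMath tooling). For bound estimation it approximates C^\star by minimizing the left-hand side over a grid (here g \equiv 1, so C^\star is simply the infimum of the left-hand side); for relation prediction it samples the constraint surface abc = 1.

```python
import itertools
import math

def bound_lhs(a: float, b: float, c: float) -> float:
    """Left-hand side of the bound-estimation example."""
    return (math.sqrt(a ** 2 + (1 - b) ** 2)
            + math.sqrt(b ** 2 + (1 - c) ** 2)
            + math.sqrt(c ** 2 + (1 - a) ** 2))

# Estimate the largest valid C as the minimum of the LHS over a coarse grid.
grid = [i / 10 for i in range(-20, 31)]  # a, b, c ranging over [-2.0, 3.0]
c_star_estimate = min(bound_lhs(a, b, c)
                      for a, b, c in itertools.product(grid, repeat=3))
print(f"estimated C* ~ {c_star_estimate:.4f}")  # matches 3/sqrt(2) ~ 2.1213, at a = b = c = 1/2

def relation_gap(a: float, b: float, c: float) -> float:
    """LHS minus RHS of the relation-prediction example (assumes abc = 1)."""
    lhs = (b + c) / math.sqrt(a) + (c + a) / math.sqrt(b) + (a + b) / math.sqrt(c)
    rhs = math.sqrt(a) + math.sqrt(b) + math.sqrt(c) + 3
    return lhs - rhs

# Sample the constraint surface abc = 1 by picking a, b and setting c = 1/(a*b).
values = (0.2, 0.5, 1.0, 2.0, 5.0)
gaps = [relation_gap(a, b, 1 / (a * b)) for a in values for b in values]
print(min(gaps))  # 0.0 at a = b = c = 1; all sampled gaps are non-negative, consistent with ">="
```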

2. Expert Annotation and Theorem Tagging

IneqMath is distinguished by its inclusion of step-by-step human-crafted solutions and explicit theorem annotation:

  • Stepwise Solutions: Each training sample includes up to four solution variants, detailing algebraic manipulations, theorem applications, and logical deductions. Solutions are constructed to demonstrate alternate strategies, ensuring coverage of common mathematical approaches.
  • Theorem Annotations: 76.8% of the training set is annotated with theorems applied in the solution path, spanning the breadth of competition-mathematics techniques:
    • AM-GM (Arithmetic Mean–Geometric Mean)
    • Cauchy-Schwarz
    • Jensen’s Inequality
    • Minkowski’s Inequality
    • Schur’s Inequality
    • Triangle inequality and others
  • Format: All annotations and solutions are presented in JSON and LaTeX-formatted plaintext, optimized for both human readability and machine parsing.
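
For orientation, the sketch below shows how one such annotated record could be laid out; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a single annotated training record (illustrative only;
# field names and values are assumptions, not the released IneqMath schema).
example_record = {
    "problem_id": "train-0001",
    "task": "bound_estimation",                      # or "relation_prediction"
    "problem": r"Find the maximal $C$ so that for all real $a, b, c$, ... \geq C.",
    "answer": r"C = \frac{3}{\sqrt{2}}",
    "solutions": [                                   # up to four stepwise solution paths
        {
            "steps": [
                r"By Minkowski's inequality, ...",
                r"Minimizing over $s = a + b + c$ gives ...",
            ],
            "theorems": ["Minkowski's Inequality"],  # named-theorem tags (83 theorems, 29 categories)
        },
    ],
}
```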

3. Evaluation Framework: LLM-as-Judge

A distinguishing feature of IneqMath is its LLM-as-judge evaluation paradigm, engineered for stepwise and final-answer rigor:

  • Final-Answer Judge: Assesses mathematical equivalence of end results, tolerant of different admissible forms (e.g., C = \frac{1}{\sqrt{2}} vs. C = \sqrt{2}/2).
  • Four Stepwise Judges, each targeting a common reasoning flaw:

    1. Toy Case Judge: Detects unjustified generalization from single or “toy” examples.
    2. Logical Gap Judge: Flags missing justifications, skipped sub-steps, or unwarranted WLOG reductions.
    3. Numerical Approximation Judge: Penalizes reliance on decimal/rounded approximations where only exact values are permissible.
    4. Numerical Computation Judge: Verifies explicit arithmetic correctness after substitution or analytic manipulation.

A solution must pass all five criteria (final plus four stepwise) to be scored as correct. Validation against expert annotations demonstrates high alignment (average F1 = 0.93).
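
A minimal sketch of how this pass-all-five aggregation could be wired up is shown below; ask_llm_judge is a placeholder for a judge-model call, and the question wordings are paraphrases rather than the benchmark's actual prompts.

```python
from typing import Callable

# Paraphrased checks for the four stepwise judges; a "yes" answer flags a flaw.
STEPWISE_JUDGES = {
    "toy_case": "Does any step generalize from specific numeric (toy) cases without justification?",
    "logical_gap": "Does any step skip a needed justification, sub-step, or make an unjustified WLOG reduction?",
    "numerical_approximation": "Does any step rely on decimal approximations where exact values are required?",
    "numerical_computation": "Is any explicit arithmetic after substitution or manipulation incorrect?",
}

def evaluate_solution(problem: str, solution: str, gold_answer: str,
                      ask_llm_judge: Callable[[str], bool]) -> bool:
    """Return True only if the final answer and all four stepwise checks pass."""
    final_ok = ask_llm_judge(
        f"Problem: {problem}\nGold answer: {gold_answer}\nProposed solution: {solution}\n"
        "Is the final answer mathematically equivalent to the gold answer "
        "(e.g. 1/sqrt(2) and sqrt(2)/2 count as equivalent)? Answer yes or no."
    )
    stepwise_ok = all(
        not ask_llm_judge(f"Solution:\n{solution}\n\n{question} Answer yes or no.")
        for question in STEPWISE_JUDGES.values()
    )
    return final_ok and stepwise_ok
```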

4. Model Performance and Insights

IneqMath provides unprecedented granularity in exposing LLM performance gaps:

  • Discrepancy between Answer and Rigor: State-of-the-art reasoning LLMs, such as OpenAI o1, display a steep accuracy drop from answer-level scoring (up to 62.5%) to stepwise-rigorous scoring (below 10%). The average drop can reach 65.5% across models.

  • Common Failure Modes: Logical gaps (85% of incorrect solutions) and overgeneralization from toy cases (60%) dominate. Many models produce plausible answers—often using shortcut heuristics—without constructing logically valid proofs.

  • Scale and Computation: Increasing model size or search depth yields only marginal improvements in fully rigorous proofs; deficiencies in deductive structure persist.

  • Improvement via Theorem Guidance and Self-Refinement: Supplying theorem statements as retrieval-augmented context and prompting models to critique and revise their proofs raises rigorous solution rates by up to 11% and 5% respectively, suggesting promise in these directions.
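
The sketch below illustrates, under stated assumptions, how these two interventions might be combined: theorem statements supplied as retrieval-style context plus a critique-and-revise loop. The generate callable and prompt wording are placeholders, not the paper's exact setup.

```python
from typing import Callable, Sequence

def prove_with_hints_and_refinement(problem: str,
                                    theorem_hints: Sequence[str],
                                    generate: Callable[[str], str],
                                    rounds: int = 2) -> str:
    """Draft a proof with theorem hints, then critique and revise it a few times."""
    hint_block = "\n".join(f"- {t}" for t in theorem_hints)
    proof = generate(
        f"You may use the following theorems:\n{hint_block}\n\n"
        f"Prove the inequality with fully justified steps:\n{problem}"
    )
    for _ in range(rounds):
        critique = generate(
            "List any logical gaps, unjustified generalizations from special cases, "
            f"or numerical shortcuts in this proof:\n{proof}"
        )
        proof = generate(
            "Revise the proof to address every issue below, keeping all steps rigorous.\n"
            f"Issues:\n{critique}\n\nOriginal proof:\n{proof}"
        )
    return proof
```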

5. Technical Formulations and Data Accessibility

IneqMath is formulated for maximum relevance and reproducibility:

  • Problem Types:

    • Bound inference: \Pi_{\text{bound}} = (f(\mathbf{x}),\, g(\mathbf{x}),\, \mathcal{D}); infer the optimal bound C^\star.
    • Relation inference: \Pi_{\text{rel}} = (f(\mathbf{x}),\, g(\mathbf{x}),\, \mathcal{D}); predict the universally valid relation: >, \geq, =, \leq, <, or none.
  • Stepwise Solution Format: Each solution path features precise algebraic reasoning:
    • Theorem application (e.g., AM-GM: \frac{a_1+\ldots+a_n}{n} \geq (a_1 \cdots a_n)^{1/n}).
    • Structural manipulations.
    • Analytic optimization and equality cases.
  • Data and Code: The dataset and evaluation code are publicly released at https://ineqmath.github.io/ and on HuggingFace as an open-source resource.
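
As a purely illustrative, hypothetical typed view of the two formulations above (class and field names are assumptions, not the released format):

```python
from dataclasses import dataclass

@dataclass
class BoundEstimationProblem:
    """Pi_bound = (f, g, D); the target is the optimal constant C*."""
    f: str          # LaTeX for f(x)
    g: str          # LaTeX for g(x)
    domain: str     # LaTeX description of D
    direction: str  # ">=" or "<=", i.e. which side the constant bounds

@dataclass
class RelationPredictionProblem:
    """Pi_rel = (f, g, D); the target is one of >, >=, =, <=, <, or none."""
    f: str
    g: str
    domain: str
```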

6. Research Implications and Significance

IneqMath establishes a rigorous and comprehensive standard for evaluating mathematical reasoning in AI:

  • Quantifies Deductive Gaps: The dataset reveals that high answer accuracy by LLMs does not imply proof-level rigor, making it possible to track progress on verifiable deduction rather than answer synthesis alone.
  • Drives Research into Proof Robustness: IneqMath highlights the urgent need for advances in theorem-guided prompting, explicit tool retrieval, iteration/self-critique loops, and architecture design focused on proof-chain reliability.
  • Catalyst for Future Datasets: Its stepwise LLM-as-judge methodology and detailed annotation protocol offer a blueprint for subsequent benchmarks in mathematical reasoning and theorem proving.
  • Community Resource: Tools for annotation, problem submission, and result ranking enhance reproducibility and foster a collaborative ecosystem for continuous advancement.

Summary Table: Core Elements of IneqMath

| Component | Description |
| --- | --- |
| Problem Types | Olympiad-level algebraic/analytic/geometric inequalities |
| Tasks | Bound estimation, relation inference |
| Scale | 1,252 training, 100 dev, 200 test problems + solutions |
| Annotation | Stepwise, theorem-tagged, multiple solution paths |
| Evaluation | LLM-as-judge, stepwise soundness + answer correctness |
| Public Access | https://ineqmath.github.io/, HuggingFace, open-source |

IneqMath represents a significant technical advance by providing a fine-grained, expert-oriented, and scalable resource for benchmarking and researching mathematical proof generation, specifically for inequalities. Its stepwise annotated solutions and advanced evaluation framework expose and quantify the distance yet to travel before AI tools can match human mathematical rigor and reasoning.