
Solving Inequality Proofs with Large Language Models (2506.07927v1)

Published 9 Jun 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for LLMs, offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.

Summary

  • The paper reformulates complex inequality proofs into two verifiable subtasks—bound estimation and relation prediction—to better align with LLM strengths.
  • It presents IneqMath, a dataset featuring 1,252 training problems and 200 expert-curated test cases with rich step-wise and theorem annotations.
  • Experimental results show that while LLMs achieve moderate answer accuracy, they struggle with logical rigor, emphasizing the need for theorem-guided reasoning and self-improvement.

Mathematical inequality proving is a challenging frontier for LLMs, demanding advanced reasoning skills like discovering bounds, applying theorems strategically, and performing precise symbolic transformations. Existing datasets for this task are often scarce, synthetic, or rely on rigid formal systems, which don't align well with LLMs' natural language processing strengths. The paper "Solving Inequality Proofs with LLMs" (2506.07927) addresses this by proposing an informal yet verifiable task formulation and releasing IneqMath, a new benchmark and training corpus for Olympiad-level inequalities.

The core practical contribution is reformulating the complex task of generating a full, formally verifiable proof into two automatically checkable subtasks presented in natural language and LaTeX:

  1. Bound Estimation: Given expressions $f(\mathbf{x})$ and $g(\mathbf{x})$ over variables $\mathbf{x}$ in a domain $\mathcal{D}$, find the maximal constant $C$ such that $f(\mathbf{x}) \geq C g(\mathbf{x})$ holds for all $\mathbf{x} \in \mathcal{D}$ (or the minimal $C$ for $\leq$). The answer is a single constant value.
  2. Relation Prediction: Given expressions $f(\mathbf{x})$ and $g(\mathbf{x})$ over $\mathbf{x}$ in $\mathcal{D}$, determine the correct relation ($>$, $\geq$, $=$, $\leq$, $<$, or none) between $f(\mathbf{x})$ and $g(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{D}$. This is framed as a multiple-choice problem. A minimal sketch of how such an answer can be checked automatically follows this list.
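
To make the formulation concrete, here is a minimal sketch of an automatic spot check for a bound estimation answer. The inequality $x^2 + y^2 \geq 2xy$ (so $C = 2$ by AM-GM, with equality at $x = y$) is our own illustrative example, and the sampling check below is not the paper's verification code:

    # Spot-checking a bound estimation answer: f(x, y) = x^2 + y^2,
    # g(x, y) = x*y over x, y > 0, candidate answer C = 2 (AM-GM).
    # Illustrative sketch only, not the paper's verification code.
    import random

    def f(x, y):
        return x**2 + y**2

    def g(x, y):
        return x * y

    C = 2
    tol = 1e-9  # floating-point slack

    # f - C*g must be nonnegative across the domain...
    for _ in range(10_000):
        x, y = random.uniform(0.01, 10), random.uniform(0.01, 10)
        assert f(x, y) - C * g(x, y) >= -tol

    # ...and C must be tight: equality holds at x = y, so no larger C works.
    assert f(1, 1) - C * g(1, 1) == 0
    print("C = 2 passes the spot checks")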

This informal yet verifiable formulation allows LLMs to leverage their natural language understanding while still providing objective ground truth for evaluation of the final answer.

To support this task, the paper introduces the IneqMath dataset. The test set consists of 200 novel, expert-curated problems designed by IMO-level medalists to minimize contamination. The training corpus comprises 1,252 problems sourced from advanced textbooks, automatically rephrased into the bound and relation subtasks by LLMs, and then meticulously reviewed and corrected by human experts. A key feature of the training data is the rich annotation, including step-wise solutions (up to four per problem) and theorem annotations (76.8% of problems are linked to 83 named theorems across 29 categories). This detailed structure is designed to facilitate training models on generating step-by-step reasoning and applying relevant mathematical knowledge.
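
As an illustration of this annotation structure, a single training record might look roughly like the following; all field names here are hypothetical, not the dataset's actual schema:

    # Hypothetical shape of one IneqMath training record. Field names are
    # illustrative assumptions; consult the released dataset for the real schema.
    record = {
        "task": "bound_estimation",  # or "relation_prediction"
        "problem": "Find the maximal constant C such that ... holds for all ...",
        "answer": "C = 2",
        "solutions": [  # up to four step-wise solutions per problem
            ["Step 1: Apply AM-GM to the left-hand side.",
             "Step 2: Show equality at x = y, so C = 2 is tight."],
        ],
        "theorems": ["AM-GM Inequality"],  # present for 76.8% of problems
    }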

For rigorous evaluation, the paper develops a novel LLM-as-judge framework in which other LLMs grade solutions. This framework moves beyond simple final-answer checking by incorporating five judges:

  1. Final Answer Judge: Uses prompt engineering to extract the model's final answer from the response and performs mathematical equivalence checking against the ground truth, handling variations in phrasing or numerical format. For example, it verifies that $C = 1/\sqrt{2}$ is equivalent to $C = \sqrt{2}/2$; a sketch of such a check appears after this list.
    • Implementation: A prompt instructs a powerful LLM (like GPT-4o mini) to identify the answer statement. For bound problems, it extracts the numerical/symbolic value. For relation problems, it extracts the option letter (A-F). A subsequent prompt asks the LLM to verify the mathematical equivalence of the extracted answer and the ground truth, following strict rules about exact values vs. approximations.
  2. Four Step-wise Judges: These judges scrutinize the reasoning process step-by-step to detect common flaws observed in pilot studies:
    • Toy Case Judge: Flags unjustified generalizations from specific numerical examples or extreme-case analysis used to conclude a general inequality direction.
      • Implementation: A prompt asks the LLM to identify if toy cases or special value substitutions were used to justify the direction of an inequality for the entire domain, flagging such instances as invalid reasoning.
    • Logical Gap Judge: Identifies missing intermediate steps, unjustified claims, or conclusions asserted without adequate derivation or support (e.g., stating an optimal bound without showing the optimization process).
      • Implementation: A prompt requires the LLM to check if all non-trivial claims and transformations are explicitly justified by algebra, theorems, or demonstrated analytical steps (like showing derivatives for optimization). Claims based on "numerical checks" must include actual values or results.
    • Numerical Approximation Judge: Detects inappropriate use of numerical approximations (e.g., $\sqrt{2} \approx 1.414$) within the logical derivation, which can compromise mathematical rigor.
      • Implementation: A prompt instructs the LLM to identify if exact mathematical expressions were replaced with approximate decimal values and if these approximations were used in calculations or comparisons beyond simple, easily verifiable cases (like comparing $\sqrt{4}$ and $2$).
    • Numerical Computation Judge: Verifies the correctness of explicit numerical calculations performed during the solution process.
      • Implementation: A prompt identifies numerical equations (e.g., $3 + 27/27 + 2/3 = 4$), extracts them, and converts them into Python code snippets for evaluation, typically using floating-point comparison with tolerance or symbolic libraries like SymPy for exact arithmetic on fractions.
        # Example verification for a numerical computation judge
        from sympy import Rational

        # Claimed computation extracted from a solution: phi(3) = 3 + 27/27 + 2/3 = 4
        calculated_value = Rational(3) + Rational(27, 27) + Rational(2, 3)
        expected_value = Rational(4)

        # Compare symbolically so fractions are exact. Here calculated_value
        # is 14/3, so the check returns False and the judge would flag the
        # claimed computation as incorrect.
        answer = (calculated_value == expected_value)

        # Or, with floating-point comparison and a tolerance:
        # answer = abs(float(calculated_value) - float(expected_value)) < 1e-9

A solution is considered correct overall only if it passes all five judges. On a development set, these judges demonstrated strong agreement with human annotations (average F1 > 0.9), providing a scalable alternative to manual expert review.
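
The deterministic pieces of this pipeline are easy to sketch. Below, the final-answer equivalence test uses SymPy, and the four step-wise judges are stubbed as hypothetical callables; in the paper all judges are prompt-based LLM calls, so this is an illustrative skeleton rather than the actual implementation:

    # Sketch of final-answer equivalence checking plus the aggregation rule
    # that a solution passes only if all five judges agree. The step-wise
    # judges are hypothetical stubs standing in for prompt-based LLM calls.
    from sympy import simplify, sympify

    def answers_equivalent(predicted: str, ground_truth: str) -> bool:
        """Exact mathematical equivalence, e.g. 1/sqrt(2) vs sqrt(2)/2."""
        return simplify(sympify(predicted) - sympify(ground_truth)) == 0

    def overall_correct(solution: str, predicted: str, ground_truth: str,
                        stepwise_judges) -> bool:
        """Correct overall only if the answer matches and every step-wise
        judge (toy case, logical gap, approximation, computation) passes."""
        return (answers_equivalent(predicted, ground_truth)
                and all(judge(solution) for judge in stepwise_judges))

    print(answers_equivalent("1/sqrt(2)", "sqrt(2)/2"))  # True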

Experimental evaluation of 29 leading LLMs on IneqMath reveals a striking discrepancy. While specialized reasoning models achieve higher final-answer accuracy than general chat models (e.g., o1 at 62.5% Answer Acc vs. GPT-4o at 37.5%), their performance drops drastically under step-wise scrutiny. Overall Accuracy (requiring both correct answer and sound steps) is significantly lower, often below 10% even for top models like o1 (8.0%). This large gap (up to 65.5%) exposes fragile deductive chains and highlights a critical limitation: current LLMs can often find correct answers but struggle to construct rigorous, logically sound proofs.

An in-depth analysis of failures shows that the most common errors are logical gaps and unjustified generalizations from toy cases. Scaling model size improves final-answer accuracy but has limited impact on overall proof correctness. Similarly, increasing test-time computation (allowing longer outputs) yields diminishing returns for overall accuracy, suggesting that simply generating more tokens or exploring longer reasoning paths is not sufficient for mathematical rigor.

The paper explores promising strategies for improvement:

  • Theorem-guided reasoning: Providing models with relevant "golden" theorems from the training data as hints can improve overall accuracy for stronger models (up to 11% for o3-mini). However, providing irrelevant theorems can be detrimental, indicating that effective theorem retrieval (e.g., using RAG techniques) is crucial for this approach to be consistently beneficial.
  • Self-improvement via critic as feedback: Allowing models to critique and refine their own generated solutions can increase overall accuracy (e.g., Gemini 2.5 Pro gains 5%). This self-refinement loop is a promising direction for enhancing logical rigor without requiring external supervision; a minimal sketch of such a loop follows.
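
As a rough sketch of the critic-as-feedback loop, assuming a generic, hypothetical llm() completion function (any chat API could stand in; the paper's actual prompts may differ):

    # Minimal sketch of self-improvement via critic feedback, assuming a
    # hypothetical llm(prompt: str) -> str completion function; the exact
    # prompts and model calls used in the paper may differ.
    def llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for any chat-completion API")

    def solve_with_self_refinement(problem: str, rounds: int = 1) -> str:
        solution = llm(f"Solve this inequality problem step by step:\n{problem}")
        for _ in range(rounds):
            critique = llm(
                "Critique this proof. Flag toy-case generalizations, logical "
                f"gaps, and unjustified numerical approximations:\n{solution}"
            )
            solution = llm(
                f"Problem:\n{problem}\n\nDraft proof:\n{solution}\n\n"
                f"Critique:\n{critique}\n\nRewrite the proof fixing every issue."
            )
        return solution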

In summary, the paper introduces a practical approach to evaluating informal inequality proving using LLM-as-judge, provides the IneqMath dataset with rich annotations for training and benchmarking, and empirically demonstrates that current LLMs, despite moderate success in finding correct answers, lack the robustness needed for reliable step-by-step mathematical proof generation, paving the way for future research in theorem-guided reasoning and self-correction.
