MathDuels: Evaluating LLMs as Problem Posers and Solvers

Published 23 Apr 2026 in cs.CL and cs.SE | (2604.21916v2)

Abstract: As frontier LLMs attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.


Authors (4)

Summary

The paper introduces a self-play evaluation framework where LLMs both pose and solve math problems to continuously challenge performance.
It employs a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), coupled with a dual-rating mechanism based on the Rasch model and a transformation to the Elo scale.
Empirical results on 19 LLMs reveal decoupled authoring and solving abilities, dynamic difficulty escalation, and consistent verification robustness.

MathDuels: Dual-Role Evaluation of LLM Mathematical Competence

Motivation and Limitations of Static Benchmarks

Current frontier models reach near-ceiling performance on static mathematical benchmarks for LLMs, including MATH and GSM8K, which reduces the discriminative power of these benchmarks. Fixed or even annually refreshed problem pools cannot keep pace with the rate of model improvement. Human-curated, research-level problems are an alternative, but they are unsustainable as a steady source of adversarial evaluation. This motivates automated, dynamic, and model-adaptive protocols that both evaluate and continuously challenge frontier LLMs on mathematical reasoning.

MathDuels Protocol: Self-Play Model Evaluation

MathDuels introduces a self-play framework in which LLMs serve as both problem posers and solvers. Under adversarial prompting, each model writes math problems intended to defeat the other participants, and it must also solve the problems authored by every other model in the pool. The protocol is entirely automated and consists of four phases: problem generation, solving, verification, and ranking.

Problem Generation Pipeline

MathDuels employs a three-stage generation process:

Meta-Prompting: The model generates a meta-prompt which guides the subsequent formulation of a challenging problem in a specific mathematical domain.
Problem Generation: The model instantiates a problem (statement and gold answer) based on the meta-prompt.
Difficulty Amplification: The model iteratively hardens the problem to increase reasoning demands, yielding a more adversarial challenge.

Each model’s generation is unconstrained beyond the prescribed difficulty orientation and domain, allowing for maximal diversity and creativity in problem posing.
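The three stages above can be sketched as a short chain of model calls. This is a minimal illustration, not the authors' implementation: the `Model` callable, the prompt wording, the `STATEMENT ||| ANSWER` reply format, and the `amplify_rounds` parameter are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a "model" is any function from a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class Problem:
    domain: str
    statement: str
    gold_answer: str


def generate_problem(model: Model, domain: str, amplify_rounds: int = 1) -> Problem:
    """Run the three-stage pipeline with one authoring model (illustrative prompts only)."""
    # Stage 1, meta-prompting: the model writes its own guiding prompt for the domain.
    meta_prompt = model(f"Write a prompt that would elicit a very hard {domain} problem.")

    # Stage 2, problem generation: instantiate a statement plus gold answer.
    draft = model(f"{meta_prompt}\nReturn the problem as 'STATEMENT ||| ANSWER'.")

    # Stage 3, difficulty amplification: iteratively harden the draft.
    for _ in range(amplify_rounds):
        draft = model(
            "Make this problem harder while keeping a single checkable answer. "
            f"Keep the same 'STATEMENT ||| ANSWER' format:\n{draft}"
        )

    # Split the assumed reply format into statement and gold answer.
    statement, _, answer = draft.partition("|||")
    return Problem(domain, statement.strip(), answer.strip())
```

With `amplify_rounds=2` the author model is called four times: once per stage, plus one extra hardening pass.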

Solving and Verification

After problem generation, all non-author models attempt to solve each problem, producing final answers and reasoning traces. Symbolic equivalence checking determines whether an answer is correct. For problems where not all solvers agree, a verification phase is triggered using an independent LLM-based verifier. The verifier filters out ill-posed items and adjudicates a consensus answer where one exists.
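The grading flow described above might look roughly like the following sketch. Several pieces are assumptions made for illustration: exact rational comparison via `fractions.Fraction` stands in for real symbolic equivalence checking, "not all solvers agree" is read as "at least one answer differs from the gold answer", and the `verifier` callable (returning an adjudicated answer, or `None` for an ill-posed item) is a hypothetical interface.

```python
from fractions import Fraction
from typing import Callable, Dict, Optional


def equivalent(a: str, b: str) -> bool:
    """Stand-in for symbolic equivalence: exact rationals if parseable, else normalized text."""
    try:
        return Fraction(a.strip()) == Fraction(b.strip())
    except (ValueError, ZeroDivisionError):
        return a.strip().lower() == b.strip().lower()


# Hypothetical verifier signature: (statement, gold, answers) -> adjudicated answer or None.
Verifier = Callable[[str, str, Dict[str, str]], Optional[str]]


def grade(statement: str, gold: str, answers: Dict[str, str],
          verifier: Verifier) -> Optional[Dict[str, bool]]:
    """Return per-solver correctness, or None if the verifier excludes the problem."""
    reference = gold
    unanimous = all(equivalent(ans, gold) for ans in answers.values())
    if not unanimous:
        # Disagreement triggers the verification phase.
        reference = verifier(statement, gold, answers)
        if reference is None:
            return None  # ill-posed item, dropped from the pool
    return {solver: equivalent(ans, reference) for solver, ans in answers.items()}
```

Excluded problems return `None` so that downstream rating code can skip them rather than count them as failures.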

Dual-Rating Mechanism

Capability estimation is formalized via the Rasch model, a logistic latent-trait model standard in item response theory (IRT). It jointly estimates model (solver) ability and problem (item) difficulty by maximizing the log-likelihood of the observed solve outcomes. A transformation to the Elo rating scale anchors solver abilities so that they are comparable across runs. Author rating is defined as the mean scaled difficulty of the valid, correctly solved problems written by a model, with safeguards to prevent inflation via invalid, ambiguous, or unsolved items. The composite model rating weights the solver and author axes equally.
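Under the Rasch model, solver $i$ with ability $\theta_i$ solves problem $j$ with difficulty $b_j$ with probability $1/(1+e^{-(\theta_i-b_j)})$. The sketch below fits both sets of parameters by gradient ascent and maps them to an Elo-like scale. It is illustrative only: the small L2 penalty (added so that all-solved or never-solved items stay finite), the anchor of 1500, the scale factor $400/\ln 10$ (the standard correspondence between a natural-log logistic and Elo's base-10 curve), and all function names are assumptions, not constants or code taken from the paper.

```python
import math
from collections import defaultdict

ELO_SCALE = 400.0 / math.log(10.0)  # one logit is roughly 173.7 Elo points


def fit_rasch(outcomes, steps=500, lr=1.0, l2=0.01):
    """outcomes: iterable of (solver, problem, solved). Returns (ability, difficulty) dicts."""
    outcomes = list(outcomes)
    theta = {s: 0.0 for s, _, _ in outcomes}
    b = {p: 0.0 for _, p, _ in outcomes}
    n_s, n_p = defaultdict(int), defaultdict(int)
    for s, p, _ in outcomes:
        n_s[s] += 1
        n_p[p] += 1
    for _ in range(steps):
        # Gradient of the L2-penalized log-likelihood.
        g_theta = {s: -l2 * t for s, t in theta.items()}
        g_b = {p: -l2 * d for p, d in b.items()}
        for s, p, solved in outcomes:
            prob = 1.0 / (1.0 + math.exp(-(theta[s] - b[p])))
            resid = float(solved) - prob
            g_theta[s] += resid
            g_b[p] -= resid
        # Step sizes are scaled by each parameter's observation count for stability.
        for s in theta:
            theta[s] += lr * g_theta[s] / n_s[s]
        for p in b:
            b[p] += lr * g_b[p] / n_p[p]
    return theta, b


def to_elo(logit, anchor=1500.0):
    """Map a Rasch logit (ability or difficulty) to an Elo-like rating (assumed anchor)."""
    return anchor + ELO_SCALE * logit


def author_ratings(difficulty, author_of):
    """Mean scaled difficulty of each author's problems (pass only valid problems)."""
    by_author = defaultdict(list)
    for problem, d in difficulty.items():
        by_author[author_of[problem]].append(to_elo(d))
    return {a: sum(v) / len(v) for a, v in by_author.items()}


def composite_rating(solver_elo, author_elo):
    """Equal weighting of the solver and author axes."""
    return 0.5 * solver_elo + 0.5 * author_elo
```

Filtering out invalid or excluded problems before calling `author_ratings` is left to the caller, since the source does not spell out the exact safeguards.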

Empirical Results and Observations

Experiments were conducted on 19 frontier LLMs from nine providers, using budgets of 30 problems per model across six mathematical domains. Key findings include:

Decoupled Authoring and Solving Abilities: Solving and authoring capacities are only partially correlated. For instance, GPT-5.4-high demonstrates the highest solver rating, yet Gemini-3.1-Pro-high achieves the top composite rating due to superior authoring capability (its authored problems obtained the lowest non-author solve rate, 62.9%). Grok-4.20-high exhibited the largest disparity between solving and authoring ratings.
Dynamic Difficulty Increase and Benchmark Saturation Resistance: Stronger models author more challenging problems that selectively defeat previous top performers, thereby continuously escalating the evaluation ceiling. For example, newcomer-authored problems broke earlier top-3 solvers at a rate 3.4× higher than the average participant.
Problem Area Effects: Domains show varying model performance. Discrete mathematics produced the lowest mean solve rates, attributable to the pipeline’s tendency to yield compact, verification-friendly, structure-concealed problems, while probability and statistics yielded the highest rates, reflecting more routine combinatorial arguments in the sampled pool.
Generation Pipeline Ablation: Each enhancement (meta-prompting, amplification) roughly doubled the solver error rate, supporting their contribution to problem hardness. Tool-augmented inference reduced errors only modestly, suggesting that the difficulty of MathDuels problems lies in reasoning rather than computation.
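The non-author solve rate quoted in the first finding is a simple ratio: over all attempts on a given author's problems by models other than that author, the fraction answered correctly. A minimal sketch follows, assuming outcomes are stored as `(solver, problem, solved)` triples and that `author_of` maps each problem to its author; both data layouts are hypothetical.

```python
from collections import defaultdict


def non_author_solve_rates(outcomes, author_of):
    """Fraction of correct attempts on each author's problems, excluding self-attempts."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for solver, problem, ok in outcomes:
        author = author_of[problem]
        if solver == author:
            continue  # authors do not count as solvers of their own problems
        total[author] += 1
        solved[author] += bool(ok)
    return {author: solved[author] / total[author] for author in total}
```

A lower value means the author's problems were harder for the rest of the pool.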

Robustness and Statistical Stability

Rating Stability: Bootstrap CIs indicate robust composite rank clusters at both performance extremes, with more volatility among mid-tier models (mean worst-case rank range 5.05 out of 19 at $K=30$).
Verifier Backbone Consistency: Adjudication using an alternative strong LLM backbone resulted in 97.5% agreement on problem inclusion/exclusion and 99.4% agreement on final answer selection for non-excluded problems. This robustness mitigates concerns about verifier-induced bias in benchmark outcomes.
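A bootstrap over problems gives a feel for how rank ranges like the one above can be obtained. The sketch below is a simplification under stated assumptions: it ranks solvers by raw solve rate instead of refitting the Rasch model on each replicate, it reports the minimum and maximum rank seen across replicates, and the function name, `n_boot`, and `seed` are introduced here for illustration. The paper's exact definition of the worst-case rank range may differ.

```python
import random


def bootstrap_rank_ranges(outcomes, n_boot=200, seed=0):
    """Resample problems with replacement; return {solver: (best_rank, worst_rank)}."""
    rng = random.Random(seed)
    outcomes = list(outcomes)
    problems = sorted({p for _, p, _ in outcomes})
    solvers = sorted({s for s, _, _ in outcomes})
    by_problem = {p: [(s, ok) for s, q, ok in outcomes if q == p] for p in problems}

    best = {s: len(solvers) for s in solvers}
    worst = {s: 1 for s in solvers}
    for _ in range(n_boot):
        sample = rng.choices(problems, k=len(problems))
        solved = dict.fromkeys(solvers, 0)
        seen = dict.fromkeys(solvers, 0)
        for p in sample:
            for s, ok in by_problem[p]:
                seen[s] += 1
                solved[s] += bool(ok)
        rate = {s: (solved[s] / seen[s] if seen[s] else 0.0) for s in solvers}
        # Rank 1 is the highest solve rate; ties are broken by name for determinism.
        order = sorted(solvers, key=lambda s: (-rate[s], s))
        for rank, s in enumerate(order, start=1):
            best[s] = min(best[s], rank)
            worst[s] = max(worst[s], rank)
    return {s: (best[s], worst[s]) for s in solvers}
```

Solvers at the extremes tend to keep the same rank in every replicate, while closely matched mid-tier solvers swap places, which is the pattern the stability analysis reports.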

Implications and Prospects

Theoretical Implications

MathDuels operationalizes self-play adversarial dynamics analogous to Elo-based rating systems and competitive game self-play in RL (e.g., AlphaZero), but with rigorous symbolic verification in place of subjective or human-preference outcomes. Jointly fitting the Rasch model’s latent traits provides interpretable and transferable axes for model assessment.

Practical Relevance

By rewarding both problem-posing creativity and solution proficiency, MathDuels exposes previously unmeasurable axes of mathematical competence, facilitating more fine-grained ranking, model diagnosis, and tracking of emergent capabilities. The framework is fully automated, domain-agnostic, and scales with benchmarking needs across AI modalities, with potential applications to programming, scientific reasoning, and iterative evaluation of model robustness.

Future Developments

Potential extensions include:

Scaling Match Budgets: Higher problem-per-model budgets would yield narrower rating CIs and more reliable rankings.
Proof-Based Scoring: Going beyond final-answer verification to proof-verification would further stress reasoning completeness and rigor.
Domain Generalization: Adaptation to coding, science, or other LLM benchmarks with verifiable outcomes.
Meta-Evaluation and Diagnostic Analytics: Richer analysis of problem typology, authoring patterns, and capability gaps can inform targeted LM improvement.

Conclusion

MathDuels establishes a dynamic, self-play evaluation paradigm for LLM mathematical reasoning that resists saturation and exposes multifaceted competence axes via head-to-head author-solver competition. As LLMs approach advanced mathematical proficiency and assist with open problem solving, dual-role evaluation with co-evolving difficulty will be critical for both model selection and incremental capability tracking. The framework is adaptable to other domains where challenging test item generation is itself a discriminative skill.

Reference: "MathDuels: Evaluating LLMs as Problem Posers and Solvers" (2604.21916)
