RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows (2510.09021v1)
Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
Explain it Like I'm 14
What this paper is about (overview)
Big AI models have recently gotten much better at solving hard math contest problems, like those from the International Mathematical Olympiad (IMO). But solving problems is only part of the story. In real contests and classrooms, graders don’t just say “right” or “wrong”—they give partial credit for good ideas and solid steps, even if the proof isn’t perfect.
This paper asks: can AI also grade math proofs fairly, especially when it comes to partial credit? The authors build and test “agentic workflows” (multi-step AI processes) that use reference solutions and custom rubrics to judge student-like proofs more like human graders do.
The key questions
The paper focuses on a few simple questions:
- Can AI reliably spot errors in math proofs?
- Can AI give fair partial credit (not just pass/fail)?
- Do reference solutions and step-by-step rubrics help the AI grade better?
- Which grading workflow works best and is most consistent with human judges?
How they did it (methods explained simply)
The authors studied grading on two collections of proofs:
- A set of 90 challenging IMO Shortlist problems (2017–2023). For each problem, they asked the Gemini 2.5 Pro model to write a solution; humans then annotated the errors and graded each solution on a 1–4 scale (later mapped to 0–7).
- A larger set of 385 solutions from MathArena (problems from IMO/USAMO 2025), already graded by human judges on a 0–7 scale.
They also tagged common types of mistakes (“fallacies”) in the model-generated proofs, using labels like “Proof by Example,” “Begging the Question (Circular Reasoning),” and “Calculation Mistakes.” This helped them understand exactly where and how the solutions went wrong.
First, they tried “single-turn grading”: give the problem and the solution to the AI once, and ask it to grade on a 0–7 scale. Then they introduced a more careful, multi-step “agentic workflow” that uses reference solutions (correct solutions collected from forums like Art of Problem Solving) to build a more tailored grading process.
To make the multi-step idea clear, here is the core 5-step workflow the paper proposes; it is the heart of their approach, and a rough code sketch follows the list:
- The AI groups reference solutions by their approach (similar ideas are clustered together).
- It matches the student’s solution to the closest group (the most similar style or strategy).
- It breaks the chosen reference solution into its main “aha moments” (big ideas) and smaller substeps.
- It designs a problem-specific rubric, spreading 7 total points across those steps.
- It compares the student’s proof to the rubric, checks for direct mistakes or contradictions with the reference, and assigns a fair score.
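In code, this pipeline might look roughly like the sketch below. It is a minimal illustration only: the `call_llm` helper, the function names, and the prompt wording are hypothetical placeholders rather than the paper's released implementation; only the five-step structure follows the description above.

```python
# Minimal sketch of the 5-step reference-based grading pipeline.
# `call_llm`, the function names, and the prompt wording are illustrative
# placeholders, not the paper's released code; only the step order follows
# the description above.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the grading model (e.g., Gemini 2.5 Pro)."""
    raise NotImplementedError

def cluster_references(problem: str, references: list[str]) -> str:
    # Step 1 (cacheable per problem): group reference solutions by approach.
    refs = "\n---\n".join(references)
    return call_llm(f"Problem:\n{problem}\n\nGroup these reference solutions by approach:\n{refs}")

def match_solution(clusters: str, solution: str) -> str:
    # Step 2 (per submission): pick the reference group closest to this solution.
    return call_llm(f"Reference groups:\n{clusters}\n\nWhich group best matches this solution?\n{solution}")

def extract_steps(reference_group: str) -> str:
    # Step 3 (cacheable per group): break the reference into "aha" steps and substeps.
    return call_llm(f"List the main ideas and substeps of this reference solution:\n{reference_group}")

def build_rubric(steps: str, total_points: int = 7) -> str:
    # Step 4 (cacheable per group): spread the 7 points across the extracted steps.
    return call_llm(f"Design a rubric distributing {total_points} points over these steps:\n{steps}")

def grade(solution: str, rubric: str, reference_group: str) -> str:
    # Step 5 (per submission): score against the rubric, checking for mistakes
    # or contradictions with the reference.
    return call_llm(f"Rubric:\n{rubric}\n\nReference:\n{reference_group}\n\n"
                    f"Grade this solution on a 0-7 scale:\n{solution}")
```

As the authors note, the clustering, step extraction, and rubric design can be cached per problem, so only the matching and grading calls need to run for each new submission.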
They tested variations of this workflow:
- Plain: the basic 5 steps as above.
- Approachability: the rubric gives more points to steps that are harder to discover (harder “aha moments” are worth more).
- Milestones: the rubric is based on clear milestones—intermediate results that match the reference solution’s key checkpoints.
- Hybrid: combines “Approachability” and “Milestones.”
- A simpler 3-step version (“No Rubrics”): just add a reference solution to a single-turn grader without building rubrics.
To judge how close the AI’s grades were to human grades, they used several measures:
- Do the AI’s grades rise and fall with the human grades? (correlation)
- On average, how far off is the AI from the human score? (error in points)
- How often is the AI within 1 or 2 points of the human grade? (near-miss rates)
- Agreement measures designed for ordered scores (like 0–7) that account for chance and skew in the data.
You don’t need the formulas; think of them as multiple ways to check “Are the AI’s grades both accurate and fair, especially across many different kinds of solutions?”
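If you do want to see these checks concretely, here is a minimal sketch using standard SciPy/scikit-learn routines. The tiny arrays and variable names are made up for illustration; only the 2x − 1 mapping from the 1–4 annotation scale to 0–7 comes from the paper, and Gwet's AC2 is omitted because these libraries provide no off-the-shelf routine for it.

```python
# Sketch of agreement checks between model grades and human grades.
# The example arrays are made up; the metric routines are standard SciPy/scikit-learn.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

human = np.array([1, 3, 4, 2, 4])        # human grades on the 1-4 annotation scale
model = np.array([3, 5, 7, 1, 7])        # model grades on the 0-7 contest scale

human_07 = 2 * human - 1                 # the paper's mapping m(x) = 2x - 1 onto 0-7

abs_err  = np.abs(model - human_07)
mae      = abs_err.mean()                              # average points off
rmse     = np.sqrt(((model - human_07) ** 2).mean())
off_by_1 = (abs_err <= 1).mean()                       # near-miss rates
off_by_2 = (abs_err <= 2).mean()
pearson  = pearsonr(model, human_07)[0]                # do grades rise and fall together?
spearman = spearmanr(model, human_07)[0]
qwk = cohen_kappa_score(human_07, model, weights="quadratic",
                        labels=list(range(8)))         # chance-corrected ordinal agreement
# Gwet's AC2 (robust to skewed grade frequencies) needs a dedicated implementation.

print(f"MAE={mae:.2f} RMSE={rmse:.2f} off-by-1={off_by_1:.2f} "
      f"off-by-2={off_by_2:.2f} r={pearson:.2f} rho={spearman:.2f} QWK={qwk:.2f}")
```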
What they found (main results and why they matter)
- Single-turn grading is okay at telling perfect solutions from imperfect ones. It can often say “this solution is correct” versus “this has problems.”
- But it struggles with partial credit. It tends to give too much credit to weak or half-baked solutions, especially in the low- to mid-score range. In other words, it’s overly optimistic about incomplete proofs.
- The multi-step reference-based workflows are much better. When the AI:
- matches the student’s approach to similar correct solutions,
- breaks the reference down into clear “aha” steps,
- and uses a problem-specific rubric,
it becomes more consistent with human judges and assigns fairer partial credit.
- Among the 5-step variants, “Approachability” and “Milestones” performed best overall:
- Approachability helps the AI value hard-to-find ideas more appropriately.
- Milestones provide clear checkpoints that are easy to match and grade.
- The simpler 3-step version (just add a reference solution without rubrics) helps somewhat but not as much as the full 5-step approach.
- The “Hybrid” (Approachability + Milestones together) didn’t beat the best single approaches—likely because “step difficulty” and “milestone progress” don’t always align perfectly.
- Importantly, the gains don’t come just from spending more compute or tokens. They tested sampling and averaging and found that the structure of the workflow itself (using references and rubrics) is what drives the improvements.
- Cost-wise, they note that some steps (like clustering references and designing rubrics) can be cached, so only matching and grading need to run per new solution.
Overall, the workflows:
- reduce over-scoring of weak solutions,
- better recognize partial progress (like getting big ideas but missing a tricky step),
- keep strong performance in identifying fully correct solutions,
- and agree more with human judges across multiple fairness and accuracy checks.
Why this matters (implications and impact)
- For math contests and classrooms: Fair partial credit is essential. A grader that understands the common solution paths and uses problem-specific rubrics can give students more consistent, transparent, and constructive feedback.
- For AI as judges: These workflows are more reliable and explainable. They tie grades to steps and reference solutions, making it clearer why a score was assigned.
- For training better AI problem solvers: The same grading system can act as a “reward model” for reinforcement learning—guiding AI toward complete, correct proofs by scoring step-by-step progress.
- For education technology: With the right guardrails and reference solutions, this approach could help teachers grade proofs faster and give students detailed hints on missing ideas or common mistakes.
The authors released their code, data, and prompts so others can build on their work. In short, as AI gets better at doing math, this paper shows a practical way to make AI better at judging math, too—especially the tricky part of partial credit.
Knowledge Gaps
Below is a concise, actionable list of knowledge gaps, limitations, and open questions that remain unresolved by the paper. Future work could address these points to strengthen evidence, broaden applicability, and improve reliability.
- Limited dataset scope and representativeness: only 90 IMO Shortlist problems with one Gemini 2.5 Pro solution each, plus MathArena IMO/USAMO 2025; unclear generalization to broader problem sets, difficulty levels, and solution distributions (e.g., human student proofs, research-style proofs).
- Sole reliance on model-generated solutions for the 90-problem corpus: no evaluation on human-written solutions; unclear robustness to human writing styles, non-linear narratives, omissions, or idiosyncratic notation.
- No reported inter-annotator agreement for the 1–4 human grades and fallacy annotations; reliability of ground truth labels and error tags is unknown.
- Coarse 1–4 grading scale mapped to 0–7 using m(x)=2x−1: potential distortion of error metrics and partial-credit granularity; no sensitivity analysis for alternative mappings.
- Zero-score subsampling on MathArena (p=0.14) used for figures/tables: possible bias in reported metrics; results on the full, zero-inflated distribution are not reported.
- Reference solution quality control: AoPS solutions may contain mistakes or non-standard steps; no procedure is described to verify reference correctness or to filter noisy/contradictory references.
- Handling novel or uncommon approaches: the workflow presumes matching to a similar reference; no mechanism or fallback is specified when the submitted solution’s approach has no close match in the reference set.
- Approach-agnostic correctness: reliance on “contradiction with the reference solution” risks penalizing correct but stylistically different approaches; the system’s ability to recognize equivalence of different intermediate statements is untested.
- Clustering and matching stability: the paper does not quantify clustering quality, matching accuracy, or sensitivity to the number/diversity of references; no ablation on mismatches or “no-good-match” detection thresholds.
- Rubric validity and alignment: LLM-induced rubrics (plain, approachability, milestones) are not validated against official contest rubrics or expert graders; it is unknown if point allocations are acceptable to human judges.
- Subjectivity of “approachability” weights: step difficulty judgments are model-generated and uncalibrated; no inter-rater or human–LLM agreement is reported for approachability scores.
- Bias toward reference-like reasoning: milestone-based scoring may undervalue creative reasoning or different decompositions leading to the same conclusion; fairness across diverse valid solution structures is unmeasured.
- Topic-dependent performance: no breakdown of grading accuracy by subject (e.g., geometry, number theory, combinatorics, algebra); sensitivity to diagram-heavy or construction-heavy problems remains open.
- Error detection analysis by fallacy type is missing: although fallacies are annotated in the 90-solution corpus, the grader’s recall/precision per fallacy category (and for subtle vs blatant errors) is not evaluated.
- Adversarial robustness: no tests on intentionally deceptive or superficially plausible-but-wrong proofs; vulnerability to strategic gaming of milestones or rubric phrasing is unknown.
- Error localization quality: the system outputs error tags/locations, but there is no metric-based evaluation of localization accuracy, coverage, or usefulness for feedback.
- Calibration to human graders: no comparison to human–human agreement (e.g., QWK/AC2 between human graders) to contextualize how close the LLM is to human consistency.
- Statistical uncertainty: no confidence intervals or significance tests reported; it is unclear whether observed improvements over baselines are statistically reliable across resamples.
- Cost–latency–scalability trade-offs: token usage, runtime, and caching savings are not quantified; feasibility for large-scale deployment (e.g., thousands of scripts) is unassessed.
- Model dependence and reproducibility: experiments primarily use Gemini 2.5 Pro as judge; cross-model robustness (e.g., Claude, open-source LLMs) and reproducibility without closed APIs are not evaluated.
- Solver–judge coupling risk: grading Gemini-generated solutions with Gemini-based judges may induce same-family bias; cross-family grading performance is unreported.
- Multimodality gap: many olympiad solutions (especially geometry) rely on diagrams; the workflow is text-centric and does not evaluate multimodal grading.
- Integration with formal verification: no hybrid pipeline combining learned rubrics with lightweight proof checkers or formal systems to validate key lemmas/claims.
- Sparse or absent references: the approach assumes plentiful high-quality references; the behavior and performance when references are few, noisy, or unavailable remain unclear.
- Matching and rubric errors compounding: no analysis of how mistakes in clustering/matching/step extraction propagate to grading; lack of robustness measures or error-correction mechanisms.
- Post-hoc calibration: the paper does not explore calibrating predicted grade distributions to human score distributions (e.g., ordinal calibration, isotonic/temperature scaling).
- Ensembling is promising but underexplored: cross-method averaging improved accuracy; systematic ensemble design, diversity analysis, and budget-aware ensembling strategies are not studied.
- Educational usability: no user studies with instructors/contest graders to assess rubric interpretability, transparency, and perceived fairness of partial-credit assignments and feedback.
- Human-in-the-loop triage: no mechanism for uncertainty estimation or deferral on borderline cases; triage strategies to allocate human review where the model is least certain are not explored.
- Longitudinal adaptability: how rubrics and reference libraries evolve over time, and whether graders can continually learn from newly graded scripts or adjudications, is unaddressed.
- Cross-lingual generalization: references and solutions are in English; grading performance on non-English proofs and multilingual reference sets is unknown.
- Ethical and privacy considerations: scraping and using community solutions (AoPS) as references raises questions about licensing, attribution, and potential leakage of contest materials; not discussed.
Practical Applications
Immediate Applications
Based on the paper’s findings and released code/data, the following applications can be deployed now with human oversight and appropriate references.
- Education (Higher-ed, secondary, MOOCs): Autograding of proof-based assignments and exams
- What: Use the 5-step Ref-Grader (Approachability or Milestones) to grade student proofs with calibrated partial credit and step-referenced feedback (aha-moment steps, error types, missing milestones).
- Tools/products/workflows: LMS plugin or API service (“RefGrader API”), instructor dashboard for rubric induction and caching (steps 1/3/4 offline; steps 2/5 online), student-facing “Proof Feedback Studio”.
- Assumptions/dependencies: Availability of high-quality reference solutions aligned to the syllabus; teacher approval of induced rubrics; guardrails for edge cases; privacy/compliance for student data; model access/cost; acceptance of ordinal agreement metrics (AC2/QWK) for audits.
- Competition platforms (Math Olympiads, MathArena, AoPS): Pre-scoring and triage of submissions
- What: Automatically cluster reference solutions, match submissions to the closest approach family, and assign partial credit before human review; flag likely zero-progress vs near-complete solutions.
- Tools/products/workflows: Contest “triage queue,” rubric cache per problem; auditor panel with AC2/QWK reports; integration with submission portals for structured error feedback.
- Assumptions/dependencies: Curated reference sets per problem; adjudicator oversight; caching to keep costs low; clear contest policies on AI-assist and appeals.
- Tutoring and self-study (Daily life): Personalized proof feedback and progress tracking
- What: Students upload solutions and receive milestone-based feedback and partial-credit estimation; guidance on missing steps and common fallacies (e.g., Proof by Example, Circular Reasoning).
- Tools/products/workflows: Mobile/web “Proof Coach” integrating RefGrader; personal milestone tracker; hint generation tied to aha-moment decomposition.
- Assumptions/dependencies: Reference solutions or high-quality worked examples; model reliability on targeted topics; disclaimers that grades are advisory.
- Academic benchmarking (AI research): Consistent, rubric-aware LLM-as-a-judge for math evaluations
- What: Replace binary validity checks with calibrated partial-credit judgments across model outputs; report a suite of metrics (Pearson/Spearman, MAE/RMSE, Off-by-1/2, QWK, AC2).
- Tools/products/workflows: Evaluation harness using 3-step Ref-Grader (No Rubrics) for quick screens and 5-step variants for high-fidelity grading; benchmark leaderboards with agreement audits.
- Assumptions/dependencies: Comparable reference coverage across tasks; documented grader prompts; published agreement analyses; human spot checks to prevent drift.
- Model training (AI/software): Reward modeling for proof-generating LLMs
- What: Use rubric-informed, reference-grounded scores as dense rewards for reinforcement learning to steer model trajectories toward complete, correct proofs.
- Tools/products/workflows: Training pipelines integrating RefGrader outputs; curriculum that aligns rewards to milestones; logging of per-step rationales for interpretability.
- Assumptions/dependencies: Stable grader behavior across updates; careful reward shaping to avoid gaming; compute budgets; licensing of references.
- Technical screening (Industry: finance/tech): Automated grading of candidate proof tasks in interviews
- What: Evaluate candidate reasoning in quantitative roles (e.g., proving algorithm properties, optimization arguments) with partial credit and error localization.
- Tools/products/workflows: Recruiting portal integration; rubric induction per question; consistency audits via AC2; human-in-the-loop adjudication.
- Assumptions/dependencies: Domain-appropriate references; fairness/bias monitoring; candidate consent and data handling.
- Educational policy and grader calibration: Cross-grader consistency tools
- What: Use induced rubrics and pooled agreement metrics (AC2) to calibrate human graders, reduce variance in partial-credit assignment, and standardize grading guides.
- Tools/products/workflows: “Rubric Studio” to generate, compare, and lock rubrics; grader calibration sessions with confusion matrix visualizations; appeal workflows.
- Assumptions/dependencies: Institutional acceptance; policies for tie-breaking and rubric updates; documented error taxonomies.
Long-Term Applications
These applications are plausible extensions but require further research, validation, scaling, or policy changes.
- Standardized testing (Policy/education): AI-assisted partial-credit scoring at scale
- What: Deploy RefGrader-like workflows for math-intensive standardized exams (e.g., AP, IB, national olympiads), with audited fairness, reliability, and appeals.
- Tools/products/workflows: Secure rubric repositories per item; metadata-driven reference retrieval; formal bias audits with AC2/QWK and subgroup analyses.
- Assumptions/dependencies: Regulatory approval; robust generalization across demographics, languages, and curricula; transparency requirements; secure infrastructure.
- Cross-domain reasoning graders (Education, law, healthcare, engineering): From math proofs to structured arguments
- What: Adapt reference-aided rubric induction to physics derivations, algorithmic proofs, clinical guideline alignment, and legal argumentation (evidence/milestones).
- Tools/products/workflows: Domain-specific reference libraries; milestone taxonomies beyond math; mixed expert-AI rubric authoring; risk-tiered deployment.
- Assumptions/dependencies: High-quality domain references; reliability on non-mathematical logic; new error taxonomies; strong governance for safety-critical fields (healthcare, law).
- Formal methods integration (Software/robotics): Hybrid rubric + formal verification
- What: Combine RefGrader’s milestone rubrics with Lean/Isabelle/Coq backends to check correctness while scoring human-readable progress.
- Tools/products/workflows: “Proof Bridge” that maps natural-language steps to formal lemmas; dual pipelines for readability (rubric) and soundness (formal proof); developer IDE extensions.
- Assumptions/dependencies: Mappings from informal to formal corpora; coverage of formal libraries; usability for non-experts; substantial engineering.
- Autonomous contest/journal pre-screening (Academia): End-to-end triage of submissions
- What: Automatically cluster approaches, estimate completeness and novelty, and flag likely errors before human review in contests or math journals.
- Tools/products/workflows: Approach-family indexing; novelty detection; “Fallacy Radar” for subtle logical mistakes; editor dashboards.
- Assumptions/dependencies: IP permissions on references; tolerance for false positives/negatives; community acceptance; audit trails.
- Knowledge-base mining and pedagogy (Education): Aha-moment analytics for curriculum design
- What: Mine reference solution clusters to identify teaching bottlenecks; reorder content based on approachability scores and typical error patterns.
- Tools/products/workflows: Milestone analytics; syllabus recommender; automated hint banks aligned to steps; formative assessment generators.
- Assumptions/dependencies: Representative datasets; stable step-level scoring; instructor buy-in; iterative validation studies.
- Multilingual and accessibility expansion (Global education): Inclusive grading and feedback
- What: Extend workflows to diverse languages and notation conventions; accessible feedback for students with different learning needs.
- Tools/products/workflows: Localized references; multilingual prompting; notation normalization; accessibility features (e.g., structured step summaries).
- Assumptions/dependencies: Language coverage in frontier models; localized corpora; cultural alignment of rubrics; accessibility compliance.
- AI governance and evaluation policy (Policy/AI): Standardized judge protocols with agreement guarantees
- What: Use pooled-marginal metrics (AC2) and ordinal kappa as audit standards for LLM-as-a-judge systems; publish calibration reports as part of model evaluations.
- Tools/products/workflows: “Judge Card” templates with metric dashboards; rater drift monitoring; sampling/ensembling protocols; transparency commitments.
- Assumptions/dependencies: Community convergence on audit metrics; reproducible evaluator prompts; third-party oversight.
- Personalized learning analytics (Daily life/EdTech): Longitudinal milestone tracking
- What: Monitor learners’ progress at the step/milestone level across courses; recommend targeted practice and measure mastery beyond final answers.
- Tools/products/workflows: Milestone heatmaps; micro-rubrics per skill; adaptive homework generation; parental/teacher reports.
- Assumptions/dependencies: Data integration across platforms; privacy-preserving analytics; robust per-learner calibration; equitable recommendations.
Glossary
- AC2 (Gwet’s AC2): An agreement coefficient for ordinal ratings that uses pooled marginals, making it more robust to skewed category frequencies than QWK. "To assess agreement between {hat{g}_i} and {g_i}, we report Pearson and Spearman correlations, mean absolute error (MAE), root mean squared error (RMSE), off-by-one and off-by-two tolerance rates, quadratic weighted kappa (QWK), and Gwet's AC2."
- Agentic workflows: Multi-step, goal-directed LLM procedures that coordinate subtasks (e.g., reference extraction, rubric creation) to improve grading quality. "we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem‑specific rubrics for a multi‑step grading process."
- Approachability (scores): A step-level difficulty measure (1–5) indicating how likely a solver is to choose or discover a solution step; used to weight rubric points. "the 5-step Ref-Grader (Approachability), which in step 3 computes step-level approachability scores (1-5, measuring how hard a main step is to be chosen)"
- Automated theorem proving (ATP): The use of formal systems and algorithms to prove mathematical theorems automatically. "automated theorem proving (ATP) datasets target formal theorem proving"
- Binarized evaluation: Reducing proof assessment to a binary valid/invalid judgment rather than assigning partial credit. "An alternative is to binarize proofs and measure agreement with expert judges"
- Calibration gaps: Systematic discrepancies between predicted and true scores, reflecting miscalibrated partial credit or severity judgments. "models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned."
- Chain-of-Thought: A prompting strategy that elicits step-by-step reasoning traces from LLMs to improve problem solving. "including Chain-of-Thought and self-consistency"
- Confusion matrix (normalized): A table of prediction versus ground-truth frequencies scaled to probabilities to visualize grading errors across score categories. "Normalized confusion matrices for single-turn grading on MathArena and IMO Shortlist."
- Formal verification: Checking mathematical proofs within a formal system to guarantee correctness by machine-verifiable standards. "Formal verification offers a principled solution to validation"
- Milestone-based rubric: A grading scheme that awards points for reaching specific intermediate statements or “milestones” in a reference solution. "the 5-step Ref-Grader (Milestones), which in step 4 designs the rubric by milestones reached"
- Off-by-one tolerance rate: The fraction of predictions within one point of the true grade, used to summarize near-miss accuracy. "we report Pearson and Spearman correlations, mean absolute error (MAE), root mean squared error (RMSE), off-by-one and off-by-two tolerance rates, quadratic weighted kappa (QWK), and Gwet's AC2."
- Off-by-two tolerance rate: The fraction of predictions within two points of the true grade, used to measure broader near-miss accuracy. "we report Pearson and Spearman correlations, mean absolute error (MAE), root mean squared error (RMSE), off-by-one and off-by-two tolerance rates, quadratic weighted kappa (QWK), and Gwet's AC2."
- Ordinal labels: Ordered categorical grades where the label distances carry rank information but not interval magnitude. "QWK ... measures agreement on ordinal labels while accounting for chance."
- Pooled marginal distribution: An average of raters’ category frequencies used by AC2 to compute chance disagreement more robustly. "It replaces the independence baseline p_i q_j with a pooled marginal distribution π_i computed across raters"
- Quadratic weighted kappa (QWK): A chance-corrected agreement measure for ordinal ratings that penalizes larger disagreements more heavily (a standard formula sketch follows this glossary). "Quadratic weighted kappa (QWK)."
- Reference solution clustering: Grouping multiple reference solutions by similarity to organize grading around the most relevant approach. "Reference Solution Clustering: The model clusters the reference solutions into groups based on their similarity."
- Rubric induction: Automatically deriving problem-specific grading criteria (point allocations and rules) from reference solutions. "a 3-step reference variant without rubric induction."
- Self-consistency: An inference-time strategy that samples multiple reasoning paths and aggregates them to improve reliability. "including Chain-of-Thought and self-consistency"
- Solution matching: Selecting the reference solution group most similar to the submitted solution to guide analysis and scoring. "Solution Matching: The model finds the most similar group of reference solutions to the given solution and uses it as a reference to grade the given solution."
- Zero-inflated distribution: A distribution with an excess of zeros relative to standard models, common when many solutions earn no credit. "The MathArena grade distribution is zero-inflated because many model-generated solutions receive a zero"
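For reference, the quadratic weighted kappa described in the glossary entries above has the standard textbook form sketched below (this is the generic definition, not a formula reproduced from the paper); Gwet's AC2 follows the same weighted-disagreement idea but swaps the independence baseline for pooled marginals across raters.

```latex
% Standard quadratic weighted kappa over K ordered grade categories.
% O_{ij}: observed proportion of items graded i by one rater and j by the other;
% E_{ij}: expected proportion under independent rater marginals;
% w_{ij}: quadratic penalty that grows with the size of the disagreement.
\[
  w_{ij} = \frac{(i - j)^2}{(K - 1)^2},
  \qquad
  \kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}.
\]
```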