LiveAoPSBench: Dynamic LLM Math Benchmark
- LiveAoPSBench is a dynamically evolving benchmark designed to assess large language models on Olympiad-level mathematical problems.
- It employs an automated pipeline—including forum scraping, problem detection, solution rewriting, and timestamping—to generate high-fidelity QA pairs while preventing contamination.
- The benchmark reveals temporal performance decay in LLMs and sets a new standard for rigorous, scalable evaluation of mathematical reasoning.
LiveAoPSBench is a dynamically evolving, timestamped benchmark for evaluating LLMs on Olympiad-level mathematical reasoning tasks. It is derived from continuous, automated mining of the Art of Problem Solving (AoPS) forum, which hosts community-generated problems and solutions spanning pre-college to Olympiad mathematics. LiveAoPSBench was specifically designed to address the limitations of static, contamination-prone benchmarks in accurately assessing the true mathematical reasoning capacities of LLMs (Mahdavi et al., 24 Jan 2025).
1. Automated Pipeline for Construction
The creation of LiveAoPSBench leverages an extensive automated pipeline to enable scalable, high-fidelity benchmark generation with minimal human intervention:
- Raw Forum Scraping
- The system crawls the entire AoPS community forum, scraping 1,076,712 “topics” (threads containing posts and replies).
- Each topic is stored along with metadata (topic_id, author, timestamp, raw HTML content, and problem-specific tags such as “Algebra” or “High School Olympiad”).
- Math-Question Detection (Step 1)
- A Qwen 2.5 14B LLM classifier uses few-shot prompting to identify whether the first post in each topic poses a genuine Olympiad-level problem (\boxed{1}), pruning to 478,337 qualifying “math-question” topics.
- Question-Answer Parsing (Step 2)
- For each selected topic, Llama 3.1 70B produces:
- 1. The precise competition-style problem statement (in LaTeX),
- 2. All valid solution posts,
- 3. A JSON record combining the problem with its annotated answers.
- Solution Rewriting (Step 3)
- To enforce detailed, chain-of-thought reasoning, each solution is rewritten using Qwen 2.5 72B. This converts terse user responses into explicit stepwise solutions, for example, breaking multiplicative order deductions into annotated logical steps.
- Timestamp Assignment & Versioning
- Every QA pair is timestamped according to its original forum post. Calendar-based splits (e.g., 2023-split, 2024-split) provide rolling, contamination-resistant benchmarks. As new forum problems are posted, monthly LiveAoPSBench updates are automatically generated.
- Contamination-Resistance (Step 4)
- Decontamination: Stringent n-gram substring filters remove problems overlapping with public training data (10-gram for training set, stricter 8-gram for evaluation set).
- Cross-LLM Validation: Solutions are independently rewritten by two different LLMs (Llama 3.1 70B-Ins and Qwen 2.5 72B-Ins); final answers must agree via string/numeric/symbolic (SymPy-equivalence) match.
- Deduplication: Hash-based and fuzzy string matching prunes duplicates.
- Human Verification: Random 10% sample spot-checked by two graduate annotators; 88% unanimous correctness, inter-annotator agreement ≈ 0.91.
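The n-gram decontamination step above can be illustrated with a minimal sketch. It assumes word-level n-grams over lowercased, whitespace-split text; the source does not specify the tokenization, and the function names `word_ngrams`, `build_ngram_index`, and `decontaminate` are hypothetical.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text`.

    Assumption: lowercasing plus whitespace tokenization stands in for
    whatever normalization the real pipeline uses.
    """
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_ngram_index(corpus: list, n: int) -> set:
    """Collect every n-gram that occurs anywhere in the reference corpus."""
    index = set()
    for doc in corpus:
        index |= word_ngrams(doc, n)
    return index


def decontaminate(problems: list, corpus: list, n: int) -> list:
    """Keep only the problems that share no n-gram with the reference corpus."""
    index = build_ngram_index(corpus, n)
    return [p for p in problems if word_ngrams(p, n).isdisjoint(index)]
```

A smaller `n` flags shorter overlaps, which is why the 8-gram filter used for the evaluation set is stricter than the 10-gram filter used for training data.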
2. Dataset Composition and Statistics
LiveAoPSBench’s evolving splits are broken down by number of answers, problem difficulty, and subject area:
| Statistic | 2024 Split Value | Notes |
|---|---|---|
| Total problems (M) | 3,863 | January–August 2024 |
| Single-answer problems | 60.4% | |
| Two-answer problems | 24.1% | |
| ≥3-answer problems | 15.5% | |
| Difficulty: Middle School | 7.4% | By AoPS tag |
| Difficulty: High School | 34.9% | |
| Difficulty: College | 8.1% | |
| Difficulty: High-School Olympiads | 25.2% | |
| Difficulty: Other categories | 24.4% | |
| Subject: Algebra | 28% | |
| Subject: Combinatorics | 21% | |
| Subject: Geometry | 19% | |
| Subject: Number Theory | 18% | |
| Subject: Inequalities & Analysis | 14% | |
The benchmark grows roughly linearly, adding a nearly constant number of new qualified QA pairs per month since its January 2023 start.
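Calendar-based splits and the monthly growth rate can both be derived from the post timestamps. The sketch below assumes each QA record is a dictionary with an ISO-8601 `timestamp` field; the field name and the helper names `split_by_year` and `monthly_counts` are illustrative, not taken from the benchmark's code.

```python
from collections import Counter
from datetime import datetime


def split_by_year(records: list) -> dict:
    """Group QA records by the calendar year of their forum-post timestamp."""
    splits = {}
    for rec in records:
        year = datetime.fromisoformat(rec["timestamp"]).year
        splits.setdefault(year, []).append(rec)
    return splits


def monthly_counts(records: list) -> Counter:
    """Count records per month, keyed as 'YYYY-MM'."""
    return Counter(
        datetime.fromisoformat(rec["timestamp"]).strftime("%Y-%m")
        for rec in records
    )
```

Dividing the total count by the number of months covered gives an average monthly rate, and a roughly flat `monthly_counts` profile would correspond to the near-constant growth described above.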
3. Evaluation Protocol
LiveAoPSBench utilizes a zero-shot, chain-of-thought prompting paradigm, with the following workflow:
- Prompt Format
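- Each model receives the problem statement together with an instruction to reason step by step and to place its final answer inside \boxed{}.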
- Models may generate unrestricted step-by-step reasoning, but only the boxed final answer is parsed and scored.
- Automated Scoring
- 1. Numeric: direct value match.
- 2. Symbolic: equivalence via automated SymPy checks.
- 3. Full or zero credit; partial credit (for multipart answers) is proposed but not enabled by default.
- Performance Metrics
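- Primary: accuracy, the fraction of problems whose boxed final answer matches the reference answer.
- Optional (for multipart answers): partial-credit accuracy, the fraction of answer parts that match.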
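The scoring workflow above can be sketched with the standard library alone. This is a simplified illustration, not the benchmark's scorer: it covers string and numeric matching only, leaves out the SymPy symbolic-equivalence check, and uses hypothetical helper names (`extract_boxed`, `to_number`, `answers_match`, `accuracy`).

```python
from fractions import Fraction


def extract_boxed(text: str):
    # Return the content of the last \boxed{...} in `text`, or None if absent.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content).strip()
        content.append(ch)
    return None  # braces never closed


def to_number(s: str):
    """Parse integers, decimals, and a/b fractions exactly; None otherwise."""
    try:
        return Fraction(s.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(pred, ref: str) -> bool:
    """String match first, then exact numeric match (symbolic check omitted)."""
    if pred is None:
        return False
    if pred.replace(" ", "") == ref.replace(" ", ""):
        return True
    p, r = to_number(pred), to_number(ref)
    return p is not None and p == r


def accuracy(outputs: list, references: list) -> float:
    """All-or-nothing accuracy; assumes `references` is non-empty."""
    correct = sum(
        answers_match(extract_boxed(out), ref)
        for out, ref in zip(outputs, references)
    )
    return correct / len(references)
```

Taking the last `\boxed{}` occurrence means intermediate boxed expressions in the reasoning are ignored, which matches the rule that only the final answer is scored.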
4. Empirical Patterns and Contamination Effects
Comparative evaluation across annual splits (2023 vs. 2024) with 18 diverse LLMs yields several findings:
Performance Decay Over Time
- All models show a consistent drop in accuracy: each model scores substantially higher on the 2023 split than on the newer 2024 split.
- For instance, Qwen 2.5 72B-Ins loses accuracy when moving from the 2023 split to the 2024 split.
- The size of the drop varies across models.
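- Fitting a simple linear regression of accuracy against time, $\mathrm{Acc}(t) = \alpha + \beta\,t$ with $t$ in months, yields a negative slope $\beta$, i.e., a typical accuracy loss per month.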
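A regression of this kind can be computed with ordinary least squares. The sketch below is a generic implementation with a hypothetical helper name, `fit_line`. The numbers in the usage note are invented purely for illustration and are not results from the benchmark.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope). Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

With months since the start as `xs` and per-month accuracy in percent as `ys`, a made-up series such as `fit_line([0, 1, 2, 3], [40.0, 39.0, 38.0, 37.0])` gives a slope of -1, read as one accuracy point lost per month.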
Contamination Linkage
- Higher overlap between evaluation and pre-training datasets leads to greater inflation in accuracy on older splits.
- A positive Pearson correlation is observed between the pairwise-overlap rate and the accuracy drop across model families.
A plausible implication is that static benchmarks become compromised as soon as their items propagate into LLM training data, inflating the perceived progress of mathematical reasoning abilities.
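The Pearson correlation between overlap rate and accuracy drop can be computed as follows. This is a generic sketch with a hypothetical function name, `pearson_r`. The inputs would be one overlap rate and one accuracy drop per model, neither of which is reproduced here.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length samples.

    Assumes both samples have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 would indicate that models with more train/eval overlap lose more accuracy on the newer split.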
5. Contamination-Resistance and Verification Strategies
LiveAoPSBench’s methodology systematically targets the major contamination pathways inherent in open-model evaluation:
- N-gram Decontamination:
- Evaluation set employs 8-gram exact substring filtering against all public math corpora to minimize accidental pre-training overlap.
- Training data uses a more permissive 10-gram filter.
- Cross-LLM Agreement:
- An example from the dataset is retained only if both LLMs produce the exact same boxed answer, verified via string match, numerical match, or symbolic equivalence.
- Deduplication and Human Verification:
- Near-duplicate detection (hash-based and fuzzy string matching) removes repeated problems, while periodic spot-checking of a random 10% sample (88% correctness, inter-rater reliability ≈ 0.91) verifies fidelity and quantifies the residual error rate.
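The two-stage deduplication can be sketched with `hashlib` for exact matches and `difflib.SequenceMatcher` for fuzzy matches. The normalization, the 0.9 similarity threshold, and the names `normalize` and `deduplicate` are assumptions made for illustration; the source does not state which hashing or fuzzy-matching method is used.

```python
import hashlib
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(problems: list, fuzzy_threshold: float = 0.9) -> list:
    """Keep the first occurrence of each problem, dropping exact and near duplicates.

    The fuzzy pass compares each candidate with every kept problem, so it is
    quadratic in the worst case; a real pipeline would likely need blocking
    or MinHash-style indexing at forum scale.
    """
    seen_hashes = set()
    kept = []
    kept_normalized = []
    for problem in problems:
        norm = normalize(problem)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        if any(
            SequenceMatcher(None, norm, other).ratio() >= fuzzy_threshold
            for other in kept_normalized
        ):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(problem)
        kept_normalized.append(norm)
    return kept
```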
6. Significance, Limitations, and Prospects
LiveAoPSBench sets a precedent for evolving, timestamped benchmarks in evaluating LLM mathematical reasoning:
- Dynamic Updating: Continuous monthly expansion prevents obsolescence and circumvents silent contamination.
- Automated Scalability: By mining crowd-sourced forum discussions and their solutions, the pipeline can extract tens of thousands of Olympiad-level QA pairs with minimal supervision.
- Strict Decontamination: Eight-gram filtering and cross-model answer agreement, backed by a human audit, yield roughly 88% verified correctness on the audited sample.
- Revealing Temporal Artifacts: The persistent accuracy decline on newer splits indicates most prior benchmarking (on static test sets) likely overstates LLM mathematical generalization.
Key future directions include:
- Extending evaluation beyond answer-scoring to full proof-writing assessment,
- Incorporating visual geometry by parsing forum-supplied diagrams,
- Adapting the pipeline and contamination-resistant principles to other STEM domains (e.g., physics, computer science), broadening reliable LLM evaluation (Mahdavi et al., 24 Jan 2025).