LiveAoPSBench: Dynamic LLM Math Benchmark

Updated 13 March 2026
  • LiveAoPSBench is a dynamically evolving benchmark designed to assess large language models on Olympiad-level mathematical problems.
  • It employs an automated pipeline—including forum scraping, problem detection, solution rewriting, and timestamping—to generate high-fidelity QA pairs while preventing contamination.
  • The benchmark reveals temporal performance decay in LLMs and sets a new standard for rigorous, scalable evaluation of mathematical reasoning.

LiveAoPSBench is a dynamically evolving, timestamped benchmark for evaluating LLMs on Olympiad-level mathematical reasoning tasks. It is derived from continuous, automated mining of the Art of Problem Solving (AoPS) forum, which hosts community-generated problems and solutions spanning pre-college to Olympiad mathematics. LiveAoPSBench was specifically designed to address the limitations of static, contamination-prone benchmarks in accurately assessing the true mathematical reasoning capacities of LLMs (Mahdavi et al., 24 Jan 2025).

1. Automated Pipeline for Construction

The creation of LiveAoPSBench leverages an extensive automated pipeline to enable scalable, high-fidelity benchmark generation with minimal human intervention:

  • Raw Forum Scraping
    • The system crawls the entire AoPS community forum, scraping 1,076,712 “topics” (threads containing posts and replies).
    • Each topic is stored along with metadata (topic_id, author, timestamp, raw HTML content, and problem-specific tags such as “Algebra” or “High School Olympiad”).
  • Math-Question Detection (Step 1)
    • A Qwen 2.5 14B LLM classifier uses few-shot prompting to decide whether the first post in each topic poses a genuine Olympiad-level problem (a positive decision is output as \boxed{1}), pruning the set to 478,337 qualifying “math-question” topics.
  • Question-Answer Parsing (Step 2)
    • For all selected topics, Llama 3.1 70B extracts:
      1. The precise competition-style problem statement (in LaTeX),
      2. All valid solution posts,
      3. JSON output with problem and annotated answers.
  • Solution Rewriting (Step 3)
    • To enforce detailed, chain-of-thought reasoning, each solution is rewritten using Qwen 2.5 72B. This converts terse user responses into explicit stepwise solutions, for example, breaking multiplicative order deductions into annotated logical steps.
  • Timestamp Assignment & Versioning
    • Every QA pair is timestamped according to its original forum post. Calendar-based splits (e.g., 2023-split, 2024-split) provide rolling, contamination-resistant benchmarks. As new forum problems are posted, monthly LiveAoPSBench updates are automatically generated.
  • Contamination-Resistance (Step 4)
    • Decontamination: Stringent n-gram substring filters remove problems overlapping with public training data (10-gram for training set, stricter 8-gram for evaluation set).
    • Cross-LLM Validation: Solutions are independently rewritten by two different LLMs (Llama 3.1 70B-Ins and Qwen 2.5 72B-Ins); final answers must agree via string/numeric/symbolic (SymPy-equivalence) match.
    • Deduplication: Hash-based and fuzzy string matching prunes duplicates.
    • Human Verification: Random 10% sample spot-checked by two graduate annotators; 88% unanimous correctness, inter-annotator agreement ≈ 0.91.
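
The timestamp-based splitting described in the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the ISO 8601 timestamp format, and the function names `split_by_year` and `posted_after` are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def split_by_year(qa_pairs):
    """Group QA pairs into calendar-year splits keyed by the post timestamp."""
    splits = defaultdict(list)
    for pair in qa_pairs:
        # Assumed format: ISO 8601 string such as "2024-03-15T12:00:00"
        posted = datetime.fromisoformat(pair["timestamp"])
        splits[posted.year].append(pair)
    return dict(splits)


def posted_after(qa_pairs, cutoff):
    """Keep only pairs posted strictly after a given cutoff timestamp."""
    cutoff_dt = datetime.fromisoformat(cutoff)
    return [
        p for p in qa_pairs
        if datetime.fromisoformat(p["timestamp"]) > cutoff_dt
    ]


# Hypothetical records, for illustration only
pairs = [
    {"topic_id": 1, "timestamp": "2023-05-02T09:30:00"},
    {"topic_id": 2, "timestamp": "2024-01-17T18:05:00"},
    {"topic_id": 3, "timestamp": "2024-07-30T07:45:00"},
]
splits = split_by_year(pairs)
fresh = posted_after(pairs, "2023-12-31T23:59:59")
```

Filtering by a cutoff, as in `posted_after`, is one way to evaluate a model only on problems posted after its presumed training cutoff.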

2. Dataset Composition and Statistics

The 2024 split of LiveAoPSBench breaks down by number of answers, problem difficulty, and subject area as follows:

Statistic | 2024 Split Value | Notes
Total problems (M) | 3,863 | January–August 2024
Single-answer problems | 60.4% |
Two-answer problems | 24.1% |
≥3-answer problems | 15.5% |
Difficulty: Middle School | 7.4% | By AoPS tag
Difficulty: High School | 34.9% |
Difficulty: College | 8.1% |
High-School Olympiads | 25.2% |
Other categories | 24.4% |
Algebra | 28% |
Combinatorics | 21% |
Geometry | 19% |
Number Theory | 18% |
Inequalities & Analysis | 14% |

The benchmark expands at a nearly constant rate of $\lambda \approx 1{,}000$ new qualified QA pairs per month:

$$N(t) = N_0 + \lambda t$$

where $N_0 \approx 0$ at $t = 0$ (January 2023 start).
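
As a worked instance of this linear model, taking the stated values at face value ($N_0 \approx 0$, $\lambda \approx 1{,}000$ per month, with $t$ assumed to be measured in months since January 2023), twelve months of growth would give roughly:

```latex
N(12) = N_0 + \lambda \cdot 12 \approx 0 + 1{,}000 \times 12 = 12{,}000
```

This is an extrapolation from the stated rate, not a reported count.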

3. Evaluation Protocol

LiveAoPSBench utilizes a zero-shot, chain-of-thought prompting paradigm, with the following workflow:

  • Prompt Format
    • Each model receives: […]
    • Models may generate unrestricted step-by-step reasoning, but only the boxed final answer is parsed and scored.
  • Automated Scoring
    • Numeric: direct value match.
    • Symbolic: equivalence via automated SymPy checks.
    • Credit: full or zero; partial credit (for multipart answers) is proposed but not enabled by default.
  • Performance Metrics
    • Primary:

    $$\mathrm{Accuracy} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}[\hat y_i = y_i]$$

    • Optional (for multipart answers):

    $$\mathrm{Score} = \frac{1}{M} \sum_i \frac{\#\text{correct sub-answers}}{\#\text{total sub-answers}}$$
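
The scoring workflow above (parse the boxed answer, compare it with the reference, then average) might look like the sketch below. It is an assumption-laden illustration rather than the benchmark's scorer: all function names are hypothetical, only string and numeric matching are implemented, and the symbolic SymPy check mentioned above is deliberately left out.

```python
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def answers_match(pred, gold):
    """String match first, then numeric match (no symbolic check here)."""
    if pred is None:
        return False
    if pred.replace(" ", "") == gold.replace(" ", ""):
        return True
    try:
        return Fraction(pred) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        return False


def accuracy(outputs, golds):
    """Fraction of problems whose boxed answer matches the reference."""
    hits = [answers_match(extract_boxed(o), g) for o, g in zip(outputs, golds)]
    return sum(hits) / len(hits)


def partial_score(pred_parts, gold_parts):
    """Mean fraction of correct sub-answers per problem (optional metric)."""
    per_problem = [
        sum(answers_match(p, g) for p, g in zip(preds, golds)) / len(golds)
        for preds, golds in zip(pred_parts, gold_parts)
    ]
    return sum(per_problem) / len(per_problem)


# Hypothetical model outputs and reference answers
outputs = ["so the answer is \\boxed{0.5}", "Thus \\boxed{\\frac{3}{4}}", "no final answer"]
golds = ["1/2", "\\frac{3}{4}", "7"]
acc = accuracy(outputs, golds)
```

Here the first output matches numerically, the second matches as a string, and the third has no boxed answer, so it receives zero credit.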

4. Empirical Patterns and Contamination Effects

Comparative evaluation across annual splits (2023 vs. 2024) with 18 diverse LLMs yields several findings:

  • Performance Decay Over Time

    • All models show a consistent drop in accuracy: a model’s performance on the 2023 split substantially exceeds its score on the newer 2024 split.
    • For instance, Qwen 2.5 72B-Ins declines from $42.36\%$ to $40.45\%$, a drop of 1.91 percentage points, or $4.51\%$ in relative terms.
    • The observed accuracy drop spans […] across models.
    • Fitting a simple regression:

    […]

    where typical […] per month.

  • Contamination Linkage

    • Higher overlap between evaluation and pre-training datasets leads to greater inflation in accuracy on older splits.
    • Pearson correlation […] observed between pairwise-overlap rate and the accuracy drop across model families.
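
The 4.51% figure quoted above for Qwen 2.5 72B-Ins is consistent with a relative drop measured against the older split, rather than a difference in percentage points. A small check, with `relative_drop` as a hypothetical helper name:

```python
def relative_drop(acc_old, acc_new):
    """Relative accuracy drop in percent, measured against the older split."""
    return 100.0 * (acc_old - acc_new) / acc_old


# Figures quoted above for Qwen 2.5 72B-Ins (2023 split vs. 2024 split)
absolute = 42.36 - 40.45
relative = relative_drop(42.36, 40.45)
print(f"absolute: {absolute:.2f} points, relative: {relative:.2f}%")
```

The absolute difference is 1.91 points, and dividing by the 2023 accuracy of 42.36 gives the relative figure.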

A plausible implication is that static benchmarks become compromised once their items propagate into LLM training data, inflating the perceived progress in mathematical reasoning ability.

5. Contamination-Resistance and Verification Strategies

LiveAoPSBench’s methodology systematically targets the major contamination pathways inherent in open-model evaluation:

  • N-gram Decontamination:
    • Evaluation set employs 8-gram exact substring filtering against all public math corpora to minimize accidental pre-training overlap.
    • Training data uses a more permissive 10-gram filter.
  • Cross-LLM Agreement:
    • An example from the dataset is retained only if both LLMs produce the exact same boxed answer, verified via string match, numerical match, or symbolic equivalence.
  • Deduplication and Human Verification:
    • Near-duplicate detection (hash-based and fuzzy string matching) and periodic spot-checking (a random 10% review showing 88% correctness with inter-rater reliability ≈ 0.91) safeguard fidelity and quantify the residual error rate.
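
The n-gram filtering step can be illustrated with a short sketch. The source does not say how text is tokenized, so word-level n-grams over lowercased, whitespace-split text are an assumption here, and the function names and sample strings are hypothetical.

```python
def word_ngrams(text, n):
    """Set of word-level n-grams after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus, n):
    """Collect every n-gram that occurs anywhere in the reference corpus."""
    index = set()
    for doc in corpus:
        index |= word_ngrams(doc, n)
    return index


def decontaminate(problems, corpus, n=8):
    """Drop any problem that shares at least one n-gram with the corpus."""
    index = build_index(corpus, n)
    return [p for p in problems if word_ngrams(p, n).isdisjoint(index)]


# Hypothetical corpus and candidate problems
corpus = ["find all positive integers n such that n squared plus one is divisible by five"]
problems = [
    "Find all positive integers n such that n squared plus one is prime",
    "Prove that the sum of two odd integers is always even",
    "Determine whether positive integers n such that n squared plus one exist in pairs",
]
eval_clean = decontaminate(problems, corpus, n=8)
train_clean = decontaminate(problems, corpus, n=10)
```

The third problem shares a run of nine words with the corpus, so the 8-gram filter removes it while the 10-gram filter keeps it. This shows why the shorter window is the stricter one.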

6. Significance, Limitations, and Prospects

LiveAoPSBench sets a precedent for evolving, timestamped benchmarks in evaluating LLM mathematical reasoning:

  1. Dynamic Updating: Continuous monthly expansion prevents obsolescence and circumvents silent contamination.
  2. Automated Scalability: Mining crowd-sourced forum discussions makes it feasible to extract tens of thousands of Olympiad-level QA pairs with minimal supervision.
  3. Strict Decontamination: Eight-gram and cross-model-answer protocols, plus human audit, result in ∼88% verified benchmark correctness.
  4. Revealing Temporal Artifacts: The persistent accuracy decline on newer splits indicates most prior benchmarking (on static test sets) likely overstates LLM mathematical generalization.

Key future directions include:

  • Extending evaluation beyond answer-scoring to full proof-writing assessment,
  • Incorporating visual geometry by parsing forum-supplied diagrams,
  • Adapting the pipeline and contamination-resistant principles to other STEM domains (e.g., physics, computer science), broadening reliable LLM evaluation (Mahdavi et al., 24 Jan 2025).