Art of Problem Solving: Data & LLM Benchmarks
- Art of Problem Solving is an online community specializing in Olympiad-level mathematics with a vast repository of challenges and detailed solutions.
- An automated pipeline curates over 650k QA pairs using classification, answer extraction, and solution rewriting to support LLM training.
- LiveAoPSBench offers a timestamped, contamination-resistant benchmark, ensuring robust evaluation of advanced mathematical reasoning in LLMs.
The Art of Problem Solving (AoPS) is a prominent online community and forum specializing in Olympiad-level mathematics problems and solutions. In recent research, the AoPS platform has become a foundational resource for constructing large-scale, high-quality datasets and benchmarks for training and evaluating LLMs in advanced mathematical reasoning. The forum's vast repository of user-generated problems, solutions, and discussions enables the systematic extraction of question-answer (QA) pairs at a scale and level of difficulty previously unattainable in automated fashion (Mahdavi et al., 24 Jan 2025).
1. AoPS as a Data Source for LLMs
AoPS hosts over one million "topics," with a sustained annual influx of at least 15,000 new math questions from 2020–2024. The forum focuses on Olympiad and advanced problem solving, with a significant proportion (approximately 75%) of topics tagged at the high school Olympiad or collegiate level. The content spans proof-style problems (32%), numerical-answer problems (28%), and a broad range of combinatorics, geometry, number theory, and algebra.
Mahdavi et al. introduced an automated pipeline to leverage AoPS, constructing two key resources: AoPS-Instruct, a large Olympiad-level training dataset with detailed solutions, and LiveAoPSBench, a contamination-resistant, timestamped evaluation benchmark (Mahdavi et al., 24 Jan 2025).
2. Automated Data Extraction Pipeline
The pipeline for extracting and refining AoPS content operates as follows:
- Step 0 – Raw Forum Collection: All 1,076,712 AoPS topics are scraped and split by timestamp, with topics created by December 2023 allocated to training and those from January to August 2024 held out for evaluation.
- Step 1 – Math-Question Detection: Qwen 2.5 14B is utilized as a binary classifier to detect if the initial post in each topic contains a concrete math problem, yielding 478,337 retained topics.
- Step 2 – Q/A Extraction: The first post serves as the question, and subsequent posts are treated as candidate answers. Llama 3.1 70B-INS parses each thread and exports structured JSON objects enumerating all user-provided solutions.
- Step 3 – Solution Rewriting: Given the brevity and implicit reasoning of many community solutions, Qwen 2.5 72B rewrites each answer into fully detailed, step-by-step solutions. Explicit reasoning (e.g., AM–GM applications) is made manifest for improved chain-of-thought fine-tuning.
- Step 4 – Decontamination: QA pairs are discarded if any 10-gram substring appears in test splits of existing math benchmarks such as MATH, GSM8K, or OlympiadBench.
The paper formalizes this extraction workflow as pseudocode (Algorithm 1); a minimal sketch of the same steps is given below.
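The following Python-style sketch illustrates the structure of the pipeline, not the authors' exact implementation. The helpers `classify_is_math`, `extract_qa`, and `rewrite_solution` stand in for calls to Qwen 2.5 14B, Llama 3.1 70B-INS, and Qwen 2.5 72B respectively, and details such as the topic data fields and word-level n-gram tokenization are assumptions.

```python
from datetime import datetime

TRAIN_CUTOFF = datetime(2023, 12, 31)
NGRAM = 10  # 10-gram decontamination for the training split


def ngrams(text, n=NGRAM):
    """Return the set of word-level n-grams of a string (tokenization assumed)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(question, benchmark_ngrams):
    """Step 4: flag a QA pair whose question shares any n-gram with a benchmark test split."""
    return not ngrams(question).isdisjoint(benchmark_ngrams)


def build_aops_instruct(topics, benchmark_ngrams,
                        classify_is_math, extract_qa, rewrite_solution):
    """Sketch of Steps 0-4 for the training split (topics created by Dec 2023)."""
    dataset = []
    for topic in topics:
        if topic.created_at > TRAIN_CUTOFF:                  # Step 0: time-based split
            continue
        first_post = topic.posts[0].text
        if not classify_is_math(first_post):                 # Step 1: math-question detection
            continue
        answers = extract_qa(first_post, topic.posts[1:])    # Step 2: Q/A extraction
        rewritten = [rewrite_solution(first_post, a)         # Step 3: step-by-step rewriting
                     for a in answers]
        if is_contaminated(first_post, benchmark_ngrams):    # Step 4: decontamination
            continue
        for solution in rewritten:
            dataset.append({
                "question": first_post,
                "solution": solution,
                "timestamp": topic.created_at.isoformat(),
            })
    return dataset
```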
3. AoPS-Instruct Dataset
After filtering and decontamination, AoPS-Instruct provides 647,255 QA pairs for training. The distribution of answers per question is as follows: 60% possess a single answer, 24% have two, and 8% have three. Exact substring overlap with other math datasets (Numina, GSM8K, MATH) is below 14.1%, confirming the novelty of the resource.
AoPS-Instruct's core innovation is in its large-scale, fully automated curation of Olympiad-level QA data, combined with LLM-generated, explicit, and detailed solutions tailored for LLM instruction fine-tuning.
| Attribute | Value / Description | Notes |
|---|---|---|
| Total QA pairs | 647,255 | After decontamination |
| Answers per question | 1: 60%, 2: 24%, 3: 8% | |
| Difficulty tags | ~75% high school/olympiad | |
| Content categories | Proof (32%), Numerical (28%), Other | Combinatorics, geometry, algebra |
| Overlap with benchmarks | <14.1% (10-gram substrings) | Against Numina, GSM8K, MATH, etc. |
4. LiveAoPSBench: Contamination-Resistant Evaluation
LiveAoPSBench is constructed using data from January 2023–September 2024, with a stricter 8-gram decontamination step and further filters:
- Only questions with explicit boxed answers (a final \boxed{...} expression) are included.
- Each solution is independently rewritten by both Llama 3.1 70B and Qwen 2.5 72B.
- A QA item is retained only if (a) both rewritten solutions and (b) the original answer all match under string equality, numerical value, and SymPy-based symbolic equivalence.
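A minimal sketch of this retention criterion, assuming the final answers have already been extracted as plain expression strings (full LaTeX parsing is omitted); the cascade of string, numeric, and SymPy checks mirrors the criterion above:

```python
import sympy


def answers_equivalent(a: str, b: str) -> bool:
    """Compare two final answers: exact string match, then numeric value,
    then SymPy symbolic equivalence (assumes plain expression strings)."""
    a, b = a.strip(), b.strip()
    if a == b:                                    # string equality
        return True
    try:
        return abs(float(a) - float(b)) < 1e-9    # numerical value
    except ValueError:
        pass
    try:
        ea, eb = sympy.sympify(a), sympy.sympify(b)
        return sympy.simplify(ea - eb) == 0       # symbolic equivalence
    except (sympy.SympifyError, TypeError):
        return False


def keep_item(rewrite_llama: str, rewrite_qwen: str, original: str) -> bool:
    """Retain a QA item only if both rewritten answers and the original all agree."""
    return (answers_equivalent(rewrite_llama, original)
            and answers_equivalent(rewrite_qwen, original)
            and answers_equivalent(rewrite_llama, rewrite_qwen))
```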
The final LiveAoPSBench-0824 set contains 3,863 items, all from 2024. Quality verification involved dual manual annotation of a 10% subsample (386 items), revealing 88% correctness, 8% incorrectness, and 4% "no-answer," with inter-annotator agreement at 91%.
Because every item is timestamped, future evaluation splits can be constructed by selecting only items posted after any chosen cutoff, minimizing the risk of pretraining contamination.
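A sketch of how such a cutoff-based split could be materialized from the timestamped pool; the `timestamp` field and the example cutoff are illustrative assumptions:

```python
from datetime import datetime


def live_split(items, cutoff: datetime):
    """Keep only benchmark items posted strictly after the chosen cutoff,
    so models pretrained before that date cannot have seen them."""
    return [item for item in items if item["timestamp"] > cutoff]


# Example: a split restricted to items posted in 2024
# bench_2024 = live_split(all_items, cutoff=datetime(2023, 12, 31))
```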
| Step | 2024 Raw Items | After Decontamination | Boxed Answers Only | After LLM Cross-check |
|---|---|---|---|---|
| QA pairs | 14,158 | 13,494 | 7,173 | 3,863 |
5. Evaluation Protocols and Empirical Observations
Evaluation on LiveAoPSBench is performed using accuracy: the percentage of problems where the model's final answer exactly matches the ground truth boxed answer. SymPy symbolic equivalence is applied for symbolic expressions.
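The protocol amounts to extracting each model's final boxed answer and scoring it against the ground truth. A minimal sketch follows, with `extract_boxed` as an illustrative helper (no nested braces handled) and the equivalence check passed in, e.g., the `answers_equivalent` sketch above:

```python
import re


def extract_boxed(solution: str) -> str:
    """Pull the contents of the last \\boxed{...} in a model's solution text
    (simple regex; nested braces not handled)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else ""


def benchmark_accuracy(model_solutions, gold_answers, equivalent) -> float:
    """Percentage of problems whose extracted final answer matches the ground
    truth under the supplied equivalence check (exact match plus SymPy fallback)."""
    correct = sum(equivalent(extract_boxed(sol), gold)
                  for sol, gold in zip(model_solutions, gold_answers))
    return 100.0 * correct / len(gold_answers)
```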
A significant empirical finding is a consistent drop in LLM performance on newer (post-cutoff) test items, indicating that prior benchmark results were affected by pretraining contamination. For instance, Qwen 2.5 72B-INS scores 4.51 percentage points lower on 2024 test problems than on 2023 ones, and smaller or general-purpose LLMs exhibit even larger drops (up to 23.6 pp).
Fine-tuning on AoPS-Instruct consistently yields greater gains in accuracy compared to fine-tuning on existing datasets such as Numina. Combining AoPS-Instruct with Numina in instruction fine-tuning leads to the best results for most architectures.
| SFT Recipe | Accuracy on LiveAoPSBench-2024 (DeepSeek-Math 7B) |
|---|---|
| No SFT | 11.7% |
| SFT on Numina | 16.3% |
| SFT on AoPS-Instruct | 20.1% |
| SFT on Numina + AoPS | 19.7% |
Ablation studies show that omitting solution rewriting (i.e., fine-tuning on raw user solutions) degrades accuracy by 10–15 pp, and using a stronger rewriting LLM (Qwen 2.5 72B vs. Llama 3.1 70B) contributes an additional 4–6 pp on the most challenging benchmarks.
6. Contributions and Generalizability
AoPS-Instruct is the first fully automated, large-scale (650k) Olympiad-level QA dataset featuring LLM-rewritten, step-by-step solutions. LiveAoPSBench is likewise the first timestamped, contamination-resistant math benchmark supporting robust measurement of true mathematical generalization in LLMs. The entire pipeline, being fully automated and extensible, provides a template for constructing high-quality, domain-specific datasets and evolving benchmarks from other topical online communities (Mahdavi et al., 24 Jan 2025).
Empirical assessments affirm both the importance of dataset novelty and the critical role of detailed, explicit solution presentation for downstream LLM math reasoning performance. These resources are directly available via https://github.com/DSL-Lab/aops.