
FineMath: Dual Modality Math Dataset

Updated 8 February 2026
  • FineMath is a dual-modality dataset comprising an English corpus of symbolic, step-by-step solutions and a Chinese benchmark for elementary math word problems.
  • The English corpus enhances LLM training with high-quality deductive examples, improving performance on benchmarks such as GSM8K, MATH, and MMLU-STEM; the Chinese benchmark supplies calibrated difficulty levels for fine-grained evaluation.
  • Innovative annotation, rigorous data filtering, and targeted pretraining integration enable measurable gains in mathematical reasoning for small and medium LMs.

FineMath refers to two distinct open datasets targeting mathematical reasoning in LLMs: (a) FineMath, a large-scale English-language pretraining corpus of symbolic, step-by-step solutions for training small LMs (Allal et al., 4 Feb 2025); and (b) FineMath, a fine-grained Chinese evaluation benchmark for elementary-level math word problems (Liu et al., 2024). Both introduce substantial innovations in their respective modalities: pretraining-corpus construction and evaluation methodology.

1. Objectives and Scope

English FineMath was motivated by recognized deficits in existing web math corpora, particularly the scarcity of high-quality step-by-step solution data and the over-concentration on advanced, formulaic mathematical texts with low pedagogical impact. The dataset aims to supply a significantly larger volume of explicit, deductive mathematical reasoning at the early-undergraduate level and below, with strong representation of symbolic solutions and intermediate calculation steps. It is designed for integration into the late-stage pretraining of small to medium LMs, specifically for quantitative performance improvements on downstream benchmarks such as GSM8K, MATH, and MMLU-STEM (Allal et al., 4 Feb 2025).

Chinese FineMath addresses the need for a systematic and difficulty-calibrated evaluation benchmark for Chinese-language LLMs, focusing on elementary mathematics word problems (MWPs). Its primary objective is to map model performance in a granular manner across diverse problem categories and explicit reasoning depths, as measured by the number of atomic inference steps required per solution (Liu et al., 2024).

2. Data Acquisition, Annotation, and Preprocessing

English FineMath

Data is sourced from Common Crawl WARC files targeting the FineWeb-Edu crawl set (comprising 5.8 billion URLs), supplemented by domain expansion through frequency analysis of URLs in OpenWebMath (OWM) and InfiMM-WebMath (Allal et al., 4 Feb 2025). Annotation utilizes a two-stage classifier-based pipeline using Llama-3.1-70B-Instruct:

  • Silver labeling: a 3-point rubric prompt; domains with ≥10 pages scoring ≥2 advance to the next stage.
  • Gold labeling: a 5-point rubric prompt emphasizing step-by-step middle/high-school solutions; only pages scoring ≥4 enter FineMath4+, and those scoring ≥3 enter FineMath3+.
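The two gold-label thresholds imply nested subsets: every FineMath4+ page is also a FineMath3+ page. A minimal sketch of the routing logic (the function name is my own, not from the pipeline):

```python
def route_page(gold_score: int) -> list[str]:
    """Route a page to FineMath subsets by its 5-point gold-label score.

    Pages scoring >=3 enter FineMath3+, and pages scoring >=4 additionally
    enter FineMath4+, so FineMath4+ is a strict subset of FineMath3+.
    """
    subsets = []
    if gold_score >= 3:
        subsets.append("FineMath3+")
    if gold_score >= 4:
        subsets.append("FineMath4+")
    return subsets
```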

Preprocessing is performed by re-extracting candidate URLs with the OWM pipeline (Resiliparse for boilerplate removal and LaTeX preservation), deduplication via single-band MinHash LSH, fastText language classification to retain only English text, and contamination filtering against GSM8K, MATH, MMLU using 13-gram longest common subsequence overlap (LCS ratio ≥ 0.6).
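The contamination test can be sketched as follows. In the actual pipeline, candidate matches are first shortlisted by shared 13-grams before the LCS-ratio threshold is applied; the tokenization and exact matching rules here are simplifying assumptions:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via rolling 1-D dynamic programming."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value of the previous row at column j-1
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def is_contaminated(doc_tokens: list[str], test_tokens: list[str],
                    threshold: float = 0.6) -> bool:
    """Flag a document whose LCS with a benchmark sample covers >=60% of it."""
    if not test_tokens:
        return False
    return lcs_length(doc_tokens, test_tokens) / len(test_tokens) >= threshold
```

Documents flagged by this test against GSM8K, MATH, or MMLU samples are dropped from the corpus.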

Chinese FineMath

Manually curated from Chinese textbooks, workbooks, and online repositories, the dataset only retains MWPs with unambiguous text-based queries and closed-form answers, discarding image-dependent or externally referenced content (Liu et al., 2024). The 1,584 problems are annotated in four stages: categorization into 17 types, standardization of prompt format, decomposition of solutions into atomic reasoning steps, and construction of four-choice MCQ variants with distractors modeled after AQUA.

Difficulty is assigned by explicit counting of atomic solution steps:

  • Level-1: 1 step
  • Level-2: 2 steps
  • Level-3: 3 or more steps

Each category contains at least 60 problems, with a minimum of 20 per difficulty level.
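The step-count-to-level mapping above can be expressed as a one-line rule (the function name is my own):

```python
def difficulty_level(num_steps: int) -> int:
    """Map the number of atomic solution steps to a FineMath difficulty level."""
    if num_steps < 1:
        raise ValueError("a solution must have at least one step")
    return min(num_steps, 3)  # Level-3 covers 3 or more steps
```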

3. Dataset Composition and Structure

English FineMath

Released variants include FineMath3+ (34B tokens, 21.4M documents, average 1,590 tokens/doc) and FineMath4+ (10B tokens, 6.7M documents, average 1,490 tokens/doc), reflecting the gold-label score thresholds. Document lengths span from ~100 to ~5,000 tokens, corresponding to short proofs and extended worked examples, respectively. Empirical topic distribution (in ~1K samples) is: Algebra/equations 40%, Calculus 25%, Combinatorics/probability 15%, Geometry/trigonometry 10%, Number theory/miscellaneous 10% (Allal et al., 4 Feb 2025).
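As a quick consistency check, the per-document averages follow directly from the reported token and document counts (the figures are from the text; the rounding is mine):

```python
# (total tokens, documents) for each released variant, per the reported statistics
variants = {
    "FineMath3+": (34e9, 21.4e6),
    "FineMath4+": (10e9, 6.7e6),
}
for name, (tokens, docs) in variants.items():
    # Yields ~1,589 and ~1,493 tokens/doc, matching the reported ~1,590 and ~1,490
    print(f"{name}: ~{tokens / docs:.0f} tokens/doc")
```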

Average formula complexity in FineMath4+ documents, measured as mean GPT-2 token count of LaTeX math segments, is approximately 120.

Chinese FineMath

FineMath consists of 1,584 MWPs, evenly distributed across 17 categories, which fall under five curriculum-aligned domains plus "Others": Number & Operations, Measurement, Data Analysis & Probability, Algebra, Geometry, and two classic forms (Optimization, Tree Planting). Each question is annotated for number of reasoning steps and includes both generative and MCQ evaluation forms, with four distractors per MCQ modeled to AQUA standards. There is no train/dev/test split: the full set is used as an evaluation benchmark (Liu et al., 2024).

4. Integration into LLM Training and Evaluation

Pretraining Application (English FineMath)

FineMath was injected at the concluding (stage 4) learning-rate decay phase of SmolLM2's 11T-token pretraining, targeting the final 1T tokens (the 10–11T range) under a 10% learning-rate decay. The data mixture at this phase:

  • FineMath4+: ~10% of data mixture (r_{FineMath} ≈ 0.10)
  • InfiWebMath3+: ~3.9%
  • OWM: 0.08%
  • AugGSM8K: 0.02%

Total math share: ~14%
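The stated total follows from summing the listed mixture shares:

```python
# Stage-4 decay-phase math mixture shares reported for SmolLM2 (fractions of the mix)
mixture = {
    "FineMath4+": 0.10,      # ~10%
    "InfiWebMath3+": 0.039,  # ~3.9%
    "OWM": 0.0008,           # 0.08%
    "AugGSM8K": 0.0002,      # 0.02%
}
math_share = sum(mixture.values())
print(f"Total math share: {math_share:.1%}")  # ~14%
```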

FineMath subsetting enables precise mixture tuning; researchers are advised to upsample FineMath4+ during the last 5–10% of pretraining, with a recommended data-mixture share of 8–12%. For robust evaluation, re-decontamination against test sets is critical (Allal et al., 4 Feb 2025).

Evaluation Protocol (Chinese FineMath)

Zero-shot evaluation is performed under two task paradigms:

  • Generation: Free-form answer generation to various prompt types.
  • Option Prediction: Four-choice MCQ selection.

Prompt template and task structure significantly affect measured performance, with prompt wording alone yielding ±15 percentage point shifts in accuracy, and MCQ versus generative formats showing bidirectional effects depending on model baseline. Contamination analysis with external corpora (e.g., Ape210K) reveals artificial inflation for models trained on overlapping data; reliable benchmarking requires overlap filtering (Liu et al., 2024).

5. Empirical Impact and Results

English FineMath

Ablation studies show superior performance for FineMath-derived corpora relative to existing baselines. Key results (mid-training, 3 epoch equivalent) (Allal et al., 4 Feb 2025):

Dataset          GSM8K   MATH   MMLU-STEM
OWM              10%     6%     5.5%
InfiMM-WebMath   14%     4%     5.8%
Infi-WebMath4+   22%     9%     7.0%
FineMath3+       28%     12%    8.4%
FineMath4+       31%     16%    9.2%

Late-stage introduction of FineMath4+ drove large end-stage gains in SmolLM2 on GSM8K (from 10% to 32.6%) and MATH (from 4.5% to 11.5%).

Chinese FineMath

Zero-shot accuracy (Prompt 0, generation-only) across representative LLMs (Liu et al., 2024):

Model               Accuracy
GPT-4               73%
GPT-3.5-Turbo       62%
ChatGLM2-6B         51%
Baichuan2-7B-Chat   43%
Qwen-7B-Chat        42%
InternLM-Chat-7B    38%
MathGLM-10B         37%
MathGLM-335M        31%

Accuracy deteriorates with reasoning depth: for GPT-4, 82% (1-step), 76% (2-step), 61% (≥3-step). Error analysis shows persistent difficulty in areas such as counting problems and optimization, and strong sensitivity to prompt formulation and answer format.

6. Representative Examples

English FineMath4+ includes stepwise solutions at varying complexity: quadratic equation factorization, definite integration via power rule, and combinatoric binomial coefficient expansion, all with LaTeX preserved (Allal et al., 4 Feb 2025).

A representative Level-3 problem from Chinese FineMath involves partitioning objects given ratios and sequential operations, with each atomic calculation and the final answer explicitly annotated (Liu et al., 2024).

7. Recommendations and Prospective Developments

English FineMath should be leveraged as an upsampled component for late-stage pretraining or fine-tuning, particularly when targeting symbolic, multi-step, and algebraic solution quality in small LMs. Stringent decontamination against evaluation sets is advised (Allal et al., 4 Feb 2025).

Chinese FineMath enables rigorously stratified benchmarking, illuminating gaps in multi-step reasoning and prompt sensitivity. Future directions identified include curriculum expansion (to middle and high school), enriched chain-of-thought annotations for evaluating work transparency, protocol standardization across prompt types, and contamination-resilient dataset construction (Liu et al., 2024).

FineMath thus provides critical infrastructure for both training and evaluation of mathematical reasoning in LLMs, supporting rapid gains in quantitative benchmarks and deeper analysis of LLM mathematical competence across linguistic and curricular domains.
