MATH-Beyond: Enhancing RL for Math Reasoning

Updated 15 October 2025
  • MATH-Beyond is a selective, zero-baseline benchmark designed to push RL techniques beyond what base models can reach through extensive sampling on challenging math problems.
  • The dataset is meticulously filtered from DAPO-Math-17K and DeepScaleR, ensuring problems remain unsolvable by base models even at high sampling budgets.
  • Empirical results reveal that exploration-centric post-training substantially boosts expansion rates, underscoring the need for methods that elicit novel reasoning strategies.

MATH-Beyond (MATH-B) is a selective benchmark designed to catalyze reinforcement learning (RL) research in mathematical reasoning by requiring methods to expand an LLM’s capabilities beyond what is accessible to the base model, even with extensive sampling. Unlike widely used math benchmarks, where base LLMs of moderate scale (up to 8B parameters) solve most problems at large draw counts (e.g., pass@1024), MATH-B includes only problems on which the base models have virtually no success under these high sampling budgets. By construction, any improvement on MATH-B through RL or alternative post-training must reflect a true expansion of the model’s reasoning repertoire. The dataset, derived from the DAPO-Math-17K and DeepScaleR corpora, is rigorously filtered, post-processed, and partitioned to facilitate exploration-centric research in mathematical skill acquisition.

1. Motivation and Benchmark Philosophy

MATH-B is motivated by the observation that standard open-source benchmarks such as MATH-500 and AIME 2024 have been saturated: base models can solve nearly all problems when sampled sufficiently (e.g., pass@1024). This exposes a critical limitation—current RL fine-tuning approaches primarily sharpen existing solution strategies rather than elicit novel reasoning behaviors. The intention with MATH-B is to provide a “zero-baseline” benchmark: base models fail (pass@1024 ≈ 0) by design. Thus, any post-training method must move beyond mere exploitation of base model modes and instead foster new, exploration-driven reasoning capabilities.

This benchmark is positioned to address the gap between the theoretical aspirations of RL (as seen in domains like Atari or AlphaZero, which discover novel strategies absent from initial policies) and its typical application in LLM mathematical reasoning where improvement is commonly achieved solely via enhanced sampling or solution selection.

2. Dataset Construction and Filtering Pipeline

The construction of MATH-B follows a stringent pipeline to guarantee that included problems genuinely challenge the reasoning limits of base models:

  • Source Pool: Problems are drawn from DAPO-Math-17K and DeepScaleR datasets, covering competition-style high-school mathematics.
  • Deterministic Quality Filters: Problems with ambiguous answer types (multiple answer tuples, unordered lists) and those of multiple-choice format are systematically removed to ensure clarity and evaluability. Only problems for which the ground-truth answer can be programmatically verified are retained.
  • Base Model Screening: An initial screening removes problems that base models solve readily under a small sampling budget (e.g., pass@16 with DeepSeek-R1-Distill-Qwen2.5-7B), discarding clearly tractable questions early.
  • Frontier Model Verification: Remaining problems are checked by high-performing models (e.g., GPT-5-Mini, o4-mini-high) for the validity of the answer annotations.
  • Benchmark Deduplication: Problems are deduplicated against leading evaluation sets (e.g., MATH-500, AIME, AMC) via exact string matching to ensure novelty.
  • Zero-Baseline Construction: The final step runs a large suite of current base models (Qwen2.5, OLMo, Llama-3.1) with pass@1024 sampling. Only problems that none of these models solve in any draw are selected for MATH-B, so every retained problem lies outside the base models’ reachable solution set, i.e., ℛₖ(q, D) = ∅ for draw size k = 1024 (a minimal sketch of this selection step follows the list).
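
The selection logic of this final step can be sketched as below. This is a minimal illustration rather than the authors’ released pipeline: the per-model sampler and the programmatic answer checker are passed in as callables, and the "question"/"answer" field names are assumptions.

```python
from typing import Callable, Dict, Iterable, List

def solved_within_budget(
    sample_fn: Callable[[str, int], List[str]],  # hypothetical sampler: (question, n) -> n completions
    check_fn: Callable[[str, str], bool],        # hypothetical checker: (completion, answer) -> accepted?
    problem: Dict[str, str],
    k: int = 1024,
    batch: int = 256,
) -> bool:
    """Return True if any of k sampled completions passes the answer check."""
    drawn = 0
    while drawn < k:
        n = min(batch, k - drawn)
        if any(check_fn(c, problem["answer"]) for c in sample_fn(problem["question"], n)):
            return True  # the problem is reachable by this base model
        drawn += n
    return False

def zero_baseline_filter(
    problems: Iterable[Dict[str, str]],
    samplers: List[Callable[[str, int], List[str]]],  # one sampler per base model
    check_fn: Callable[[str, str], bool],
    k: int = 1024,
) -> List[Dict[str, str]]:
    """Keep only problems that no base model solves within k draws (pass@1024 = 0)."""
    return [
        p for p in problems
        if not any(solved_within_budget(s, check_fn, p, k) for s in samplers)
    ]
```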

The result is a dataset with several partitions for fine-grained evaluation (a toy sketch of how the splits are formed follows the list):

  • Union (MATH-B-U): problems that at least one base model fails to solve; typical use: main evaluation pool.
  • Intersection (MATH-B-I): problems that all base models fail to solve; typical use: core challenge set.
  • Model-specific splits: problems unsolved by a given base model at baseline; typical use: targeted RL analysis.
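
Conceptually, these splits are set operations over per-model failure sets. The sketch below is purely illustrative: the `unsolved` mapping, the split keys, and the toy problem IDs are hypothetical.

```python
from typing import Dict, Set

def build_splits(unsolved: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """`unsolved` maps each base model name to the IDs of problems it fails at pass@1024."""
    per_model = list(unsolved.values())
    splits = {
        "MATH-B-U": set.union(*per_model),         # unsolved by at least one base model
        "MATH-B-I": set.intersection(*per_model),  # unsolved by every base model
    }
    splits.update({f"MATH-B::{name}": ids for name, ids in unsolved.items()})  # model-specific splits
    return splits

# Toy example with made-up problem IDs:
splits = build_splits({
    "Qwen2.5-7B": {"p1", "p2", "p3"},
    "Llama-3.1-8B": {"p2", "p3", "p4"},
})
assert splits["MATH-B-U"] == {"p1", "p2", "p3", "p4"}
assert splits["MATH-B-I"] == {"p2", "p3"}
```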

3. Evaluation Metrics and Reporting

MATH-B is evaluated primarily using the empirical pass@k metric, with k=1024 as standard:

  • For a given problem $x$, $k$ completions $y_1, \ldots, y_k$ are sampled from model $p$.
  • The pass@k indicator is defined as:

$$\text{pass@k}(p; x) = \mathbb{I}\bigl(\exists\, i : y_i \in \mathcal{C}(x)\bigr)$$

where $\mathcal{C}(x)$ is the set of accepted answers.

  • The overall score is averaged over the dataset:

$$\text{pass@k}(p) = \frac{1}{|D|} \sum_{x \in D} \text{pass@k}(p; x)$$

  • Since base model pass@1024 scores are ≈0 by design, any solve by a post-trained policy reflects true expansion. The key metric is Expansion Rate:

$$\text{Expansion Rate} = \frac{|\mathcal{E}_k|}{|D|} = \text{pass@1024}(\pi)$$

where $\mathcal{E}_k$ is the set of problems newly solved by the post-trained policy $\pi$.

Alternative conceptual measures (Shrinkage, Preservation, Consolidation) collapse on MATH-B due to the zero baseline.
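
Under the zero-baseline design, these metrics reduce to a straightforward computation once per-draw correctness has been recorded. The sketch below is a minimal illustration; the `base`/`post` result layout (problem ID mapped to per-draw correctness flags) is an assumption, not a format mandated by the benchmark.

```python
from typing import Dict, List

def pass_at_k(correct_flags: List[bool], k: int = 1024) -> bool:
    """Empirical pass@k for a single problem: did any of the first k draws succeed?"""
    return any(correct_flags[:k])

def expansion_rate(
    base: Dict[str, List[bool]],  # problem ID -> per-draw correctness for the base model
    post: Dict[str, List[bool]],  # problem ID -> per-draw correctness for the post-trained policy
    k: int = 1024,
) -> float:
    """Fraction of problems solved by the post-trained policy but not by the base model.

    On MATH-B the base model solves nothing by construction, so this equals the
    post-trained policy's pass@k averaged over the dataset.
    """
    newly_solved = [
        pid for pid, flags in post.items()
        if pass_at_k(flags, k) and not pass_at_k(base.get(pid, []), k)
    ]
    return len(newly_solved) / len(post)
```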

4. Empirical Findings and Methodological Implications

Empirical evaluation of state-of-the-art RL models reveals significant limitations:

  • RL-finetuned models such as Nemotron-Research-Reasoning-Qwen-1.5B (v1, v2) and DeepScaleR-1.5B-Preview achieve only modest Expansion Rates on MATH-B (e.g., 5.22% to 9.57%).
  • Conversely, models fine-tuned via long chain-of-thought distillation from more powerful teacher policies (Qwen3-4B, Qwen3-8B) attain much higher Expansion Rates (59% and 66%), suggesting that exposing models to new reasoning strategies during training plays a critical role, rather than relying solely on RL-generated exploration.
  • This pattern supports the claim that existing RL approaches mainly consolidate or sharpen pre-existing skills rather than expand the set of solution modes in the policy’s support.

A plausible implication is that RL methods on LLMs for math reasoning must be designed explicitly for exploration, possibly by integrating external teacher policies, leveraging curriculum design, or using unsolved instance discovery, to achieve substantial Expansion Rates on zero-baseline benchmarks like MATH-B.

5. Benchmark Usage and Access

MATH-B is published for public use at https://huggingface.co/datasets/brendel-group/MATH-Beyond, and documentation is available at https://brendel-group.github.io/math-beyond/.

The recommended protocol for research is as follows (a minimal evaluation sketch is given after the list):

  1. Evaluate the base model on MATH-B with pass@1024. The expectation is a success rate near zero.
  2. Apply a post-training method (RL or alternative) to the same base model.
  3. Re-evaluate on MATH-B with the same protocol; every newly solved problem reflects authentic expansion and counts toward the Expansion Rate.
  4. Use the provided Union, Intersection, and model-specific splits for holistic and targeted assessment.
  5. Leverage topic and difficulty annotations for further ablation or domain-specific analyses.
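
A minimal sketch of steps 1 and 3, assuming the Hugging Face `datasets` library; the split name and the "question"/"answer" field names are guesses, so consult the dataset card linked above for the actual schema.

```python
from typing import Callable, List

from datasets import load_dataset  # pip install datasets

def evaluate_pass_at_k(
    generate_fn: Callable[[str, int], List[str]],  # e.g., a vLLM-backed sampler: (question, n) -> completions
    check_fn: Callable[[str, str], bool],          # programmatic answer check: (completion, answer) -> accepted?
    split: str = "train",                          # assumed split name; verify against the dataset card
    k: int = 1024,
) -> float:
    """Average pass@k over a MATH-B split; run once for the base model, once after post-training."""
    dataset = load_dataset("brendel-group/MATH-Beyond", split=split)
    solved = 0
    for row in dataset:
        completions = generate_fn(row["question"], k)
        if any(check_fn(c, row["answer"]) for c in completions):
            solved += 1
    return solved / len(dataset)
```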

6. Technical Significance and Future Research Directions

MATH-B establishes a new rigorous standard for assessing genuine reasoning expansion in mathematical LLMs. Its zero-baseline design blocks shortcut progress via repeated sampling, enforcing that post-training methods lead to qualitative shifts in problem-solving ability. The benchmark thus precisely operationalizes what it means for an RL approach to “go beyond the base model.”

A plausible future direction is the development of RL techniques centered on exploration—such as novelty search, counterexample-guided RL, or explicit objective diversification—to significantly increase expansion rates on MATH-B. Additionally, analysis of the relationship between problem topic, human-labeled difficulty, and model expansion could help isolate where current LLMs fail and guide curricular improvements or targeted augmentation.

7. Conclusions

MATH-Beyond (MATH-B) is a rigorously constructed, zero-baseline benchmark that forces RL-based post-training on LLMs to move from mere solution sharpening to real reasoning expansion. It is designed to catalyze the development of exploration-centric RL methodologies and to provide a realistic, high-resolution testbed for advanced mathematical skill acquisition. All resources, including topic-difficulty annotations, are available for the research community, and future benchmarks can build on this paradigm to probe and extend the fundamental reasoning boundaries of LLMs.
