DeepScaleR: Scalable Math LLMs
- DeepScaleR is a suite of small-scale LLM reasoning systems that uses reinforcement learning and reward shaping for efficient, multi-step mathematical reasoning.
- It integrates GRPO, adaptive length penalties, and a curated math curriculum to boost benchmark performance while reducing computational costs.
- DeepScaleR drives advances in chain-of-thought generation and adaptive inference, setting the stage for scalable, cost-effective LLM research.
DeepScaleR is a suite of small-scale LLM reasoning systems and benchmarks developed to advance cost-effective, scalable, and efficient multi-step reasoning in mathematical domains. Centered on the use of reinforcement learning (RL) and specialized reward shaping, DeepScaleR investigates techniques to rapidly boost the reasoning capabilities of 1.5B-parameter models (notably DeepSeek-R1-Distill-Qwen-1.5B) on competition-level math tasks, with an emphasis on efficiency, modest training budgets, and practical performance trade-offs. The DeepScaleR corpus and models have become a cornerstone for subsequent work on RL-guided reasoning, efficient chain-of-thought (CoT) generation, and explorations in sample efficiency, curriculum methods, adaptive compute, and dynamic alignment for LLMs.
1. Foundations of the DeepScaleR Approach
DeepScaleR was developed to address the challenge of enabling small LLMs to perform high-quality mathematical reasoning under severe compute and dataset constraints (Dang et al., 20 Mar 2025). Its core design builds on the insight that RL with carefully chosen reward functions and datasets can quickly induce strong reasoning behavior in models with only 1–2 billion parameters, bypassing the need for expensive pretraining or large-scale supervised finetuning.
The DeepScaleR curriculum blends competition-level math problems (e.g., AIME, AMC, MATH500) with experiment-driven RL strategies that incentivize stepwise, accurate, and, in later iterations, concise reasoning. Empirical results highlight rapid gains in benchmark accuracy (e.g., AMC23 rising from 63% to 80% and AIME24 reaching 46.7%), often at total GPU costs orders of magnitude below mainstream LLM training. The DeepScaleR dataset and model weights are open-sourced and have since served both as a benchmark suite and as an LLM finetuning recipe for downstream research.
2. Reinforcement Learning Techniques and Reward Engineering
The principal RL method underlying DeepScaleR is Group Relative Policy Optimization (GRPO), which performs RL fine-tuning without an explicit value function estimator or large critic model (Dang et al., 20 Mar 2025). For each prompt, a group of $G$ candidate completions is generated; groupwise rewards are normalized, and the policy is updated to maximize improvement relative to a reference distribution while regularizing via KL divergence:
$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,\hat{A}_i - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right], \qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},$$
where $\hat{A}_i$ is the standardized group advantage of completion $o_i$, and $\beta$ controls the KL penalty toward the reference policy $\pi_{\mathrm{ref}}$. This approach achieves stability and rapid learning in small-parameter models and supports regularization to prevent overconfident, low-entropy completions.
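The group-relative advantage and KL-regularized update can be sketched as follows. This is a minimal PyTorch illustration, not the DeepScaleR training code: the function name, tensor shapes, and the simple k1 KL estimator are assumptions made for brevity, and a PPO-style clipping term is included as is common in GRPO implementations, though it is omitted from the objective above.

```python
# Minimal sketch of a GRPO-style update for one prompt, assuming verifiable
# 0/1 rewards per completion. Names (logp_new, logp_old, logp_ref) are
# illustrative assumptions.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_beta=0.01):
    """logp_*: (G,) summed log-probs of G sampled completions for one prompt;
    rewards: (G,) scalar rewards (e.g., 1.0 if the boxed answer is correct, else 0.0)."""
    # Group-relative advantage: standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # PPO-style clipped surrogate on the importance ratio w.r.t. the sampling policy.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Crude KL penalty toward the frozen reference policy (k1 estimator for brevity).
    kl = (logp_new - logp_ref).mean()
    return -(surrogate.mean() - kl_beta * kl)

# Toy usage: 4 completions for one prompt, two of them correct.
if __name__ == "__main__":
    g = 4
    logp_old = torch.randn(g)
    logp_new = (logp_old + 0.05 * torch.randn(g)).requires_grad_()
    logp_ref = logp_old.detach().clone()
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
    loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
    loss.backward()
    print(float(loss))
```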
Subsequent research further optimized reward design:
- HAPO (History-Aware Policy Optimization) (Huang et al., 16 May 2025): Combines accuracy and conciseness rewards by tracking the shortest correct solution found so far for each problem and applying a cosine-shaped length term based on the deviation from that historical minimum, guiding the model to "beat its own record" in answer length (illustrated in the sketch after this list).
- Adaptive Length Penalty (ALP) (Xiang et al., 5 Jun 2025): Conditions a length penalty on each prompt's empirical solve rate, estimated online across rollouts, with the length term normalized by the maximum generation budget, so that easy problems get shorter answers while harder instances are left effectively unpenalized.
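Both shaping ideas can be illustrated with a short sketch. The functional forms below (a cosine mapping around the historical minimum, and weighting the length term by the online solve rate) are simplified assumptions chosen for exposition, not the exact reward definitions from the HAPO or ALP papers:

```python
import math

def hapo_style_reward(correct: bool, resp_len: int, hist_min_len):
    """Illustrative history-aware shaping: reward correctness, plus a cosine-shaped
    length term that is positive when the response beats the shortest correct
    solution seen so far for this problem and negative when it is longer."""
    acc = 1.0 if correct else 0.0
    if hist_min_len is None:                      # no correct solution recorded yet
        return acc
    dev = max(-1.0, min(1.0, (resp_len - hist_min_len) / hist_min_len))
    return acc + 0.5 * math.cos(math.pi * (dev + 1.0) / 2.0)   # length term in [-0.5, +0.5]

def alp_style_reward(correct: bool, resp_len: int, max_len: int,
                     solve_rate: float, alpha: float = 0.5):
    """Illustrative adaptive length penalty: one simple way to realize the behavior
    described above is to weight a normalized length term by the prompt's online
    solve rate, so easy (frequently solved) prompts pay more per extra token."""
    acc = 1.0 if correct else 0.0
    return acc - alpha * solve_rate * (resp_len / max_len)

# Example: a correct 900-token answer to a problem whose record is 1,200 tokens
# and whose online solve rate is 0.75.
print(hapo_style_reward(True, 900, 1200), alp_style_reward(True, 900, 8192, 0.75))
```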
Additional experiments explored alternative RL algorithms (e.g., PPO) and demonstrated that just one or two training examples provided via RL with verifiable reward (RLVR) could nearly match the gains from fine-tuning on the full DeepScaleR subset (Wang et al., 29 Apr 2025).
3. Curriculum Design, Data Efficiency, and Question Augmentation
DeepScaleR emphasizes data efficiency by carefully filtering and curating math datasets to maximize signal per sample. Filtering strategies included selecting problems with unambiguous boxed answers, removing trivial or multipart questions, and balancing easy/hard splits for optimal learning curves (Dang et al., 20 Mar 2025). The main training mix typically included 7,000–21,000 high-quality math problems, enabling strong performance improvements with minimal computational expenditure (total training costs under $50 for baseline models).
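As a concrete illustration of this kind of curation, the following sketch filters raw problem/solution records for unambiguous boxed answers while dropping trivial or multipart items; the field names, regular expressions, and thresholds are assumptions for exposition, not the released DeepScaleR pipeline.

```python
import re

BOXED = re.compile(r"\\boxed\{([^{}]+)\}")

def extract_boxed(solution: str):
    """Return the boxed answer if exactly one is present, else None."""
    matches = BOXED.findall(solution)
    return matches[0].strip() if len(matches) == 1 else None

def keep_example(example: dict, min_len: int = 40) -> bool:
    problem, solution = example["problem"], example["solution"]
    if extract_boxed(solution) is None:          # ambiguous or missing final answer
        return False
    if len(problem) < min_len:                   # heuristically drop trivial items
        return False
    if re.search(r"\(a\)|\(b\)|part\s+\d", problem, re.IGNORECASE):
        return False                             # drop multipart questions
    return True

# Example record with a single unambiguous boxed answer.
sample = {"problem": "Find the remainder when 7^2025 is divided by 100.",
          "solution": "... so the answer is \\boxed{7}."}
print(keep_example(sample))   # True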
To further address the sparse-reward problem in RL on difficult multi-step reasoning, QuestA (Li et al., 17 Jul 2025) introduced question augmentation (Editor's term: "augmented RL curriculum"), wherein difficult questions are combined with partial solutions (hints) during training. This intervention dramatically increases the likelihood of sampling a correct trajectory, improves gradient flow, and boosts both pass@1 and pass@k metrics. Theoretically, QuestA reduces the RL sample complexity from $\Theta(1/\delta_p)$ to $\Theta(1/\delta_p')$, where $\delta_p' > \delta_p$ reflects the increased probability of sampling a correct trajectory under augmentation; here $\mathcal{S}(q) = \{\tau \mid R(q, \tau) = 1\}$ denotes the set of correct trajectories for question $q$, and $C(q, \delta_p) = \arg\min_{S:\, \sum_{\tau \in S} P_\mu(q, \tau) \ge 1-\delta_p} |S|$ is its minimal covering set under the sampling distribution $P_\mu$.
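A minimal sketch of the hint-augmentation idea follows; the prompt template, hint fraction, and annealing note are illustrative assumptions rather than QuestA's exact recipe (Li et al., 17 Jul 2025).

```python
def augment_with_hint(problem: str, reference_solution: str, hint_fraction: float = 0.5) -> str:
    """Prepend the first part of a known solution as a hint, so RL rollouts on hard
    problems sample correct trajectories (and hence nonzero reward) far more often."""
    cutoff = int(len(reference_solution) * hint_fraction)
    partial = reference_solution[:cutoff]
    return (f"{problem}\n\n"
            f"Partial solution (you may continue from here):\n{partial}\n\n"
            f"Complete the solution and give the final answer in \\boxed{{}}.")

# During training, the hint fraction can be annealed toward zero as the solve rate improves.
```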
4. Staged Training, Context Scaling, and Curriculum Efficiency
DeepScaleR's original recipe trains in multiple stages with progressively longer context windows, scaling across three stages to a maximum of 32 GPUs (Dang et al., 20 Mar 2025). The FastCuRL framework (Song et al., 21 Mar 2025), a successor curriculum RL approach, demonstrated even higher efficiency: achieving superior performance to DeepScaleR while halving the number of training steps and reducing maximum hardware requirements from 32 GPUs in three stages to a single 8-GPU node.
This progression highlights a shift toward staged, input length-aware curricula, context scaling, and prompt segmentation, all designed to optimize GPU use and training time while preserving or improving accuracy and reasoning quality.
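The staged, length-aware curriculum idea can be sketched as a simple partition of the training pool; the stage boundaries and context windows below are illustrative example values, not the published FastCuRL or DeepScaleR settings.

```python
from typing import Dict, List

def build_length_stages(problems: List[dict],
                        boundaries=(2_000, 6_000),            # reference-CoT length cutoffs (chars)
                        contexts=(8_192, 16_384, 24_576)) -> Dict[int, dict]:
    """Partition problems into stages by the length of their reference chain of
    thought, pairing later (longer) stages with larger context windows."""
    stages = {i: {"max_context": c, "data": []} for i, c in enumerate(contexts)}
    for p in problems:
        n = len(p.get("reference_cot", ""))
        stage = sum(n > b for b in boundaries)    # stage index 0, 1, or 2
        stages[stage]["data"].append(p)
    return stages

# Example: three toy problems land in stages 0, 1, and 2 respectively.
toy = [{"reference_cot": "x" * 500}, {"reference_cot": "x" * 3_000}, {"reference_cot": "x" * 9_000}]
print({k: len(v["data"]) for k, v in build_length_stages(toy).items()})
```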
5. Adaptive Computation, Alignment, and Efficiency
DeepScaleR also contributed to the development of efficient inference strategies:
- Token and Compute Efficiency: Post-training with ALP enables DeepScaleR-1.5B to approximately halve token usage, primarily by adjusting generation length according to the online solve rate per query (Xiang et al., 5 Jun 2025). The model automatically "thinks" less on easy problems and reallocates resources to more difficult ones, yielding a fivefold difference in token budget between the hardest and easiest instances at fixed accuracy.
- Flexible Realignment: Training-time and inference-time realignment frameworks (TrRa, InRa) further compress sequence length (TrRa-iter achieves a 54.63% token-usage reduction, versus 33.86% for DeepScaleR-1.5B-Preview) via controllable fusion of model logits and a lightweight layer adapter that allows on-demand switching between fast and slow thinking modes (Zhu et al., 15 Jun 2025); the fusion idea is sketched after this list.
- Test-Time Scaling and Extrapolation: DeepScaleR’s methodologies are tightly coupled with advances in "test-time scaling" (Setlur et al., 10 Jun 2025): models are trained to continue improving at extended inference horizons by learning to chain operations (generation, verification, refinement) and leveraging negative gradient updates to enhance exploration at longer token budgets. This enables the models to continually improve on hard problems when granted additional inference compute.
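The controllable logit-fusion idea behind flexible realignment can be sketched as a simple interpolation between two causal LMs at decoding time; the linear fusion rule, the coefficient lam, and the assumption of Hugging Face-style models whose forward pass returns .logits are illustrative simplifications, not the TrRa/InRa method itself.

```python
import torch

@torch.no_grad()
def fused_next_token(slow_model, fast_model, input_ids: torch.Tensor, lam: float = 0.7):
    """Greedy next-token selection from a convex combination of two models' logits.
    lam=1.0 recovers the deliberate ("slow thinking") model, lam=0.0 the fast one."""
    slow_logits = slow_model(input_ids).logits[:, -1, :]   # (batch, vocab)
    fast_logits = fast_model(input_ids).logits[:, -1, :]
    fused = lam * slow_logits + (1.0 - lam) * fast_logits
    return fused.argmax(dim=-1)                            # (batch,)
```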
6. Empirical Performance and Benchmarking
DeepScaleR models (and their enhanced variants) consistently set or match state-of-the-art performance for open-source 1.5B-parameter models on major math reasoning benchmarks. Notable results include:
- AIME24: improvements from ~40.4% pass@1 to 67.1% with QuestA-DeepScaleR (Li et al., 17 Jul 2025).
- AIME25: gains from 31.35% (baseline) to 59.5% (QuestA) pass@1.
- HMMT25: advances from 31.5% to 35.5% pass@1.
- Average length reductions of 33–59% with negligible accuracy drops under HAPO (Huang et al., 16 May 2025), and up to 50% with ALP (Xiang et al., 5 Jun 2025).
- Flexible realignment frameworks further reduce length without sacrificing accuracy (Zhu et al., 15 Jun 2025).
- Training with only a single high-value example using RLVR can nearly match full dataset improvements (MATH500: 36%→73.6%) (Wang et al., 29 Apr 2025).
Models are routinely benchmarked on GSM8K, AIME, AMC, MATH500, OlympiadBench, and related datasets, with evaluation metrics including pass@1, pass@k (sample-based scoring), and average tokens per output.
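For reference, the sample-based pass@k scoring mentioned above is commonly computed with the standard unbiased combinatorial estimator; the sketch below shows that estimator in isolation and is not tied to any particular DeepScaleR evaluation script.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws (without replacement)
    from n samples with c correct ones is correct, i.e., 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=8))   # e.g., 16 rollouts per problem, 4 correct, pass@8
```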
7. Implications, Impact, and Future Directions
DeepScaleR has catalyzed a shift toward efficient, data- and compute-conscious approaches for teaching LLMs to reason. It established that, with appropriate reward engineering, data curation, and training protocols, small models can substantially narrow the gap with much larger counterparts on reasoning tasks. The curriculum and adaptation methods have informed advances in curriculum design (FastCuRL), data augmentation via partial solutions (QuestA), alignment and realignment strategies, and adaptive inference mechanisms.
Open questions remain regarding scaling across domains, the precise theoretical mechanisms underpinning data-efficient RL (such as post-saturation generalization), and the optimal synthesis of exploration and curriculum for multi-step, multi-domain reasoning. Active areas include refinement of reward normalization, extension of adaptive penalties and length control to new settings, and investigation into robust out-of-distribution extrapolation at test time.
DeepScaleR models, benchmarks, and training pipelines are widely available in the open-source community, enabling broad adoption and further experimentation in complex reasoning scenarios. The platform continues to influence work on scalable, practical, and customizable LLM reasoning systems.