JustRL: Minimal RL for LLM Math Reasoning
- JustRL is a minimalistic reinforcement learning framework for large language models that enhances mathematical reasoning with a single-stage pipeline.
- It employs fixed hyperparameters and a streamlined PPO approach to achieve SOTA performance while reducing compute usage by nearly 50%.
- Empirical results on DeepSeek and Nemotron backbones demonstrate higher Pass@1 accuracy on rigorous math benchmarks compared to complex, multi-stage methods.
JustRL is a minimalistic reinforcement learning (RL) framework for LLMs, designed to maximize mathematical reasoning capability with a single-stage RL pipeline and fixed hyperparameters. Contrasting with prevailing research trends that favor increasing complexity—multi-stage curricula, dynamic scheduling, explicit stabilization tricks—JustRL demonstrates that a streamlined protocol can match or surpass state-of-the-art (SOTA) performance on 1.5B-parameter Transformer LLMs while cutting compute requirements by half (He et al., 18 Dec 2025).
1. Foundational Motivation and Scope
JustRL addresses a central question in LLM RL research: Is the typical complexity of RL training pipelines necessary for stability and SOTA results, or does simplicity suffice at scale? Commonly deployed practices—length penalties, curriculum learning, KL resets, adaptive schedules—are often seen as essential for small or unstable models. JustRL hypothesizes that, given sufficiently scaled baselines and robust minimally tuned recipes, much of this complexity is superfluous. The focus is strictly on mathematical reasoning with 1.5B parameter decoder-only Transformers, pre-trained and distilled before applying RL fine-tuning.
2. Model Architectures and RL Objective
2.1 Backbone Models
JustRL evaluates two mainstream 1.5B backbone architectures:
- DeepSeek-R1-Distill-Qwen-1.5B
- OpenMath-Nemotron-1.5B
Both employ standard decoder-only Transformer designs: 24 layers, hidden size ≈2048, 32 attention heads, rotary position embeddings, and a 16k-token context window. Initial weights are established via conventional pre-training followed by distillation from outputs of larger LLMs.
2.2 RL Formulation
The LLM acts as a stochastic policy $\pi_\theta(y \mid x)$ mapping a prompt $x$ to a response sequence $y = (y_1, \ldots, y_T)$. Rewards are delivered by a binary rule-based verifier $R(x, y)$. Optimization utilizes the fixed-clipping variant of Proximal Policy Optimization (PPO), termed GRPO in veRL nomenclature.
For each token $t$, the probability ratio is
$$r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}.$$
The clipped loss is
$$\mathcal{L}(\theta) = -\,\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clip range. No additional KL penalty or entropy bonus is employed.
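The objective is easy to state in code. The following PyTorch sketch is illustrative rather than the authors' implementation; tensor shapes, the `clip_eps` default, and the group-normalized advantage helper are assumptions drawn from standard GRPO practice.

```python
import torch

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2, mask=None):
    """Token-level PPO/GRPO clipped loss (illustrative sketch, not the released code).

    logp_new:   [B, T] log-probs of the sampled tokens under the current policy
    logp_old:   [B, T] log-probs under the rollout (old) policy
    advantages: [B, T] advantage estimates broadcast to every response token
    clip_eps:   clip range epsilon (default here is an assumption, not from the paper)
    mask:       [B, T] 1 for response tokens, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)               # negated min => loss
    if mask is None:
        return per_token.mean()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

def group_normalized_advantages(rewards):
    """GRPO-style advantage: normalize verifier rewards within one rollout group.

    rewards: [G] rewards for G rollouts of the same prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```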
3. Hyperparameter Configuration
A single, constant configuration governs both backbone models throughout RL fine-tuning. No dynamic schedules or curriculum are introduced.
| Hyperparameter | Value/Setting | Rationale |
|---|---|---|
| Batch size | 256 (across 32 A800-80GB GPUs) | Sufficient throughput for large models |
| Learning rate | Constant (no decay) | No schedule required for stable training |
| Prompt length | Fixed token cap | Fits within context window limits |
| Response length | Bounded by the $16k$ context | Empirically stable without a length penalty |
| Rollouts/sample | 8 per example | Standard RLHF/PPO practice |
| PPO mini-batch | 64 examples; micro-batch per GPU=1 | Large batch, stable gradient estimates |
| Clip ratio | Fixed | Avoids instability, matches PPO best practice |
| Sampling temperature | 1.0 (training); 0.7 (eval) | Basic stochasticity for exploration/evaluation |
| Reward verifier | Rule-based DAPO (binary); CompassVerifier-3B (eval) | Easily deployed and interpretable |
This configuration enables smooth training without the need for dynamic adjustment. It reflects field-wide consensus on PPO for LLM RLHF while showing generality across backbone architectures.
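For reference, the fixed recipe can be restated as a single static configuration. The Python sketch below mirrors the table; key names are invented for readability (they are not veRL config fields), and values that this summary does not state are deliberately left as `None` rather than guessed.

```python
# Illustrative restatement of the fixed JustRL recipe from the table above.
# Key names are made up for readability; they are NOT veRL config fields.
# Values set to None are not stated in this summary and are left unfilled.
JUSTRL_CONFIG = {
    "train_batch_size": 256,        # prompts per step, across 32x A800-80GB GPUs
    "learning_rate": None,          # constant; no decay or schedule
    "max_prompt_tokens": None,      # fixed cap within the 16k context window
    "max_response_tokens": None,    # bounded by the 16k context window
    "rollouts_per_prompt": 8,       # GRPO group size
    "ppo_mini_batch_size": 64,      # examples per PPO update
    "micro_batch_per_gpu": 1,
    "clip_ratio": None,             # fixed PPO clip range
    "kl_penalty": 0.0,              # no KL term
    "entropy_bonus": 0.0,           # no entropy bonus
    "train_temperature": 1.0,
    "eval_temperature": 0.7,
    "reward": "rule-based binary verifier (DAPO); CompassVerifier-3B for eval",
}
```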
4. Training and Evaluation Protocols
4.1 Compute Regimen
JustRL-DeepSeek was trained for 4,380 gradient steps (≈15 days on 32 A800 GPUs, ≈1.4×10⁸k tokens). JustRL-Nemotron required 3,440 steps on the same hardware, processing ≈1.1×10⁸k tokens. Training is continuous—no multi-stage curricula, length annealing, or switching between phases.
4.2 Benchmarks and Sampling
Evaluation employs Pass@1 accuracy averaged over nine mathematically rigorous benchmarks:
- AIME 2024/25
- AMC 2023
- MATH-500
- Minerva Math
- OlympiadBench
- HMMT Feb 2025
- CMIMC 2025
- BRUMO 2025
For most benchmarks, multiple samples are generated per prompt to estimate Pass@1; for MATH-500, Minerva Math, and OlympiadBench, fewer samples per prompt are used due to benchmark size. Generation settings for evaluation are temperature $0.7$, top-$p$ sampling, and a maximum of $32k$ tokens.
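Pass@1 here is the mean correctness over the sampled completions of each problem, averaged over problems and then over benchmarks. A minimal sketch under that assumption, with hypothetical function names and toy data:

```python
from statistics import mean

def pass_at_1(verdicts_per_problem):
    """Estimate Pass@1 for one benchmark.

    verdicts_per_problem: list of lists of 0/1 verifier verdicts,
    one inner list per problem (k sampled completions each).
    """
    return mean(mean(verdicts) for verdicts in verdicts_per_problem)

def benchmark_average(per_benchmark_verdicts):
    """Unweighted average of Pass@1 across benchmarks, as reported."""
    return mean(pass_at_1(v) for v in per_benchmark_verdicts.values())

# Toy example: two tiny benchmarks with 4 samples per problem.
score = benchmark_average({
    "AIME-2024": [[1, 0, 1, 1], [0, 0, 1, 0]],
    "MATH-500":  [[1, 1, 1, 1], [0, 1, 0, 1]],
})
print(f"Average Pass@1: {score:.3f}")
```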
5. Empirical Results and Comparative Analysis
5.1 Performance
| Model | Initial Avg. | SOTA Baseline | JustRL Score | Compute Used | Result Synopsis |
|---|---|---|---|---|---|
| JustRL-DeepSeek | 37.65% | ProRL-V2: 53.08% | 54.87% | 1.4×10⁸k (<½ ProRL-V2) | Wins 6/9, single stage |
| JustRL-Nemotron | 56.74% | QuestA: 63.81% | 64.32% | 1.1×10⁸k (<½ QuestA) | Wins 5/9, efficient SOTA |
JustRL outperforms dynamic multi-stage competitors (ProRL-V2, QuestA) in both accuracy and compute efficiency on these mathematical reasoning benchmarks. Whether starting from the weaker (DeepSeek) or stronger (Nemotron) backbone, single-stage RL improves performance substantially without additional complex interventions.
5.2 Training Dynamics
- Policy entropy remains stable (oscillates around 1.2–1.4).
- Mean reward increases monotonically (–0.6 → +0.4).
- Response length decays naturally (from ∼8,000 to ∼4,000-5,000 tokens) without explicit penalty mechanisms.
Absence of plateaus, collapses, or unexpected training instabilities signals that curriculum schedules and penalties may be compensating for effects not inherent to large, stable baselines.
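These three curves (token-level policy entropy, mean verifier reward, mean response length) are inexpensive to log every step. A minimal monitoring sketch, assuming per-token logits, a response mask, and per-rollout rewards are available from the training loop (all names are hypothetical):

```python
import torch
import torch.nn.functional as F

def rollout_diagnostics(logits, response_mask, rewards):
    """Compute the three training-dynamics curves discussed above for one batch.

    logits:        [B, T, V] policy logits over the vocabulary
    response_mask: [B, T]    1 for generated response tokens, 0 elsewhere
    rewards:       [B]       verifier rewards per rollout
    """
    logp = F.log_softmax(logits, dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1)             # [B, T]
    mean_entropy = (token_entropy * response_mask).sum() / response_mask.sum()
    mean_reward = rewards.float().mean()
    mean_length = response_mask.sum(dim=-1).float().mean()
    return {
        "policy_entropy": mean_entropy.item(),   # reported stable around 1.2-1.4
        "mean_reward": mean_reward.item(),       # reported rising roughly -0.6 -> +0.4
        "response_length": mean_length.item(),   # reported decaying ~8k -> 4-5k tokens
    }
```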
6. Ablation and Intervention Analysis
Ablations beginning with JustRL-DeepSeek uncovered the deleterious effects of commonly deployed stabilization tricks:
- Adding an “overlong penalty” drives performance to a plateau at ∼50% (from ∼55%).
- Introducing the DeepScaleR robust verifier degrades outcomes further (plateau at ∼45%).
Both interventions collapse policy entropy (∼0.5–0.6) and degrade performance, demonstrating that they suppress exploration under minimal, stable RL conditions. This cautions against routine use of length penalties and aggressive verifiers without first evaluating effects on simple baselines.
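For concreteness, an “overlong penalty” of the kind ablated here is typically a soft, DAPO-style length shaping added to the verifier reward. The sketch below is an assumed generic form with illustrative thresholds, not the exact shaping used in the ablation:

```python
def overlong_penalty(response_len, max_len=16_384, buffer=4_096):
    """Soft length penalty added to the verifier reward (illustrative only).

    Thresholds are assumptions for illustration:
    - no penalty while response_len <= max_len - buffer
    - linearly increasing penalty inside the buffer zone
    - full -1.0 penalty once the response exceeds max_len
    """
    soft_cap = max_len - buffer
    if response_len <= soft_cap:
        return 0.0
    if response_len <= max_len:
        return (soft_cap - response_len) / buffer   # value in (-1, 0]
    return -1.0

def shaped_reward(verifier_reward, response_len):
    # The ablation suggests this shaping collapses entropy and caps accuracy,
    # which is why the released JustRL recipe omits it entirely.
    return verifier_reward + overlong_penalty(response_len)
```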
7. Field Implications, Open Questions, and Recommendations
JustRL’s findings advocate a “less is more” approach in which single-stage PPO, constant hyperparameters, and basic prompting suffice for SOTA results at this scale, with compute savings of roughly 2× compared to elaborate alternatives. This suggests that stabilization mechanisms and complexity, often added incrementally, may be compensating for instabilities introduced by earlier interventions rather than for inherent difficulties of the RL paradigm.
A plausible implication is that future work should anchor evaluations on robust, minimal protocols prior to introducing additional mechanisms, reserving complexity for clear empirical failures rather than prophylactic application. Key open questions remain regarding generalizability to coding, scaling to larger models, and robustness under noisy or less precise reward definitions.
JustRL establishes a simple and validated baseline for 1.5B-scale LLM mathematical reasoning, urging the research community to rigorously benchmark complex techniques against pared-down, stable alternatives before widespread adoption (He et al., 18 Dec 2025).