JustRL: Minimal RL for LLM Math Reasoning
- JustRL is a minimalistic reinforcement learning framework for large language models that enhances mathematical reasoning with a single-stage pipeline.
- It employs fixed hyperparameters and a streamlined PPO approach to achieve SOTA performance while reducing compute usage by nearly 50%.
- Empirical results on DeepSeek and Nemotron backbones demonstrate higher Pass@1 accuracy on rigorous math benchmarks compared to complex, multi-stage methods.
JustRL is a minimalistic reinforcement learning (RL) framework for LLMs, designed to maximize mathematical reasoning capability with a single-stage RL pipeline and fixed hyperparameters. Contrasting with prevailing research trends that favor increasing complexity—multi-stage curricula, dynamic scheduling, explicit stabilization tricks—JustRL demonstrates that a streamlined protocol can match or surpass state-of-the-art (SOTA) performance on 1.5B-parameter Transformer LLMs while cutting compute requirements by half (He et al., 18 Dec 2025).
1. Foundational Motivation and Scope
JustRL addresses a central question in LLM RL research: Is the typical complexity of RL training pipelines necessary for stability and SOTA results, or does simplicity suffice at scale? Commonly deployed practices—length penalties, curriculum learning, KL resets, adaptive schedules—are often seen as essential for small or unstable models. JustRL hypothesizes that, given sufficiently scaled baselines and robust minimally tuned recipes, much of this complexity is superfluous. The focus is strictly on mathematical reasoning with 1.5B parameter decoder-only Transformers, pre-trained and distilled before applying RL fine-tuning.
2. Model Architectures and RL Objective
2.1 Backbone Models
JustRL evaluates two mainstream 1.5B backbone architectures:
- DeepSeek-R1-Distill-Qwen-1.5B
- OpenMath-Nemotron-1.5B
Both employ standard decoder-only Transformer designs: 24 layers, hidden size ≈2048, 32 attention heads, rotary position embeddings, and a 16k-token context window. Initial weights are established via conventional pre-training followed by distillation from outputs of larger LLMs.
2.2 RL Formulation
The LLM acts as a stochastic policy $\pi_\theta(y \mid x)$ mapping a prompt $x$ to a response sequence $y = (y_1, \ldots, y_T)$. Rewards are delivered by a binary rule-based verifier $R(x, y)$. Optimization utilizes the fixed-clipping variant of Proximal Policy Optimization (PPO), termed GRPO in veRL nomenclature.
For each token $t$, the probability ratio is
$$r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}.$$
The clipped loss is
$$\mathcal{L}(\theta) = -\,\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clip range. No additional KL penalty or entropy bonus is employed.
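The objective is easy to state in code. The following PyTorch sketch is illustrative rather than the authors' implementation; tensor shapes, the `clip_eps` default, and the group-normalized advantage helper are assumptions drawn from standard GRPO practice.

```python
import torch

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2, mask=None):
    """Token-level PPO/GRPO clipped loss (illustrative sketch, not the released code).

    logp_new:   [B, T] log-probs of the sampled tokens under the current policy
    logp_old:   [B, T] log-probs under the rollout (old) policy
    advantages: [B, T] advantage estimates broadcast to every response token
    clip_eps:   clip range epsilon (default here is an assumption, not from the paper)
    mask:       [B, T] 1 for response tokens, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)               # negated min => loss
    if mask is None:
        return per_token.mean()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

def group_normalized_advantages(rewards):
    """GRPO-style advantage: normalize verifier rewards within one rollout group.

    rewards: [G] rewards for G rollouts of the same prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```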
3. Hyperparameter Configuration
A single, constant configuration governs both backbone models throughout RL fine-tuning. No dynamic schedules or curriculum are introduced.
| Hyperparameter | Value/Setting | Rationale |
|---|---|---|
| Batch size | 256 (across 32 A800-80GB GPUs) | Sufficient throughput for large models |
| Learning rate | Constant (no decay) | No schedule required for stable training |
| Prompt length | Fixed token cap | Fits within context window limits |
| Response length | Bounded by the $16k$ context | Empirically stable without a length penalty |
| Rollouts/sample | 8 per example | Standard RLHF/PPO practice |
| PPO mini-batch | 64 examples; micro-batch per GPU=1 | Large batch, stable gradient estimates |
| Clip ratio | Fixed | Avoids instability, matches PPO best practice |
| Sampling temperature | 1.0 (training); 0.7 (eval) | Basic stochasticity for exploration/evaluation |
| Reward verifier | Rule-based DAPO (binary); CompassVerifier-3B (eval) | Easily deployed and interpretable |
This configuration enables smooth training without the need for dynamic adjustment. It reflects field-wide consensus on PPO for LLM RLHF while showing generality across backbone architectures.
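For reference, the fixed recipe can be restated as a single static configuration. The Python sketch below mirrors the table; key names are invented for readability (they are not veRL config fields), and values that this summary does not state are deliberately left as `None` rather than guessed.

```python
# Illustrative restatement of the fixed JustRL recipe from the table above.
# Key names are made up for readability; they are NOT veRL config fields.
# Values set to None are not stated in this summary and are left unfilled.
JUSTRL_CONFIG = {
    "train_batch_size": 256,        # prompts per step, across 32x A800-80GB GPUs
    "learning_rate": None,          # constant; no decay or schedule
    "max_prompt_tokens": None,      # fixed cap within the 16k context window
    "max_response_tokens": None,    # bounded by the 16k context window
    "rollouts_per_prompt": 8,       # GRPO group size
    "ppo_mini_batch_size": 64,      # examples per PPO update
    "micro_batch_per_gpu": 1,
    "clip_ratio": None,             # fixed PPO clip range
    "kl_penalty": 0.0,              # no KL term
    "entropy_bonus": 0.0,           # no entropy bonus
    "train_temperature": 1.0,
    "eval_temperature": 0.7,
    "reward": "rule-based binary verifier (DAPO); CompassVerifier-3B for eval",
}
```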
4. Training and Evaluation Protocols
4.1 Compute Regimen
JustRL-DeepSeek was trained for 4,380 gradient steps (≈15 days on 32 A800 GPUs, ≈1.4×10⁸k tokens). JustRL-Nemotron required 3,440 steps on the same hardware, processing ≈1.1×10⁸k tokens. Training is continuous—no multi-stage curricula, length annealing, or switching between phases.
4.2 Benchmarks and Sampling
Evaluation employs Pass@1 accuracy averaged over nine mathematically rigorous benchmarks:
- AIME 2024/25
- AMC 2023
- MATH-500
- Minerva Math
- OlympiadBench
- HMMT Feb 2025
- CMIMC 2025
- BRUMO 2025
For most benchmarks, multiple samples are generated per prompt to estimate Pass@1; for MATH-500, Minerva Math, and OlympiadBench, fewer samples per prompt are used due to benchmark size. Generation settings for evaluation are temperature $0.7$, top-$p$ sampling, and a maximum of $32k$ tokens.
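Pass@1 here is the mean correctness over the sampled completions of each problem, averaged over problems and then over benchmarks. A minimal sketch under that assumption, with hypothetical function names and toy data:

```python
from statistics import mean

def pass_at_1(verdicts_per_problem):
    """Estimate Pass@1 for one benchmark.

    verdicts_per_problem: list of lists of 0/1 verifier verdicts,
    one inner list per problem (k sampled completions each).
    """
    return mean(mean(verdicts) for verdicts in verdicts_per_problem)

def benchmark_average(per_benchmark_verdicts):
    """Unweighted average of Pass@1 across benchmarks, as reported."""
    return mean(pass_at_1(v) for v in per_benchmark_verdicts.values())

# Toy example: two tiny benchmarks with 4 samples per problem.
score = benchmark_average({
    "AIME-2024": [[1, 0, 1, 1], [0, 0, 1, 0]],
    "MATH-500":  [[1, 1, 1, 1], [0, 1, 0, 1]],
})
print(f"Average Pass@1: {score:.3f}")
```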
5. Empirical Results and Comparative Analysis
5.1 Performance
| Model | Initial Avg. | SOTA Baseline | JustRL Score | Compute Used | Result Synopsis |
|---|---|---|---|---|---|
| JustRL-DeepSeek | 37.65% | ProRL-V2: 53.08% | 54.87% | 1.4×10⁸k (<½ ProRL-V2) | Wins 6/9, single stage |
| JustRL-Nemotron | 56.74% | QuestA: 63.81% | 64.32% | 1.1×10⁸k (<½ QuestA) | Wins 5/9, efficient SOTA |
JustRL outperforms dynamic multi-stage competitors (ProRL-V2, QuestA) in both accuracy and compute efficiency on these mathematical reasoning benchmarks. Whether starting from the weaker (DeepSeek) or stronger (Nemotron) backbone, single-stage RL improves performance substantially without additional complex interventions.
5.2 Training Dynamics
- Policy entropy remains stable (oscillates around 1.2–1.4).
- Mean reward increases monotonically (–0.6 → +0.4).
- Response length decays naturally (from ∼8,000 to ∼4,000-5,000 tokens) without explicit penalty mechanisms.
Absence of plateaus, collapses, or unexpected training instabilities signals that curriculum schedules and penalties may be compensating for effects not inherent to large, stable baselines.
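These three curves (token-level policy entropy, mean verifier reward, mean response length) are inexpensive to log every step. A minimal monitoring sketch, assuming per-token logits, a response mask, and per-rollout rewards are available from the training loop (all names are hypothetical):

```python
import torch
import torch.nn.functional as F

def rollout_diagnostics(logits, response_mask, rewards):
    """Compute the three training-dynamics curves discussed above for one batch.

    logits:        [B, T, V] policy logits over the vocabulary
    response_mask: [B, T]    1 for generated response tokens, 0 elsewhere
    rewards:       [B]       verifier rewards per rollout
    """
    logp = F.log_softmax(logits, dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1)             # [B, T]
    mean_entropy = (token_entropy * response_mask).sum() / response_mask.sum()
    mean_reward = rewards.float().mean()
    mean_length = response_mask.sum(dim=-1).float().mean()
    return {
        "policy_entropy": mean_entropy.item(),   # reported stable around 1.2-1.4
        "mean_reward": mean_reward.item(),       # reported rising roughly -0.6 -> +0.4
        "response_length": mean_length.item(),   # reported decaying ~8k -> 4-5k tokens
    }
```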
6. Ablation and Intervention Analysis
Ablations beginning with JustRL-DeepSeek uncovered the deleterious effects of commonly deployed stabilization tricks:
- Adding an “overlong penalty” drives performance to a plateau at ∼50% (from ∼55%).
- Introducing the DeepScaleR robust verifier degrades outcomes further (plateau at ∼45%).
Both interventions collapse policy entropy (∼0.5–0.6) and degrade performance, demonstrating that they suppress exploration under minimal, stable RL conditions. This cautions against routine use of length penalties and aggressive verifiers without first evaluating effects on simple baselines.
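For concreteness, an “overlong penalty” of the kind ablated here is typically a soft, DAPO-style length shaping added to the verifier reward. The sketch below is an assumed generic form with illustrative thresholds, not the exact shaping used in the ablation:

```python
def overlong_penalty(response_len, max_len=16_384, buffer=4_096):
    """Soft length penalty added to the verifier reward (illustrative only).

    Thresholds are assumptions for illustration:
    - no penalty while response_len <= max_len - buffer
    - linearly increasing penalty inside the buffer zone
    - full -1.0 penalty once the response exceeds max_len
    """
    soft_cap = max_len - buffer
    if response_len <= soft_cap:
        return 0.0
    if response_len <= max_len:
        return (soft_cap - response_len) / buffer   # value in (-1, 0]
    return -1.0

def shaped_reward(verifier_reward, response_len):
    # The ablation suggests this shaping collapses entropy and caps accuracy,
    # which is why the released JustRL recipe omits it entirely.
    return verifier_reward + overlong_penalty(response_len)
```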
7. Field Implications, Open Questions, and Recommendations
JustRL’s findings advocate a “less is more” approach in which single-stage PPO, constant hyperparameters, and basic prompting suffice for SOTA results at this scale, with compute savings of roughly 2× compared to elaborate alternatives. This suggests that stabilization mechanisms and complexity, often added incrementally, may be compensating for instabilities introduced by earlier interventions rather than for inherent difficulties of the RL paradigm.
A plausible implication is that future work should anchor evaluations on robust, minimal protocols prior to introducing additional mechanisms, reserving complexity for clear empirical failures rather than prophylactic application. Key open questions remain regarding generalizability to coding, scaling to larger models, and robustness under noisy or less precise reward definitions.
JustRL establishes a simple and validated baseline for 1.5B-scale LLM mathematical reasoning, urging the research community to rigorously benchmark complex techniques against pared-down, stable alternatives before widespread adoption (He et al., 18 Dec 2025).