
JustRL: Minimal RL for LLM Math Reasoning

Updated 19 December 2025
  • JustRL is a minimalistic reinforcement learning framework for large language models that enhances mathematical reasoning with a single-stage pipeline.
  • It employs fixed hyperparameters and a streamlined PPO approach to achieve SOTA performance while reducing compute usage by nearly 50%.
  • Empirical results on DeepSeek and Nemotron backbones demonstrate higher Pass@1 accuracy on rigorous math benchmarks compared to complex, multi-stage methods.

JustRL is a minimalistic reinforcement learning (RL) framework for LLMs, designed to maximize mathematical reasoning capability with a single-stage RL pipeline and fixed hyperparameters. Contrasting with prevailing research trends that favor increasing complexity—multi-stage curricula, dynamic scheduling, explicit stabilization tricks—JustRL demonstrates that a streamlined protocol can match or surpass state-of-the-art (SOTA) performance on 1.5B-parameter Transformer LLMs while cutting compute requirements by half (He et al., 18 Dec 2025).

1. Foundational Motivation and Scope

JustRL addresses a central question in LLM RL research: Is the typical complexity of RL training pipelines necessary for stability and SOTA results, or does simplicity suffice at scale? Commonly deployed practices—length penalties, curriculum learning, KL resets, adaptive schedules—are often seen as essential for small or unstable models. JustRL hypothesizes that, given sufficiently scaled baselines and a robust, minimally tuned recipe, much of this complexity is superfluous. The focus is strictly on mathematical reasoning with 1.5B-parameter decoder-only Transformers, pre-trained and distilled before applying RL fine-tuning.

2. Model Architectures and RL Objective

2.1 Backbone Models

JustRL evaluates two mainstream 1.5B backbone architectures:

  • A DeepSeek-distilled 1.5B model (the starting point for JustRL-DeepSeek)
  • A Nemotron-based 1.5B model (the starting point for JustRL-Nemotron)

Both employ standard decoder-only Transformer designs: 24 layers, hidden size ≈2048, 32 attention heads, rotary position embeddings, and a 16k-token context window. Initial weights are established via conventional pre-training followed by distillation from larger LLM outputs.
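
As a concrete reference point, a hypothetical configuration dictionary matching these specifications might look as follows; the vocabulary and feed-forward sizes are illustrative placeholders and are not taken from the paper.

```python
# Hypothetical decoder-only Transformer configuration matching the specs above
# (24 layers, hidden size ~2048, 32 attention heads, rotary embeddings, 16k context).
# Vocabulary and feed-forward sizes are placeholders, not values from the paper.
backbone_config = {
    "architecture": "decoder-only-transformer",
    "num_hidden_layers": 24,
    "hidden_size": 2048,
    "num_attention_heads": 32,          # head_dim = 2048 / 32 = 64
    "position_embedding": "rotary",
    "max_position_embeddings": 16_384,  # 16k-token context window
    "vocab_size": 128_000,              # placeholder
    "intermediate_size": 8_192,         # placeholder FFN width
}
```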

2.2 RL Formulation

The LLM acts as a stochastic policy $\pi_\theta$ mapping a prompt $x$ to a response sequence $y$. Rewards are delivered by a binary rule-based verifier $r(y, x) \in \{0, 1\}$. Optimization utilizes the fixed-clipping variant of Proximal Policy Optimization (PPO)—GRPO in veRL nomenclature:

$$J(\theta) = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta}\left[ r(y, x) \right]$$

For each token, the probability ratio is

$$r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_\theta^{\text{ref}}(y_t \mid y_{<t}, x)}$$

The clipped loss is

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min \big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\big)\, \hat{A}_t \big) \Big]$$

where $\hat{A}_t$ is the advantage estimate and the clip bounds are asymmetric, constraining the ratio to $[1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}] = [0.8, 1.28]$. No additional KL penalty or entropy bonus is employed.
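
To make the objective concrete, the following is a minimal PyTorch sketch of the clipped surrogate with group-relative (GRPO-style) advantages, using the asymmetric clip bounds from Section 3. Tensor shapes, the token-level averaging, and all names are illustrative assumptions rather than the released implementation.

```python
import torch

def grpo_clipped_loss(logp_new, logp_old, rewards, response_mask,
                      clip_low=0.8, clip_high=1.28):
    """Clipped PPO/GRPO surrogate with group-relative advantages (sketch).

    logp_new, logp_old: (prompts, rollouts, seq_len) per-token log-probs
    rewards:            (prompts, rollouts) binary verifier rewards
    response_mask:      (prompts, rollouts, seq_len) 1 on response tokens
    """
    rewards = rewards.float()
    # Group-relative advantage: standardize rewards within each prompt's rollouts.
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) / (
        rewards.std(dim=1, keepdim=True) + 1e-6
    )
    adv = adv.unsqueeze(-1)  # broadcast the sequence-level advantage over tokens

    # Per-token probability ratio against the rollout policy.
    ratio = torch.exp(logp_new - logp_old)

    # Asymmetric clipping keeps the ratio inside [clip_low, clip_high] = [0.8, 1.28].
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, clip_low, clip_high) * adv
    per_token = torch.minimum(unclipped, clipped)

    # Token-level average over response tokens; negate so optimizers can minimize.
    return -(per_token * response_mask).sum() / response_mask.sum().clamp(min=1)
```

Whether the surrogate is averaged per token across the batch (as here) or per sequence is a framework-level choice; the sketch simply picks one.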

3. Hyperparameter Configuration

A single, constant configuration governs both backbone models throughout RL fine-tuning. No dynamic schedules or curriculum are introduced.

| Hyperparameter | Value/Setting | Rationale |
| --- | --- | --- |
| Batch size | 256 (across 32 A800-80GB GPUs) | Sufficient throughput for large models |
| Learning rate | $1 \times 10^{-6}$, constant | No decay or schedule required for stable training |
| Prompt length | ≤ 1k tokens | Fits within context-window limits |
| Response length | ≤ 15k tokens (16k max context) | Empirically stable without penalty |
| Rollouts per example | 8 | Standard RLHF/PPO practice |
| PPO mini-batch | 64 examples; micro-batch per GPU = 1 | Large batch, stable gradient estimates |
| Clip ratio | [0.8, 1.28] | Avoids instability, matches PPO best practice |
| Sampling temperature | 1.0 (training); 0.7 (evaluation) | Basic stochasticity for exploration/evaluation |
| Reward verifier | Rule-based DAPO verifier (binary, training); CompassVerifier-3B (evaluation) | Easily deployed and interpretable |

This configuration enables smooth training without the need for dynamic adjustment. It reflects field-wide consensus on PPO for LLM RLHF while showing generality across backbone architectures.
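
The same settings can be summarized as a flat configuration dictionary, shown below purely for illustration; the key names do not correspond to any particular framework's schema, and values such as the exact response-length cap are nominal.

```python
# Illustrative flat configuration mirroring the table above. Key names are
# hypothetical and do not follow any specific framework's schema; the response
# cap is nominal (the table states <= 15k tokens within a 16k context).
justrl_config = {
    "train_batch_size": 256,         # prompts per step, across 32x A800-80GB GPUs
    "learning_rate": 1e-6,           # constant; no warmup, decay, or schedule
    "max_prompt_length": 1024,       # <= 1k prompt tokens
    "max_response_length": 15_000,   # <= 15k response tokens
    "rollouts_per_prompt": 8,
    "ppo_mini_batch_size": 64,
    "micro_batch_size_per_gpu": 1,
    "clip_ratio_low": 0.8,           # lower ratio bound
    "clip_ratio_high": 1.28,         # upper ratio bound (clip-higher style)
    "kl_penalty_coef": 0.0,          # no KL penalty
    "entropy_bonus_coef": 0.0,       # no entropy bonus
    "train_temperature": 1.0,
    "eval_temperature": 0.7,
    "reward_fn": "rule_based_binary_verifier",  # DAPO-style during training
}
```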

4. Training and Evaluation Protocols

4.1 Compute Regimen

JustRL-DeepSeek was trained for 4,380 gradient steps (≈15 days on 32 A800 GPUs, ≈1.4×10⁸k tokens). JustRL-Nemotron required 3,440 steps on the same hardware, processing ≈1.1×10⁸k tokens. Training is continuous—no multi-stage curricula, length annealing, or switching between phases.

4.2 Benchmarks and Sampling

Evaluation employs Pass@1 accuracy averaged over nine mathematically rigorous benchmarks:

  • AIME 2024 and AIME 2025
  • AMC 2023
  • MATH-500
  • Minerva Math
  • OlympiadBench
  • HMMT Feb 2025
  • CMIMC 2025
  • BRUMO 2025

For most benchmarks, $N=32$ samples are generated per prompt; for MATH-500, Minerva Math, and OlympiadBench, $N=4$ is used due to their scale. Generation settings for evaluation are temperature = 0.7, top-$p$ = 0.9, and a maximum of 32k tokens.
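
Under this protocol, Pass@1 for each prompt is estimated as the fraction of its $N$ samples judged correct, then averaged over prompts. A minimal sketch is shown below; the verifier call is a stand-in for the rule-based or CompassVerifier judgement, not an actual API.

```python
from typing import Callable, Sequence

def pass_at_1(samples_per_prompt: Sequence[Sequence[str]],
              references: Sequence[str],
              is_correct: Callable[[str, str], bool]) -> float:
    """Mean Pass@1 across prompts, estimated from N samples per prompt.

    samples_per_prompt[i] holds the N sampled responses for prompt i;
    references[i] is the ground-truth answer; is_correct stands in for the
    binary verifier (rule-based or CompassVerifier-style).
    """
    per_prompt = []
    for samples, reference in zip(samples_per_prompt, references):
        n_correct = sum(is_correct(sample, reference) for sample in samples)
        per_prompt.append(n_correct / len(samples))  # unbiased Pass@1 estimate
    return sum(per_prompt) / len(per_prompt)
```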

5. Empirical Results and Comparative Analysis

5.1 Performance

| Model | Initial Avg. | SOTA Baseline | JustRL Score | Compute Used | Result Synopsis |
| --- | --- | --- | --- | --- | --- |
| JustRL-DeepSeek | 37.65% | ProRL-V2: 53.08% | 54.87% | 1.4×10⁸k tokens (< ½ of ProRL-V2) | Wins 6/9, single stage |
| JustRL-Nemotron | 56.74% | QuestA: 63.81% | 64.32% | 1.1×10⁸k tokens (< ½ of QuestA) | Wins 5/9, efficient SOTA |

JustRL outperforms dynamic multi-stage competitors (ProRL-V2, QuestA) in both accuracy and compute efficiency on these mathematical reasoning benchmarks. Whether starting from the weaker (DeepSeek) or stronger (Nemotron) initial backbone, single-stage RL improves performance substantially without additional complex interventions.

5.2 Training Dynamics

  • Policy entropy remains stable (oscillates around 1.2–1.4).
  • Mean reward increases monotonically (–0.6 → +0.4).
  • Response length decays naturally (from ∼8,000 to ∼4,000–5,000 tokens) without explicit penalty mechanisms (see the monitoring sketch after this list).
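
A minimal sketch of how these diagnostics could be computed from a single rollout batch is given below; it assumes PyTorch tensors with per-token log-probabilities and a response mask, and is not drawn from the released training code.

```python
import torch

def rollout_diagnostics(token_logprobs, response_mask, rewards, response_lengths):
    """Track the three curves above from one rollout batch (sketch).

    token_logprobs:   (batch, seq_len) log-probs of the sampled tokens
    response_mask:    (batch, seq_len) 1 on generated tokens, 0 elsewhere
    rewards:          (batch,) scalar reward per rollout
    response_lengths: (batch,) number of generated tokens per rollout
    """
    # Monte-Carlo entropy estimate: mean negative log-prob of the sampled tokens.
    entropy = -(token_logprobs * response_mask).sum() / response_mask.sum()
    return {
        "policy_entropy": entropy.item(),
        "mean_reward": rewards.float().mean().item(),
        "mean_response_length": response_lengths.float().mean().item(),
    }
```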

The absence of plateaus, collapses, or other unexpected training instabilities suggests that curriculum schedules and penalties in more elaborate pipelines may be compensating for problems that are not inherent to large, stable baselines.

6. Ablation and Intervention Analysis

Ablations starting from the JustRL-DeepSeek setup uncovered deleterious effects of two commonly deployed stabilization tricks:

  • Adding an “overlong penalty” (one common form is sketched after this list) drives performance to a plateau at ∼50% (versus ∼55% without it).
  • Introducing the DeepScaleR robust verifier degrades outcomes further (plateau at ∼45%).
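
For reference, the “overlong penalty” ablated above is commonly implemented as a DAPO-style soft length punishment added to the verifier reward; the sketch below shows one such formulation with illustrative thresholds (the exact variant and settings used in the ablation may differ).

```python
def overlong_penalty(length: int, max_len: int = 15_000, buffer: int = 3_000) -> float:
    """DAPO-style soft overlong punishment (illustrative thresholds).

    Returns 0 for responses shorter than max_len - buffer, a linearly growing
    penalty inside the final `buffer` tokens, and -1 beyond max_len. The shaped
    reward would then be r_verifier + overlong_penalty(response_length).
    """
    soft_start = max_len - buffer
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer  # linearly from 0 down to -1
    return -1.0
```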

Both interventions collapse policy entropy (to ∼0.5–0.6) and degrade performance, demonstrating that they suppress exploration under minimal, stable RL conditions. This cautions against routine use of length penalties and aggressive verifiers without first evaluating their effects on simple baselines.

7. Field Implications, Open Questions, and Recommendations

JustRL’s findings advocate a “less is more” approach in which single-stage PPO, constant hyperparameters, and basic prompting suffice for SOTA RL fine-tuning at this scale, with roughly 2× compute savings over elaborate alternatives. They suggest that the stabilization machinery and complexity often added incrementally may be compensating for instabilities introduced by earlier, unnecessary interventions rather than for difficulties inherent to the RL paradigm.

A plausible implication is that future work should anchor evaluations on robust, minimal protocols prior to introducing additional mechanisms, reserving complexity for clear empirical failures rather than prophylactic application. Key open questions remain regarding generalizability to coding, scaling to larger models, and robustness under noisy or less precise reward definitions.

JustRL establishes a simple and validated baseline for 1.5B-scale LLM mathematical reasoning, urging the research community to rigorously benchmark complex techniques against pared-down, stable alternatives before widespread adoption (He et al., 18 Dec 2025).
