
JustRL-Nemotron-1.5B: Single-Stage RL Optimization

Updated 22 December 2025
  • JustRL-Nemotron-1.5B is a 1.5B-parameter model designed for advanced reasoning tasks using a streamlined, single-stage PPO reinforcement learning approach.
  • It employs a fixed hyperparameter setup with binary reward signals, avoiding curriculum learning and staged training to maintain simplicity and stability.
  • The model achieves state-of-the-art performance in both mathematical reasoning benchmarks and tool-calling tasks, demonstrating compute efficiency and robust transferability.

JustRL-Nemotron-1.5B denotes a class of 1.5 billion–parameter LLMs optimized for complex reasoning tasks using a minimal, single-stage reinforcement learning (RL) recipe. Originating in multiple contemporaneous research lines and built most notably on the OpenMath-Nemotron-1.5B and Qwen2.5-1.5B backbones, the methodology is distinguished by its strict adherence to fixed hyperparameters, simple binary reward signals, and the complete avoidance of curriculum learning, staged training, or auxiliary scheduling. This approach challenges the need for complexity in RL pipelines for LLMs, showing that state-of-the-art results can be achieved, even at small model scales, through a stable, uncluttered application of Proximal Policy Optimization (PPO) and related policy-gradient variants (He et al., 18 Dec 2025, Zhang et al., 25 Apr 2025).

1. Model Architectures and Data Regimes

JustRL-Nemotron-1.5B refers to two closely related instantiations:

  • OpenMath-Nemotron-1.5B (reasoning on mathematical problems)
  • Qwen2.5-1.5B (tool-calling with format-constrained JSON outputs)

Both use transformer-decoder architectures with 1.5 billion parameters. Architecturally, Qwen2.5-1.5B is assigned (by public convention rather than direct reporting in (Zhang et al., 25 Apr 2025)) approximately 24 layers, 2048 hidden units, and 16 attention heads. The OpenMath-Nemotron-1.5B backbone relies on a byte pair encoding (BPE) vocabulary shared with the broader OpenMath family and supports a maximum context window of 16,000 tokens (with up to 15,000 tokens for model-generated completions) (He et al., 18 Dec 2025). In the tool-using regime, the model first generates a <think>...</think>-tagged explanation, then a <tool_call>...</tool_call> JSON block satisfying strict syntactic and semantic constraints (Zhang et al., 25 Apr 2025).
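As an illustration of this output contract, the following minimal sketch (not from either paper) shows how a tool-calling completion might be parsed; the tag names follow the description above, while the example call and the helper name are hypothetical.

```python
import json
import re

# Hypothetical example completion in the tool-calling regime: a <think> explanation
# followed by a <tool_call> block containing strict JSON (names are illustrative only).
completion = (
    "<think>The user wants the weather, so I should call get_weather.</think>\n"
    "<tool_call>{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Paris\"}}</tool_call>"
)

def parse_tool_completion(text: str):
    """Return (reasoning, tool_call dict) if the completion satisfies the format, else None."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if think is None or call is None:
        return None  # syntactic constraint violated
    try:
        tool_call = json.loads(call.group(1))  # the block must be valid JSON
    except json.JSONDecodeError:
        return None
    return think.group(1).strip(), tool_call

print(parse_tool_completion(completion))
```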

Data regimes vary by use-case domain: the mathematical JustRL-Nemotron is evaluated on nine held-out math benchmarks, while the tool-use models leverage xLAM (~60,000 APIs) and ToolACE data.

2. Reinforcement Learning Formulation

Both lines operationalize the problem as an episodic, binary-reward RL task, eschewing dense rewards, partial credit, or trajectory-level imitation (He et al., 18 Dec 2025, Zhang et al., 25 Apr 2025).

  • Mathematical Reasoning: At each training step, $N = 8$ rollouts are sampled per prompt. The reward is:

$$r(s, a) = \begin{cases} 1, & \text{if the final boxed answer passes DAPO’s rule-based verifier;} \\ 0, & \text{otherwise.} \end{cases}$$

  • Tool Use: The reward in the tool-calling regime is:

$$r(c, a) = \mathbb{1}_{\text{valid} \,\wedge\, \text{correct}}\big(\text{tool\_call}(a)\big)$$

where both format and exact function/argument correctness are required.
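A minimal sketch of both binary rewards follows. The boxed-answer check is a simplified stand-in for DAPO's rule-based verifier (the actual verifier is more robust), and the tool reward reuses the hypothetical parse_tool_completion helper from the sketch in Section 1.

```python
import re

def extract_boxed(text: str):
    """Toy stand-in for the verifier's answer extraction: grab the last \\boxed{...} span."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary reward for the mathematical regime: 1.0 iff the final boxed answer
    matches the reference (a simplified check; the papers use DAPO's verifier)."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == gold_answer.strip() else 0.0

def tool_reward(completion: str, expected_call: dict) -> float:
    """Binary reward for the tool-calling regime: 1.0 iff the output is well-formed
    AND the function name/arguments match exactly; no partial credit otherwise."""
    parsed = parse_tool_completion(completion)  # parser from the sketch in Section 1
    if parsed is None:
        return 0.0                              # malformed <think>/<tool_call> format
    _, tool_call = parsed
    return 1.0 if tool_call == expected_call else 0.0
```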

Advantage estimation uses Group Relative Policy Optimization (GRPO), with advantages computed as:

$$\hat{A}_t = \sum_{\ell=0}^{T-t} (\gamma\lambda)^{\ell}\,\delta_{t+\ell}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

with $\gamma = 0.99$ and $\lambda = 0.95$. Clipped PPO is used for policy optimization:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and $\epsilon \in \{0.20, 0.28\}$, i.e., an asymmetric clip range of $[0.80, 1.28]$.

Notably, there is no explicit KL-divergence penalty (mathematical regime) or entropy regularization; exploration is promoted solely by stochastic sampling and PPO’s implicit regularization. In the tool-using regime, a small KL coefficient is used with default settings.
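The advantage recursion and clipped surrogate above can be sketched generically as follows; this is not the authors' training code, and the NumPy array layout (one scalar reward and value per timestep) is an assumption for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Backward-recursive form of A_t = sum_l (gamma*lam)^l * delta_{t+l}.
    `values` must have length len(rewards) + 1 so V(s_{t+1}) exists at every step."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_ppo_loss(logp_new, logp_old, advantages, eps_low=0.20, eps_high=0.28):
    """Negative clipped surrogate; the asymmetric clip range [1 - 0.20, 1 + 0.28]
    mirrors the epsilon values quoted above."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped_ratio = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(ratio * advantages, clipped_ratio * advantages)
    return -np.mean(surrogate)
```

With the episodic binary rewards of this section, the `rewards` array is zero at every step except the terminal one.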

3. Training Configuration and Compute Budget

All settings are single-stage, fixed throughout training, and designed for simplicity and reproducibility.

| Hyperparameter | JustRL-Nemotron-1.5B (Math) | JustRL-Nemotron-1.5B (Tool) |
| --- | --- | --- |
| Advantage estimator | GRPO | GRPO |
| KL-loss coefficient | 0 | $1\times10^{-3}$ |
| Entropy bonus | 0 | 0 |
| Train batch size | 256 rollouts | 1024 contexts |
| PPO mini-batch size | 64 | not stated |
| Learning rate | $1\times10^{-6}$ (constant) | $1\times10^{-6}$ |
| Sampling temperature | 1.0 | 0.7 |
| Max prompt/response/context | 1k / 15k / 16k tokens | not specified |
| Hardware (per experiment) | 32× A800 80GB GPUs | 4×8 H100 GPUs |
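Expressed as a configuration sketch, the single-stage math setup might look as follows; the key names are illustrative rather than the authors' actual schema, and the values are transcribed from the table above.

```python
# Illustrative single-stage configuration for the mathematical regime; all values are
# held fixed for the entire run (no schedules, curricula, or staged switches).
justrl_math_config = {
    "advantage_estimator": "grpo",
    "kl_loss_coef": 0.0,            # no explicit KL-divergence penalty
    "entropy_bonus": 0.0,           # no entropy regularization
    "train_batch_size": 256,        # rollouts per training step
    "ppo_mini_batch_size": 64,
    "learning_rate": 1e-6,          # constant, no warmup or decay
    "sampling_temperature": 1.0,
    "rollouts_per_prompt": 8,       # N = 8, as in Section 2
    "max_prompt_tokens": 1_000,
    "max_response_tokens": 15_000,
    "max_context_tokens": 16_000,
    "clip_range": (0.20, 0.28),     # asymmetric PPO clipping from Section 2
}
```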

Total training spans approximately $3.4\times10^{3}$ gradient steps in the mathematical regime, processing $1.1\times10^{11}$ tokens and requiring about $5.0\times10^{20}$ floating-point operations ($\sim$15 days of wall-clock time on the compute listed above) (He et al., 18 Dec 2025, Zhang et al., 25 Apr 2025).

4. Quantitative Performance and Scaling Behavior

Mathematical Reasoning

JustRL-Nemotron-1.5B matches or surpasses the prior state of the art with a compute-efficient approach. On nine held-out mathematical benchmarks (scored with Pass@1 or Pass@4, as appropriate):

  • Overall average accuracy: 64.3% (compared to 63.8% for QuestA and 56.7% for the OpenMath-Nemotron backbone)
  • For example, it achieves 96.0% on AMC 2023 and 69.7% on AIME 2024
  • All benchmarks are listed in detail in (He et al., 18 Dec 2025); see table below.
| Benchmark | Backbone | QuestA | JustRL-Nemotron |
| --- | --- | --- | --- |
| AIME 2024 | 58.75% | 71.56% | 69.69% |
| AIME 2025 | 48.44% | 62.08% | 62.92% |
| AMC 2023 | 90.55% | 93.44% | 96.02% |

(Plus six other benchmark sets; full listing in (He et al., 18 Dec 2025).)

JustRL-Nemotron matches or slightly exceeds QuestA while using 2.4× less compute.

Tool-Using Regime

For Qwen2.5-1.5B on BFCL (tool-use benchmark):

  • SFT baseline: 75.5%
  • JustRL: 76.2% (≈+0.7 percentage point gain)
  • Trends indicate modest improvements at small scale, but much larger gains at 7B and 14B parameters (up to +4.5pp) (Zhang et al., 25 Apr 2025).

5. Ablation Studies and Insights

Systematic ablations illustrate that “standard tricks” prevalent in complex RL pipelines can degrade performance in this setting.

  • Adding explicit length penalties or a robust verifier in the mathematical regime led to severe entropy collapse (policy entropy dropping from ~1.2–1.4 to 0.5–0.6) and a reduction in final accuracy (from 55% to 45% on AIME 2024 after 3,000 steps).
  • In the tool-use regime, enforcing fine-grained partial rewards facilitates reward hacking and degrades generalization, whereas a strict binary reward focusing on both format and execution correctness is optimal (Zhang et al., 25 Apr 2025).
  • Dropping the <think> reasoning constraint in tool-calling reduces accuracy by up to 4 percentage points.

The evidence suggests that for models at this scale, “standard tricks” targeting exploration or output regulation are unnecessary, and in some cases counterproductive.

6. Training Dynamics and Transferability

Training curves for JustRL-Nemotron show monotonic, stable learning absent the collapses or plateaus that motivate adaptive interventions in other systems:

  • Policy entropy oscillates stably (healthy exploration), with no drift or collapse.
  • Mean reward improves smoothly, with no plateaus or sudden drops.
  • Mean response length naturally declines (e.g., from ~8000 to ~4000–5000 tokens) without any explicit length penalty.

The same fixed hyperparameters, including batch size and learning rate, transfer across both DeepSeek-Qwen-1.5B and OpenMath-Nemotron-1.5B with no per-backbone tuning; both reach state-of-the-art results for their class (He et al., 18 Dec 2025).

7. Comparative Analysis and Limitations

JustRL outperforms supervised fine-tuning (SFT) and SFT-then-RL baselines due to:

  • Flexible generalization (binary format+functional reward accommodates acceptable variance in JSON outputs and tool-calls)
  • Lightweight supervision (no need for trajectory-level distillation or trace annotation)
  • Reduced reward hacking (removal of partial credit reduces exploitative behavior)
  • Increased effectiveness with larger models (limited gains seen at <1.5B parameters, much larger improvements at 7B/14B) (Zhang et al., 25 Apr 2025)

Limitations include diminished RL signal at smaller model scale, sensitivity to format correctness, and higher training costs relative to SFT for equivalent parameter counts.


In summary, JustRL-Nemotron-1.5B demonstrates that, for both mathematical reasoning and tool-calling domains, a stable, single-stage PPO-based RL regime with fixed, minimal hyperparameters suffices to attain or surpass more complex, resource-intensive approaches. The methodology offers a robust, reproducible baseline and suggests that escalation of RL pipeline complexity may not be justified at these scales (He et al., 18 Dec 2025, Zhang et al., 25 Apr 2025).
