
Nemotron-Research-Reasoning-Qwen-1.5B

Updated 19 September 2025
  • Nemotron-Research-Reasoning-Qwen-1.5B is a 1.5B-parameter language model that enhances reasoning through prolonged reinforcement learning and modular vector transfer.
  • It employs efficient rollout techniques, KL regularization, and advanced RL strategies to improve performance in math, coding, and logical reasoning tasks.
  • Empirical results show significant pass@1 gains and robust generalization, validating its innovative training methodology and practical efficiency.

Nemotron-Research-Reasoning-Qwen-1.5B refers to a 1.5-billion-parameter reasoning LLM constructed by the Nemotron project. This model embodies modern advancements in fine-grained LLM reasoning, with a particular emphasis on expanding reasoning capacity beyond standard supervised or distilled LLMs. The design and methodology incorporate several key research streams: prolonged reinforcement learning (RL) with advanced stability controls; efficient rollout and filtering methods; integration of “reasoning vector” transfer for modular reasoning ability enhancement; and practical techniques for output efficiency and robustness. This entry provides a comprehensive overview of the model’s architecture, core training methodologies, empirical performance, and technical context as documented in peer-reviewed and open-access research.

1. Model Architecture and Series Placement

Nemotron-Research-Reasoning-Qwen-1.5B builds upon the Qwen series architecture, which is itself a descendant of LLaMA-like Transformer models modified for stability and generalization (Bai et al., 2023). Qwen models, including the 1.8B-scale baseline, employ:

  • Untied input embedding and output projection matrices for improved training dynamics.
  • Rotary Positional Embedding (RoPE) in FP32 for high-precision context extension.
  • RMSNorm with pre-normalization strategy for stabilization, replacing classic layer normalization.
  • SwiGLU nonlinearity and feed-forward dimensions set to (8/3)× hidden size, reducing memory footprint versus the classical 4× configuration (a minimal sketch of such a block follows this list).
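
For concreteness, the following is a minimal PyTorch sketch of a SwiGLU feed-forward block with the (8/3)× expansion described above; the module name, rounding rule, and dimensions are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN with roughly (8/3)x expansion, in the Qwen/LLaMA style."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # (8/3) * hidden_size, rounded up to a multiple of 128 for hardware efficiency
        # (the rounding rule here is an assumption of this sketch).
        intermediate = int(8 * hidden_size / 3)
        intermediate = 128 * ((intermediate + 127) // 128)
        self.gate_proj = nn.Linear(hidden_size, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate(x)) * up(x), then project back to the hidden size.
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

# Example: a 1.5B-scale model might use a hidden size around 1536 (illustrative).
ffn = SwiGLUFeedForward(1536)
out = ffn(torch.randn(2, 8, 1536))
```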

The model is pretrained on trillions of tokens (text and code) using next-token prediction, and supervised finetuning incorporates the ChatML conversational format when alignment data is used. Efficient training leverages Flash Attention, AdamW optimizer, and a cosine learning rate schedule.
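
As a rough illustration of the optimizer and schedule just mentioned, the sketch below wires AdamW to a cosine learning-rate decay in PyTorch on a stand-in module; the learning rate, weight decay, step count, and clipping threshold are placeholder assumptions, not the published training configuration.

```python
import torch

# Stand-in module and data; the real setup is a multi-billion-parameter transformer.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000, eta_min=3e-5)

for step in range(1_000):
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()  # placeholder loss for the sketch
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```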

Nemotron-Research-Reasoning-Qwen-1.5B specifically integrates further RL-based optimization to transcend standard baseline performance. Prolonged RL (ProRL) and stability measures such as KL regularization and periodic reference resets distinguish this variant from earlier Qwen models.

2. Prolonged Reinforcement Learning and Training Dynamics

The central distinguishing methodology for Nemotron-Research-Reasoning-Qwen-1.5B is its use of advanced RL algorithms aimed at expanding the model’s reasoning boundaries. The process is structured around Group Relative Policy Optimization (GRPO), an RL algorithm that computes the advantage function by normalizing the reward within groups of rollouts:

L_{\text{GRPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \min \left( r_\theta(\tau)\, A(\tau),\ \text{clip}\left(r_\theta(\tau),\, 1-\epsilon,\, 1+\epsilon\right) A(\tau) \right) \right]

with A(\tau) = \left(R_\tau - \text{mean}(\{R_i\}_{i \in G(\tau)})\right) / \text{std}(\{R_i\}_{i \in G(\tau)}).
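
The group-relative advantage above can be sketched in a few lines; the function name, group size, and epsilon are illustrative assumptions.

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of rollouts for the same prompt.

    rewards: tensor of shape (group_size,) holding the scalar rewards R_i.
    Returns A_i = (R_i - mean(R)) / (std(R) + eps), the GRPO advantage.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts for one prompt with binary correctness rewards.
advantages = group_relative_advantage(torch.tensor([1., 0., 0., 1., 1., 0., 0., 0.]))
```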

To maintain entropy and prevent output collapse, a KL regularization term with a small coefficient β penalizes deviation from a periodically updated reference model:

L_{\text{KL-RL}}(\theta) = L_{\text{GRPO}}(\theta) - \beta \cdot D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})

When the KL term’s contribution stagnates or dominates, the reference policy and optimizer state are reset to reinvigorate exploration. This structure—periodically refreshed KL anchors, diverse task sampling, and verifiable rewards for binary (math/code) correctness—enables Nemotron-Research-Reasoning-Qwen-1.5B to systematically expand the “reachable” reasoning space.
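
A schematic PyTorch rendering of the clipped, KL-regularized objective is given below, assuming per-token log-probabilities and per-sequence advantages as inputs; the KL estimator, tensor shapes, and default coefficients are assumptions of this sketch, not the released training code.

```python
import torch

def grpo_kl_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=1e-3):
    """Clipped GRPO surrogate with a KL penalty toward a reference policy.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled tokens under
    the current, rollout-time, and reference policies, shape (batch, seq_len);
    logp_old and logp_ref are treated as constants (detached).
    advantages: per-sequence group-relative advantages, shape (batch,).
    """
    ratio = torch.exp(logp_new - logp_old)                       # r_theta, per token
    adv = advantages.unsqueeze(-1)                               # broadcast over tokens
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Sample-based ("k3") estimator of KL(pi_theta || pi_ref) on the sampled tokens.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - beta * kl).mean()

# Periodic "hard reset" (sketch): copy the current policy into the reference model
# and reinitialize the optimizer when the KL term stagnates or dominates, e.g.:
#   ref_model.load_state_dict(policy_model.state_dict())
#   optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-6)
```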

3. Enhancement and Efficiency via Auxiliary Techniques

The ProRL backbone is augmented by several recent advances to improve efficiency and performance:

  • GRESO (GRPO with Efficient Selective Rollout) (Zheng et al., 2 Jun 2025): By tracking prompts with zero-variance rewards over epochs, rollouts can be pre-emptively skipped for prompts that are consistently uninformative, yielding up to 2.4× wall-clock speedup in rollout and 2.0× total training time reduction without degrading accuracy.
  • ShorterBetter (Sample Optimal Length RL) (Yi et al., 30 Apr 2025): The RL reward incorporates the Sample Optimal Length (SOL), the minimum length among correct completions in a group, driving the model to both maximize accuracy and minimize redundant reasoning steps. The reward is explicitly formulated as r_i = \alpha \cdot I(\hat{y}_i = y_{gt}) - \beta \cdot |\ell^*(G) - \ell(o_i)|, with I(\cdot) an indicator for correct answers (a schematic implementation of this reward, together with the GRESO-style filter above, follows this list).
  • Suppression of Self-Affirmation Reflections (Liu et al., 14 Jun 2025): Post-hoc pruning of low-probability “reflection” tokens (e.g., “wait”) after correct reasoning steps achieves up to 18.7% reduction in output length without accuracy loss, deployable directly in inference frameworks.
  • Tool-Integrated Reasoning: Incorporating code execution (TIR) into chain-of-thought trajectories via supervised (and possibly RL/refinement) methods improves hard math and coding task accuracy (Moshkov et al., 23 Apr 2025).
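
The following sketch combines two of these ideas in schematic form: the SOL-shaped reward from ShorterBetter and GRESO-style skipping of prompts whose reward variance has been zero for consecutive epochs. Function names, default coefficients, and the patience threshold are illustrative assumptions.

```python
from collections import defaultdict

def sol_reward(is_correct, lengths, alpha=1.0, beta=0.001):
    """ShorterBetter-style reward: r_i = alpha*1[correct_i] - beta*|SOL - len_i|.

    SOL (sample optimal length) is the minimum length among correct completions
    in the group; if no completion is correct, only the accuracy term is used
    (a simplifying assumption of this sketch).
    """
    correct_lengths = [l for ok, l in zip(is_correct, lengths) if ok]
    if not correct_lengths:
        return [alpha * float(ok) for ok in is_correct]
    sol = min(correct_lengths)
    return [alpha * float(ok) - beta * abs(sol - l) for ok, l in zip(is_correct, lengths)]

# GRESO-style bookkeeping: once a prompt has produced zero-variance rewards
# (all-correct or all-wrong groups) for `patience` consecutive epochs, its
# rollouts can be skipped in subsequent epochs.
zero_var_streak = defaultdict(int)

def should_skip_next_epoch(prompt_id, group_rewards, patience=2):
    if len(set(group_rewards)) <= 1:          # zero variance -> uninformative group
        zero_var_streak[prompt_id] += 1
    else:
        zero_var_streak[prompt_id] = 0
    return zero_var_streak[prompt_id] >= patience

# Example usage for one group of 4 rollouts on a single prompt.
rewards = sol_reward([True, False, True, False], [220, 510, 180, 640])
skip = should_skip_next_epoch("prompt-42", rewards)
```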

4. Empirical Performance and Benchmark Results

Nemotron-Research-Reasoning-Qwen-1.5B, after extended RL (ProRL), consistently outperforms its distilled base model and scale-matched non-reasoning LLMs on diverse benchmark tasks (Liu et al., 16 Jul 2025, Yang et al., 14 May 2025, Liu et al., 30 May 2025, Li et al., 17 Jul 2025):

| Domain | Pass@1 Improvement | Notes |
|---|---|---|
| Math (AIME, etc.) | +14.7% over base model | Pass@1, sustained across multi-step and harder OOD reasoning benchmarks |
| Coding | +13.9% | Measured as pass@1 on, e.g., Codeforces competitive coding challenges |
| Logic Puzzles | +54.8% | Validated on synthetic Reasoning Gym and other generative tasks |
| STEM Reasoning | +25.1% | Benchmarked on SCP-116K, GPQA Diamond |
| Instruction Following | +22.0% | Evaluated using IFEval with strict verifiers |

Notably, with QuestA-style RL training via question augmentation, Nemotron-1.5B attains state-of-the-art results for its scale: 67.1% (AIME24), 59.5% (AIME25), and 35.5% (HMMT25), surpassing previous small-model records (Li et al., 17 Jul 2025).

Additionally, REDI-based distillation (using positive and negative traces) yields 83.1% pass@1 on MATH-500—matching or exceeding DeepSeek-R1-Distill-Qwen-1.5B using only open data (Xu et al., 30 May 2025). Reasoning vector task arithmetic allows efficient transfer of RL-induced reasoning ability across compatible models, producing consistent +4.9% to +12.3% accuracy improvements on benchmarks like GSM8K and BigBenchHard (Zbeeb et al., 1 Sep 2025).
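
A hedged sketch of this task-arithmetic transfer follows: the reasoning vector is the element-wise difference between the RL-tuned and base checkpoints, added (optionally scaled) to a compatible target model's weights. The function name and scaling factor are illustrative; the models must share parameter names and shapes.

```python
import torch

def apply_reasoning_vector(base_state, rl_state, target_state, scale=1.0):
    """Task arithmetic: target' = target + scale * (rl_tuned - base).

    All three state dicts must come from architecturally compatible models
    (identical parameter names and shapes); this is an illustrative sketch,
    not the exact published recipe.
    """
    new_state = {}
    for name, target_param in target_state.items():
        delta = rl_state[name] - base_state[name]
        new_state[name] = target_param + scale * delta
    return new_state

# Usage (illustrative):
# target_model.load_state_dict(apply_reasoning_vector(
#     base_model.state_dict(), rl_model.state_dict(), target_model.state_dict()))
```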

5. Model Robustness, Generalization, and Limitations

Several empirical phenomena characterize the boundaries and robustness of Nemotron-Research-Reasoning-Qwen-1.5B:

  • Boundary Expansion: Prolonged RL enables the model to generalize to out-of-distribution tasks that base models cannot solve at all (pass@128 = 0), and to discover new reasoning paths and strategies (documented via creativity index analysis) (Liu et al., 30 May 2025).
  • Reasoning as Discrimination vs. Generation: The model is particularly effective as a discriminator in candidate evaluation (e.g., text-to-SQL), outperforming much larger non-reasoning LLMs, but may be less effective as a solution generator compared to those larger models (Anjum, 30 Apr 2025).
  • RL Compute Scaling: Performance correlates not only with training duration and task diversity, but also with the “hardness” of base model domains—domains with lower initial competence see the most substantial gains (Liu et al., 30 May 2025, Liu et al., 16 Jul 2025).
  • Sample Efficiency: QuestA-style question augmentation increases sample efficiency for hard problems by converting sparse-reward tasks into denser ones via partial solution hints, theoretically improving convergence and generalization (Li et al., 17 Jul 2025); a minimal sketch of this augmentation follows this list.
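
A minimal sketch of this kind of question augmentation is shown below; the prompt template, line-based splitting of the reference solution, and the hint fraction are illustrative assumptions rather than the exact QuestA recipe.

```python
def augment_with_hint(question: str, reference_solution: str, hint_fraction: float = 0.5) -> str:
    """Prepend a partial solution to a hard problem to densify the reward signal.

    The model is asked to finish the reasoning from the hint; as training
    progresses, hint_fraction can be annealed toward zero.
    """
    steps = [s for s in reference_solution.split("\n") if s.strip()]
    hint = "\n".join(steps[: max(1, int(len(steps) * hint_fraction))])
    return (
        f"{question}\n\n"
        f"Partial solution (continue from here):\n{hint}\n\n"
        "Complete the solution and give the final answer."
    )
```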

Limitations: There are diminishing returns from excessively increasing the test-time compute budget or prompt context, and simply providing more tokens or schema does not always enhance performance and can even harm it through increased repetition or distraction (Anjum, 30 Apr 2025). Quantization must be handled with care: while 8-bit or 4-bit quantization of weights and activations is often lossless, more aggressive settings induce significant accuracy drops, particularly in smaller models and on complex, multi-step tasks (Liu et al., 7 Apr 2025).

6. Future Directions and Methodological Extensions

Several methodological innovations and open research avenues arise from the Nemotron-Research-Reasoning-Qwen-1.5B framework:

  • Further RL and “Boundary Expansion”: Prolonged RL offers systematic reasoning improvement, but the optimal schedules for KL resets, reward shaping, and sampling strategies require ongoing refinement as model and task scale increase (Liu et al., 30 May 2025, Liu et al., 16 Jul 2025).
  • Task Arithmetic and Reasoning Vectors: The ability to transfer reasoning ability via task vectors suggests new means of modularizing cognitive capabilities—adapting them via simple arithmetic rather than full retraining (Zbeeb et al., 1 Sep 2025).
  • Prompt and Output Compression: Compact reasoning traces (via dynamic SOL targets or reflection suppression) reduce compute overhead and may facilitate applications in low-latency or edge settings.
  • RL Efficiency and Scaling: Algorithms such as GRESO demonstrate that selective rollout and adaptive batch sizing allow scaling RL to larger and more data-intensive settings without prohibitive resource costs (Zheng et al., 2 Jun 2025).
  • Tool Integration and Hybrid Models: Continued development integrating code execution, knowledge retrieval, or hybrid architectures (e.g., Mamba-based models) is likely to extend the reasoning and efficiency frontier further (Wang et al., 14 Apr 2025).
  • Dataset Expansion: Leveraging large, diverse, and high-quality datasets (e.g., OpenMathReasoning) remains central to robust mathematical and STEM reasoning. Generative solution selection methods (e.g., GenSelect) may further drive post-processing accuracy (Moshkov et al., 23 Apr 2025).

Nemotron-Research-Reasoning-Qwen-1.5B represents a paradigmatic shift in small-scale LLM reasoning, emphasizing explicit expansion of reasoning boundaries through sophisticated RL, efficient distillation, modular transfer of reasoning ability, and output optimization. Its methodological advances and open-source release provide a foundation for continued research into autonomous, verifiable, and efficient AI reasoning systems.
