
JustRL-DeepSeek-1.5B: Distilled Math Reasoning Model

Updated 22 December 2025
  • JustRL-DeepSeek-1.5B is a 1.5B-parameter language model designed for advanced mathematical and logical reasoning, distilled from the multi-stage RL-aligned DeepSeek-R1 teacher.
  • It employs a single-stage RL distillation using the GRPO objective, preserving stepwise chain-of-thought reasoning while reducing compute and latency demands.
  • Empirical benchmarks show competitive performance on math and discrimination tasks, with efficient deployment on secure computing platforms and low-resource hardware.

JustRL-DeepSeek-1.5B is a 1.5-billion-parameter open-source LLM designed for advanced mathematical and logical reasoning, distilled from the multi-stage RL-aligned DeepSeek-R1 teacher. It serves as an efficient, open, and reproducible baseline for mathematical reasoning and discrimination tasks, particularly in contexts where compute and latency constraints are critical. Developed on the Qwen2.5-Math-1.5B backbone, JustRL-DeepSeek-1.5B exemplifies the distillation of large-scale RL behaviors into a compact, performant student, achieving competitive benchmark results with a streamlined training pipeline (DeepSeek-AI et al., 22 Jan 2025, He et al., 18 Dec 2025).

1. Model Architecture and Distillation Procedure

JustRL-DeepSeek-1.5B is a decoder-only transformer with 24 layers, 16-headed attention, rotary positional encodings, and a 32,000-token BPE vocabulary. The architecture mirrors Qwen2.5-Math-1.5B: pre-layer normalization, standard transformer blocks, and no modifications to attention or tokenization procedures ensure full compatibility with Qwen-based toolchains. All weights are full-precision, and the model uses a standard next-token cross-entropy output head (DeepSeek-AI et al., 22 Jan 2025, He et al., 18 Dec 2025).
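For concreteness, the stated architecture can be collected into a small configuration object. This is only an illustrative sketch; the field names below are assumptions and do not come from a released config file.

```python
from dataclasses import dataclass

@dataclass
class JustRLDeepSeekConfig:
    """Architecture hyperparameters as reported above (field names are illustrative)."""
    num_layers: int = 24                  # decoder-only transformer blocks
    num_attention_heads: int = 16         # multi-head attention
    vocab_size: int = 32_000              # BPE vocabulary
    positional_encoding: str = "rotary"   # RoPE, as in Qwen2.5-Math-1.5B
    norm_style: str = "pre_layernorm"     # pre-layer normalization
    output_head: str = "next_token_cross_entropy"

config = JustRLDeepSeekConfig()
```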

The student model is produced via supervised fine-tuning (SFT) on ∼800,000 trajectories—∼600,000 high-quality reasoning chain-of-thoughts (CoTs) generated by the RL-trained DeepSeek-R1 teacher and ∼200,000 general SFT samples (QA, writing, etc.). No RL or value-regularization is performed at student scale. Distillation is strict maximum-likelihood: the student mimics full multi-step reasoning traces of the teacher, preserving deliberative CoT structure and stepwise logic (DeepSeek-AI et al., 22 Jan 2025).
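A minimal sketch of this strict maximum-likelihood distillation step, assuming a Hugging Face-style causal LM, a tokenizer, and an optimizer supplied by the caller; all names are illustrative:

```python
import torch

def distillation_step(student, tokenizer, prompt, teacher_trace, optimizer, device="cuda"):
    """One SFT step: the student maximizes the likelihood of the teacher's full CoT trace.

    No RL, KL, or value-regularization terms are used, just next-token cross-entropy
    over the concatenated (prompt + teacher reasoning trace) sequence. In practice the
    prompt tokens may be masked out of the loss; that refinement is omitted here.
    """
    text = prompt + teacher_trace
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    # Standard causal-LM objective: HF models compute the shifted cross-entropy internally.
    outputs = student(input_ids=ids, labels=ids)
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```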

2. Teacher RL Training and Distillation Rationale

The DeepSeek-R1 teacher employs a multi-stage RL pipeline:

  1. Cold-start SFT: Initial fine-tuning using ∼3,000 human-edited CoT examples for coherent, readable traces.
  2. First RL pass: Group Relative Policy Optimization (GRPO; a PPO-style clipped-policy method) optimized for reasoning accuracy (automated checker) and reasoning format. Groupwise normalized advantages and a symmetrized KL penalty promote exploration and consistency while preventing collapse:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\}} \left[ \frac{1}{G} \sum_{i} L_{\mathrm{CLIP}}(\theta) - \beta\, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]
$$

with

$$
L_{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\left(r_t(\theta)\, A_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right) \right]
$$

where $A_t$ is a per-group normalized advantage, $\epsilon = 0.2$, $\beta \approx 0.05$, and $G = 8$–$16$ (DeepSeek-AI et al., 22 Jan 2025). A minimal code sketch of this objective follows the list.

  3. Rejection-sampling SFT: Generation and curation of a large CoT dataset by sampling multiple RL trajectories per prompt, discarding incorrect or unreadable traces, then SFT for two epochs.
  4. Second RL pass: RL with a mixture of rule-based accuracy rewards (for reasoning) and learned preference rewards (helpfulness, harmlessness) for broader alignment.
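As a minimal PyTorch-style sketch of the GRPO objective above, assuming $G$ sampled completions per prompt with scalar rewards and precomputed per-token log-probabilities (tensor names, shapes, and the particular KL estimator are illustrative assumptions, not the teacher's exact implementation):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.05):
    """GRPO surrogate for one prompt group.

    logp_new, logp_old, logp_ref: (G, T) per-token log-probs under the current,
        behavior, and frozen reference policies (padding assumed handled upstream).
    rewards: (G,) scalar reward per sampled completion.
    """
    # Group-relative advantage: normalize rewards within the group of G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(-1)                                     # broadcast over tokens

    # Clipped importance-weighted policy-gradient term (PPO-style).
    ratio = torch.exp(logp_new - logp_old)                      # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    policy_term = torch.min(unclipped, clipped).mean()

    # Per-token KL penalty toward the reference policy (one common estimator;
    # the exact form used in the teacher pipeline may differ).
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1).mean()

    # The objective is maximized, so return its negation as a loss to minimize.
    return -(policy_term - beta * kl)
```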

Student distillation then transfers only outputs—no further RL, KL, or MSE losses—allowing the student to acquire RL-originated reasoning skills through pure imitation (DeepSeek-AI et al., 22 Jan 2025).

3. JustRL: Minimal Single-Stage RL Recipe

The JustRL research program demonstrates that, for 1.5B-parameter LLMs, a single-stage RL recipe with fixed hyperparameters suffices to match or outperform complex, multi-phase pipelines. The RL procedure uses the Group Relative Policy Optimization (GRPO) objective, but omits explicit length penalties, entropy bonuses, KL terms, or curriculum scheduling (He et al., 18 Dec 2025).

Objective:

  • The model generates sampled rollouts for each prompt, with only a terminal (episodic) binary reward; no shaping, no intermediate rewards.
  • The clipped policy-gradient loss is:

$$
L_{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\left(r_t(\theta)\, A_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right) \right]
$$

with $A_t = r(\tau) - b$, $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_\theta^{\text{old}}(a_t \mid s_t)$, and $\epsilon = 0.28$ (He et al., 18 Dec 2025). A short sketch of the advantage computation follows.
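A small sketch of the advantage computation above: a binary terminal reward from the rule-based checker and a simple baseline $b$ subtracted per prompt. Treating $b$ as the per-prompt mean reward over the rollout group is an illustrative assumption.

```python
import torch

def justrl_advantages(correct):
    """Compute A = r(tau) - b for a batch of rollouts with binary terminal reward.

    correct: (num_prompts, rollouts_per_prompt) bool tensor, True if the rule-based
        checker accepts the final answer. Reward is 1.0 for correct, 0.0 otherwise;
        there is no shaping and no intermediate reward.
    """
    rewards = correct.float()                        # (P, G) terminal rewards
    baseline = rewards.mean(dim=1, keepdim=True)     # per-prompt baseline b (assumed: group mean)
    advantages = rewards - baseline                  # A = r(tau) - b, shared across all tokens
    return advantages
```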

Protocol:

  • Batch size: 256 prompts; 8 rollouts per prompt.
  • Optimizer: AdamW, LR=1e-6.
  • No auxiliary regularization or verifier; reward is from the DAPO rule-based checker.
  • Hardware: 32×A800-80 GB GPUs; approx. 11,500 GPU hours for training.
  • Token budget: $1.4 \times 10^8$ tokens.
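The protocol above can be captured in a single static configuration; this is a sketch only, and the key names are illustrative rather than tied to any particular training framework:

```python
# Fixed single-stage RL configuration reported above; no schedules or curricula.
JUSTRL_CONFIG = {
    "algorithm": "GRPO",             # clipped policy-gradient with group-relative advantages
    "prompts_per_batch": 256,
    "rollouts_per_prompt": 8,
    "optimizer": "AdamW",
    "learning_rate": 1e-6,
    "clip_epsilon": 0.28,
    "kl_coefficient": 0.0,           # no KL term
    "entropy_bonus": 0.0,            # no entropy bonus
    "length_penalty": None,          # no length shaping
    "reward": "DAPO rule-based checker (binary, terminal)",
    "hardware": "32x A800-80GB",
    "approx_gpu_hours": 11_500,
}
```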

Hyperparameters generalize: identical values work for both DeepSeek and Nemotron 1.5B models without tuning; policy entropy consistently remains in $[1.2, 1.4]$ throughout (He et al., 18 Dec 2025).

4. Empirical Results and Discrimination Capabilities

JustRL-DeepSeek-1.5B achieves state-of-the-art performance on mathematical reasoning and discrimination tasks for models in its parameter class. Key results include:

| Model | AIME24 | AIME25 | AMC23 | MATH | Minerva | Olympiad | HMMT | BRUMO | CMIMC | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-1.5B (init) | 29.90 | 22.40 | 63.82 | 84.90 | 34.65 | 45.95 | 13.44 | 30.94 | 12.89 | 37.65 |
| DeepScaleR | 40.21 | 28.65 | 73.83 | 89.30 | 39.34 | 52.79 | 18.96 | 40.00 | 21.00 | 44.88 |
| ProRL-V2 | 51.87 | 35.73 | 88.75 | 92.00 | 49.03 | 67.84 | 19.38 | 47.29 | 25.86 | 53.08 |
| JustRL-DeepSeek | 52.60 | 38.75 | 91.02 | 91.65 | 51.47 | 67.99 | 21.98 | 52.71 | 25.63 | 54.87 |

The model leads on six of nine benchmarks, outperforming prior methods (including ProRL-V2 and DeepScaleR) while using approximately half the compute budget (He et al., 18 Dec 2025). Notably, on AIME24, AIME25, and Minerva, it shows substantial uplifts.

For discrimination/planning roles, JustRL-DeepSeek-1.5B (and related DeepSeek-R1-1.5B variants) outperforms non-reasoning LLMs of much larger scale. On text-to-SQL planning, it yields up to 58% higher macro F1 and a 3.7% improvement in execution accuracy versus CodeLlama-13B (Anjum, 30 Apr 2025). CoT-style reasoning enables robust evaluation of candidate solutions, with the CoT score reading out only the log-probability of the final verdict token:

$$
S(\tau) = \sum_{i=1}^{N} \alpha_i \log p(x_i \mid x_{<i}),
$$

where only $\alpha_N = 1$, i.e., all weight is placed on the final verdict token and $\alpha_i = 0$ elsewhere (Anjum, 30 Apr 2025).
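A minimal sketch of this verdict-only scoring, assuming a Hugging Face-style model and tokenizer and a prompt that ends with a single verdict token; the prompt template and verdict wording are illustrative assumptions, not taken from the cited work:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def verdict_score(model, tokenizer, question, candidate, device="cuda"):
    """Score a candidate solution by the log-probability of the final verdict token.

    The model is prompted to reason step by step and end with a one-word verdict;
    intermediate CoT tokens receive weight alpha_i = 0, the last token alpha_N = 1.
    (If generation ends with an EOS token, the verdict position would need adjusting;
    that refinement is omitted here.)
    """
    prompt = (
        f"Question: {question}\n"
        f"Candidate solution: {candidate}\n"
        "Reason step by step, then answer with a single word: Correct or Incorrect.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Log-probability of the last generated token (the verdict): S(tau) with alpha_N = 1.
    last_logits = out.scores[-1][0]        # one logits tensor per generated step
    last_token_id = out.sequences[0, -1]
    return F.log_softmax(last_logits, dim=-1)[last_token_id].item()
```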

5. Training Stability, Ablations, and Practical Insights

JustRL-DeepSeek-1.5B exhibits monotonic, stable improvement over the course of training, with no need for dynamic hyperparameter schedules or intervention. Policy entropy remains stable, mean episodic reward rises smoothly, and response length naturally compresses in early training—indicating efficient learning without explicit length penalties (He et al., 18 Dec 2025).

Ablation studies demonstrate that standard interventions may be counterproductive at this scale:

  • Length penalties: Adding token-length penalties collapses exploration, with entropy dropping to ~0.6 and accuracy plateauing below baseline.
  • Robust verifiers: Use of stricter verifiers results in lower learning signal, reduced entropy, and significantly lower accuracy.

A plausible implication is that the additional complexity in multi-stage RL pipelines is unnecessary—or even detrimental—if a stable single-stage RL baseline is established and appropriately scaled (He et al., 18 Dec 2025).

6. Confidential Inference and Efficiency

JustRL-DeepSeek-1.5B is efficient to deploy and particularly well suited to confidential computing scenarios:

  • Confidential computing: In secure TDX enclaves, DeepSeek-R1-1.5B achieves 25.7 tokens/s (38.9 ms/token), surpassing CPU-only Docker (10.3 tokens/s) while providing strong confidentiality guarantees. For workloads under 2B parameters, TDX typically offers both security and a modest throughput boost; with larger models, memory-bound execution reverses this advantage (Dong et al., 17 Feb 2025).
  • Deployment profiles: With 4-bit quantization the model fits in under 4 GB of VRAM, enabling deployment on hardware far smaller than an 80 GB A100 or a 40 GB card; end-to-end 512-token CoTs are generated in ~200 ms (DeepSeek-AI et al., 22 Jan 2025). A loading sketch follows this list.
  • Optimal usage: For highly sensitive data, run inference inside a TDX enclave; for lowest latency, use GPU outside enclave boundaries with encrypted I/O. For best throughput, batch and quantize aggressively.
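A loading sketch for the 4-bit deployment profile above, assuming a Hugging Face checkpoint and the bitsandbytes quantization path; the repository id shown is a placeholder, not a confirmed release name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "org/JustRL-DeepSeek-1.5B"   # placeholder id; substitute the actual checkpoint

# 4-bit NF4 quantization keeps the 1.5B model well under 4 GB of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Solve: If 3x + 5 = 20, what is x? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```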

Best practices include allocating ≥32 vCPUs and ≥100 GB RAM for TDX, sizing batches to amortize TEE transition overhead, memory pinning, and thread-affinity tuning. Hybrid secure CPU/GPU workflows are recommended once secure GPU virtualization matures (Dong et al., 17 Feb 2025).

7. Roles, Limitations, and Open Directions

JustRL-DeepSeek-1.5B excels as a discriminator (evaluator/ranker) in agentic and planning systems, surpassing much larger non-reasoning models in candidate selection, but underperforms as a generator—mirroring empirical findings that evaluation is easier than synthesis for compact reasoning LLMs. Increasing compute or context windows shows sharply diminishing returns. Further, since the student is distilled (not RL-fine-tuned), teacher errors may propagate, and code performance remains below larger Qwen variants (Anjum, 30 Apr 2025, DeepSeek-AI et al., 22 Jan 2025).

Broadly, JustRL-DeepSeek-1.5B establishes that robust mathematical-reasoning LLMs can be efficiently realized at small scale using distilled RL traces, and that complexity in RL training pipelines can be substantially reduced without sacrificing accuracy or diversity (He et al., 18 Dec 2025, DeepSeek-AI et al., 22 Jan 2025). Compact RL-distilled architectures, efficient confidential inference, and dynamic reward strategies for tool-integrated scenarios remain open research directions.
