Apriel-Nemotron-15B-Thinker

Updated 11 November 2025
  • Apriel-Nemotron-15B-Thinker is a 15-billion parameter transformer that uses depth-scaling of the Mistral-Nemo architecture to achieve robust reasoning and efficient training.
  • It employs a four-stage pipeline—including upscaling, continual pre-training, supervised fine-tuning, and GRPO reinforcement learning—to optimize advanced reasoning and automation tasks.
  • The model halves memory and compute requirements relative to 32B models, enabling on-premise deployment for code generation, complex mathematics, and retrieval-augmented applications.

Apriel-Nemotron-15B-Thinker is a 15-billion parameter transformer model developed within the ServiceNow Apriel SLM series, designed to deliver enterprise-grade reasoning capabilities—spanning code, mathematics, and diverse workflow automation—while halving the memory and compute requirements typical of contemporary 32-billion parameter models. Its design, training regimen, and evaluation emphasize resource efficiency, modular upscaling, and robust performance across standard academic and industrial benchmarks (Radhakrishna et al., 13 Aug 2025).

1. Model Architecture and Upscaling

Apriel-Nemotron-15B-Thinker extends the Mistral-Nemo-Base-2407 architecture, itself a 12-billion parameter (12B) decoder-only transformer. The main architectural parameters are:

  • Hidden dimension: approximately 2048
  • Attention heads: 32
  • Feed-forward inner dimension: ~8192
  • Attention: multi-head self-attention with rotary position embeddings

The transition from 12B to 15B parameters was achieved exclusively by depth-scaling—specifically, by duplicating selected intermediate transformer layers without altering the width (i.e., hidden size or MLP expansion). This particular choice was driven by experimental observations: width-scaling increased training instability in early trials, whereas layer duplication consistently produced low initial loss and smooth convergence. The final 15B model comprises three additional duplicated transformer layers, maintaining identical per-layer structure and dimensions across the model.
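
The paper describes this upscaling recipe but does not include reference code; the following is a minimal sketch of depth-scaling by layer duplication for a Hugging Face Mistral-style causal LM. The checkpoint identifier and the choice of duplicated layer indices are illustrative assumptions.

```python
# Minimal sketch: depth-scaling a decoder-only transformer by duplicating layers.
# Assumes a Hugging Face Mistral-style causal LM; layer indices are illustrative.
import copy

import torch
from transformers import AutoModelForCausalLM


def duplicate_layers(model, indices_to_duplicate):
    """Splice deep copies of the selected decoder layers in directly after
    their originals; hidden size and MLP width are left unchanged."""
    new_layers = []
    for i, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if i in indices_to_duplicate:
            new_layers.append(copy.deepcopy(layer))  # identical weights at init
    # Re-number layer indices so KV-cache bookkeeping stays consistent.
    for idx, layer in enumerate(new_layers):
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = idx
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    return model


if __name__ == "__main__":
    base = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-Nemo-Base-2407", torch_dtype=torch.bfloat16
    )
    # Hypothetical choice of intermediate layers to duplicate.
    upscaled = duplicate_layers(base, indices_to_duplicate={15, 20, 25})
    print(upscaled.config.num_hidden_layers)
```

Because each copy starts from weights identical to its original, the upscaled model's initial loss stays close to the base model's, which is consistent with the smooth convergence reported above.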

There are no further architectural modifications relative to the base—no mixture-of-experts (MoE), sparse attention, or novel sub-blocks. Retaining the vanilla Mistral block architecture was intended to streamline checkpoint merging and model averaging throughout the subsequent training pipeline.

2. Four-Stage Training Pipeline

The Apriel-Nemotron-15B-Thinker training process comprises four major stages, each contributing distinct competencies and specialization:

Stage 1: Base Model Upscaling

  • Loss: next-token prediction (causal LM objective)
  • Dataset: ~100B tokens, sourced evenly between a Mistral-Nemo style web/text/code mixture and additional curated data (high-quality web content, StackExchange, multilingual code, technical/mathematical literature)
  • Batch size: 768 sequences of 16K tokens (~12M tokens per batch), with a fixed learning rate of $5 \times 10^{-5}$.
  • Result: the "Apriel-Nemotron-15B-Base" checkpoint serves as the foundation for downstream specialization.

Stage 2: Continual Pre-Training (CPT)

  • Objective: to enhance complex reasoning and chain-of-thought (CoT) capacities.
  • Composition: data are packed to 16K tokens with cross-document attention masking; split as 60% explicit reasoning examples, 25% chain-of-thought traces, and 15% generic replay from Stage 1.
  • Learning schedule: 68B tokens processed, AdamW optimizer with weight decay 0.1, initial LR $5 \times 10^{-5}$ following a cosine decay to $5 \times 10^{-6}$, 10% linear warmup.
  • Model merging: three evenly-spaced checkpoints are averaged to produce "Apriel-CPT".
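
The checkpoint averaging used to produce "Apriel-CPT" can be sketched as a plain element-wise mean over state dicts; the file names below are placeholders, not the authors' released tooling.

```python
# Sketch: uniform averaging of several checkpoints into one merged state dict.
# Paths are placeholders; the paper averages three evenly spaced CPT checkpoints.
import torch


def average_checkpoints(paths):
    """Load state dicts from `paths` and return their element-wise mean."""
    merged = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    return {k: v / len(paths) for k, v in merged.items()}


if __name__ == "__main__":
    merged_state = average_checkpoints(
        ["cpt_step_20000.pt", "cpt_step_40000.pt", "cpt_step_60000.pt"]
    )
    torch.save(merged_state, "apriel_cpt_merged.pt")
```

The same primitive, with non-uniform weights, covers the later merges in the pipeline (e.g., the $E = 0.5C + 0.5D$ combination in Stage 4).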

Stage 3: Supervised Fine-Tuning (SFT)

  • Objective: explicit training for reasoning traces, advanced mathematical problem-solving, function calling, retrieval-augmented generation (RAG), and formal coding tasks.
  • Datasets:
    • Balanced SFT: 1M samples over three epochs (mixed instructions, retrieval, coding, multi-turn dialogue)
    • Math SFT: 200K samples over eight epochs (emphasis on advanced/olympiad math, ≥3 generated solutions per prompt)
    • Small reasoning SFT: 15K samples, used primarily for comparative evaluations of CPT effectiveness
  • Training: sequences of up to 32K tokens, with supervised loss computed only for tokens corresponding to the "assistant" role (see the masking sketch after this list).
  • Model merging: the best checkpoints from balanced (A) and math-specialized (B) SFT are equally combined (C) for further use.
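
A minimal sketch of the assistant-only loss masking follows, assuming role spans are available as token-index ranges from the chat template; the helper and the toy tensors are illustrative, not the authors' data pipeline.

```python
# Sketch: mask SFT labels so loss is computed only on assistant-role tokens.
# The -100 label value is PyTorch CrossEntropyLoss's default ignore_index.
import torch

IGNORE_INDEX = -100


def mask_non_assistant_labels(input_ids, role_spans):
    """role_spans: list of (start, end, role) token-index spans for one sequence.
    Returns labels equal to input_ids inside assistant spans, -100 elsewhere."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end, role in role_spans:
        if role == "assistant":
            labels[start:end] = input_ids[start:end]
    return labels


# Toy usage with made-up token ids and spans.
ids = torch.tensor([101, 5, 6, 7, 102, 42, 43, 44, 45, 2])
spans = [(0, 5, "user"), (5, 10, "assistant")]
labels = mask_non_assistant_labels(ids, spans)
# Cross-entropy over the shifted logits then ignores the first five positions.
```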

Stage 4: Reinforcement Learning with Group Relative Policy Optimization (GRPO)

  • RL Algorithm: GRPO, maximizing expected reward with a KL-divergence penalty toward the SFT policy:

$$\max_{\theta}\; \mathbb{E}_{x\sim\mathcal{D},\,a\sim\pi_\theta(\cdot\mid x)}\Big[\, R(x,a) \;-\; \beta\,\mathrm{KL}\bigl[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot\mid x)\bigr] \Big]$$

  • Rewards:
    • Output formatting adherence
    • Correctness in advanced math across filtered prompts (18K kept where at least one of eight samples was correct and three were wrong)
    • Precise instruction following (14K synthetic, compositional prompts)
    • Python/JS code: proportion of test cases passed
    • Agentic tool invocation tasks (32K single-turn prompts)
  • Sampling: eight candidate completions per prompt (temperature = 1.0, top-$p$ = 0.95), batch size 512, learning rate $1 \times 10^{-6}$, KL penalty coefficient 0.001 (a sketch of the group-relative advantage computation follows this list).
  • Model merges: multiple stages (e.g., $E = 0.5C + 0.5D$), culminating in the final model as a convex combination of the last three RL checkpoints.
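
Below is a minimal sketch of the group-relative advantage computation that gives GRPO its name, assuming one scalar reward per sampled completion and the standard per-group mean/std normalization; this follows the common GRPO formulation rather than the authors' exact implementation.

```python
# Sketch: group-relative advantages for GRPO, assuming 8 sampled completions
# per prompt and a scalar reward per completion (format, correctness, tests...).
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size) scalar rewards.
    Returns advantages normalized within each prompt's group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Toy example: 2 prompts x 8 completions, rewards in [0, 1].
rewards = torch.tensor([
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    [0.2, 0.4, 0.9, 0.1, 0.5, 0.3, 0.8, 0.6],
])
adv = group_relative_advantages(rewards)
# The policy-gradient loss weights each completion's log-probabilities by its
# advantage, with a KL penalty (coefficient 0.001) toward the SFT policy.
```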

3. Memory Footprint and Computational Efficiency

The model's central design goal is a significant reduction in resource consumption relative to 30–32B parameter baselines. Empirical measurements demonstrate:

  • Apriel-15B supports inference and training on a single NVIDIA H100 (80GB) or dual consumer-grade GPUs (2 × 48GB), while comparable 32B models require at least dual 80GB H100s or advanced sharding.
  • Memory usage (fp16, optimizer, activations):
    • Apriel-15B: approximately 90GB peak, reduced to 60GB with ZeRO-1 or data parallelism
    • 32B models: typically ~180GB, requiring specialized hardware or heavy pipeline parallelism

| Model | Parameters | Raw fp16 Weights | Total Memory (with optimizer) | Typical Inference GPU |
|---|---|---|---|---|
| O1-mini (est.) | 30B | 60 GB | ~120 GB | 2 × 80 GB |
| QWQ-32B | 32B | 64 GB | ~130 GB | 2 × 80 GB |
| ExaOne-Deep-32B | 32B | 64 GB | ~130 GB | 2 × 80 GB |
| Apriel-Nemotron-15B | 15B | 30 GB | ~60 GB | 1 × 80 GB or 2 × 48 GB |

This resource profile enables deployments in environments where larger models are infeasible, such as on-premise or air-gapped enterprise settings.
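
For rough capacity planning, the fp16 weight figures in the table above follow from a two-bytes-per-parameter estimate; the runtime overhead multiplier in the sketch below is an illustrative assumption, not a measurement from the paper.

```python
# Sketch: back-of-envelope GPU memory estimate for fp16 inference.
# The overhead factor is an illustrative assumption, not a measured value.

def fp16_weight_gb(num_params: float) -> float:
    """fp16 stores 2 bytes per parameter."""
    return num_params * 2 / 1e9


def rough_inference_gb(num_params: float, overhead: float = 1.3) -> float:
    """Weights plus a rough multiplier for KV cache and activations."""
    return fp16_weight_gb(num_params) * overhead


for name, n in [("Apriel-Nemotron-15B", 15e9), ("QWQ-32B", 32e9)]:
    print(f"{name}: ~{fp16_weight_gb(n):.0f} GB weights, "
          f"~{rough_inference_gb(n):.0f} GB with runtime overhead")
```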

4. Empirical Benchmarking and Performance

Apriel-Nemotron-15B-Thinker was evaluated under zero-shot settings (sampling temperature T=0.6, max 32K tokens) across a broad suite of enterprise and academic reasoning benchmarks. Key results include:

Enterprise Benchmarks

| Benchmark | Flash-3 | Nano-8B | QWQ-32B | O1-mini | ExaOne-32B | Apriel-15B |
|---|---|---|---|---|---|---|
| MBPP (pass@1) | 80.2% | 73.8% | 88.2% | 93.1% | 76.8% | 85.8% |
| BFCL-live-V2 | 77.4% | 54.2% | 79.0% | 81.0% | 75.4% | 75.4% |
| Enterprise RAG | 57.9% | 11.1% | 65.2% | 66.5% | 52.1% | 69.2% |
| MT-Bench | 8.43 | 7.43 | 8.46 | 8.38 | 8.39 | 8.57 |
| MixEval | 79.1% | 62.1% | 77.3% | 82.9% | 80.6% | 82.8% |
| IFEval | 81.6% | 69.8% | 82.8% | 79.5% | 83.1% | 84.6% |
| MultiChallenge | 24.5% | 16.1% | 37.7% | 30.8% | 38.5% | 36.6% |

Apriel-15B leads on MT-Bench and IFEval, is second on MBPP and MixEval, and is competitive on other benchmarks relative to 32B models.

Academic Reasoning Benchmarks

| Benchmark | Flash-3 | Nano-8B | QWQ-32B | O1-mini | ExaOne-32B | Distill-QWQ-32B | Apriel-15B |
|---|---|---|---|---|---|---|---|
| GPQA-Diamond | 53.5% | 54.0% | 66.7% | 60.0% | 65.2% | 71.5% | 57.4% |
| MATH-500 | 84.0% | 89.6% | 90.8% | 90.0% | 91.6% | 97.5% | 91.6% |
| AIME’24 | 38.7% | 62.0% | 81.3% | 63.6% | 76.0% | 79.8% | 73.3% |
| AIME’25 | 25.3% | 48.7% | 68.7% | 54.8% | 64.7% | 66.8% | 60.0% |
| MMLU-Pro | 66.9% | 61.5% | 79.0% | 80.3% | 73.9% | 84.0% | 73.4% |
| AMC23 | 83.1% | 93.5% | 98.5% | 92.5% | 95.0% | 99.0% | 95.0% |
| LiveCodeBench | 44.8% | 53.2% | 65.9% | 53.8% | 62.4% | 65.9% | 54.6% |

Apriel-15B demonstrates strong performance on math-intensive tasks (MATH-500, AMC23) and is competitive on AIME challenges, while lagging the largest distillation models on code execution.

Token Efficiency

On problem sets such as AIME-24, AIME-25, GPQA-Diamond, and MATH-500, Apriel-15B uses 8–10K tokens for reasoning ("thinking") per instance, whereas comparably performing 32B models consume 13–20K tokens per instance—a 30–50% reduction in token usage.

5. Ablation Studies and Training Regimen Analysis

Detailed ablation experiments delineate the effect of each stage in the training pipeline:

Upscaling vs. Training from Scratch

Layer duplication and pretraining on ~100B tokens deliver 5–10% absolute gains over the 12B Mistral-Nemo baseline on GSM8K, HumanEval, BBH, and MMLU.

Continual Pre-Training (CPT) Impact

| Benchmark | Before CPT | After CPT | Change |
|---|---|---|---|
| ARC | 68.5% | 65.3% | −3.2 |
| GSM8K | 78.5% | 82.0% | +3.5 |
| HumanEval | 85.3% | 83.5% | −1.8 |
| BBH | 57.5% | 57.2% | −0.3 |
| GPQA | 31.6% | 32.7% | +1.1 |
| Minerva Math | 45.3% | 49.2% | +3.9 |
| IF-Eval | 22.4% | 36.8% | +14.4 |
| Average | 57.3% | 58.1% | +0.8 |

CPT produces consistent gains for math/logic tasks (Minerva Math, GSM8K, IF-Eval), while causing minor regressions on some common-sense tasks.

SFT on CPT vs. Pre-CPT

| Benchmark | SFT before CPT | SFT after CPT | Change |
|---|---|---|---|
| GPQA Diamond | 37.2% | 46.5% | +9.3 |
| MATH-500 | 80.4% | 90.8% | +10.4 |
| AIME-24 | 16.0% | 58.0% | +42.0 |
| AIME-25 | 18.4% | 46.0% | +27.6 |
| AMC23 | 59.5% | 96.0% | +36.5 |

This demonstrates that CPT can substantially enhance the effectiveness of SFT, especially for advanced mathematical reasoning.

6. Implications, Use Cases, and Open Questions

Apriel-Nemotron-15B-Thinker’s halved resource requirements vis-à-vis mainstream 32B models enable practical deployment on a single H100 or affordable dual-GPU configurations. This reduction directly translates into decreased inference latency, lower cost, and more accessible on-premise and air-gapped deployments, with particular suitability for secure retrieval-augmented generation (RAG), automated workflow orchestration, and code generation scenarios.

Despite strong performance across reasoning and mathematical tasks, there are persistent limitations, including:

  • Inferior code execution ability on benchmarks such as LiveCodeBench (54.6%) compared to distilled 32B models (>65%)
  • Absence of width scaling or sparse/MoE layers, which may yield additional gains at the 15B parameter scale
  • RL rewards remain hand-crafted; use of learned preference models may further improve instruction following and generative alignment
  • Validation of quantized inference for reduced-precision (8-bit, 4-bit) operation remains incomplete

A plausible implication is that future research could address these limitations by exploring width scaling stability, automated reward design, and quantized inference strategies, potentially advancing the efficiency frontier established by Apriel-Nemotron-15B-Thinker.

References

Radhakrishna et al., "Apriel-Nemotron-15B-Thinker," 13 Aug 2025.