Apriel-Nemotron-15B-Thinker
- Apriel-Nemotron-15B-Thinker is a 15-billion parameter transformer that uses depth-scaling of the Mistral-Nemo architecture to achieve robust reasoning and efficient training.
- It employs a four-stage pipeline of base-model upscaling, continual pre-training, supervised fine-tuning, and GRPO reinforcement learning to optimize advanced reasoning and automation tasks.
- The model halves memory and compute requirements relative to 32B models, enabling on-premise deployment for code generation, complex mathematics, and retrieval-augmented applications.
Apriel-Nemotron-15B-Thinker is a 15-billion parameter transformer model developed within the ServiceNow Apriel SLM series, designed to deliver enterprise-grade reasoning capabilities—spanning code, mathematics, and diverse workflow automation—while halving the memory and compute requirements typical of contemporary 32-billion parameter models. Its design, training regimen, and evaluation emphasize resource efficiency, modular upscaling, and robust performance across standard academic and industrial benchmarks (Radhakrishna et al., 13 Aug 2025).
1. Model Architecture and Upscaling
Apriel-Nemotron-15B-Thinker extends the Mistral-Nemo-Base-2407 architecture, itself a 12-billion-parameter (12B) decoder-only transformer. The main architectural parameters are:
- Hidden dimension: 5120
- Attention heads: 32
- Feed-forward inner dimension: 14336
- Attention: multi-head self-attention with rotary position embeddings
The transition from 12B to 15B parameters was achieved exclusively by depth-scaling—specifically, by duplicating selected intermediate transformer layers without altering the width (i.e., hidden size or MLP expansion). This particular choice was driven by experimental observations: width-scaling increased training instability in early trials, whereas layer duplication consistently produced low initial loss and smooth convergence. The final 15B model comprises three additional duplicated transformer layers, maintaining identical per-layer structure and dimensions across the model.
There are no further architectural modifications relative to the base—no mixture-of-experts (MoE), sparse attention, or novel sub-blocks. Retaining the vanilla Mistral block architecture was intended to streamline checkpoint merging and model averaging throughout the subsequent training pipeline.
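As an illustration of this depth-upscaling strategy, the sketch below duplicates selected decoder layers of a Hugging Face causal LM while leaving layer width untouched. The duplicated layer indices are hypothetical placeholders; the paper does not specify which layers were copied, and this is not the authors' actual upscaling code.

```python
# Minimal sketch of depth-upscaling by layer duplication (hypothetical layer
# indices; not the published Apriel upscaling recipe).
import copy

import torch
from transformers import AutoModelForCausalLM


def upscale_by_layer_duplication(model, layers_to_duplicate):
    """Duplicate selected decoder layers in place, keeping layer width unchanged."""
    new_layers = []
    for idx, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if idx in layers_to_duplicate:
            # The copy starts from the same weights but is trained independently
            # during continual pre-training.
            new_layers.append(copy.deepcopy(layer))
    for i, layer in enumerate(new_layers):
        # Re-index attention modules so KV-cache bookkeeping stays consistent.
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = i
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    return model


base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Base-2407", torch_dtype=torch.bfloat16
)
upscaled = upscale_by_layer_duplication(base, layers_to_duplicate={12, 20, 28})
```

Because each duplicate starts from the weights of its source layer, the upscaled network begins training close to the base model's loss, which matches the smooth-convergence behavior reported above.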
2. Four-Stage Training Pipeline
The Apriel-Nemotron-15B-Thinker training process comprises four major stages, each contributing distinct competencies and specialization:
Stage 1: Base Model Upscaling
- Loss: next-token prediction (causal language modeling objective)
- Dataset: ~100B tokens, sourced evenly between a Mistral-Nemo style web/text/code mixture and additional curated data (high-quality web content, StackExchange, multilingual code, technical/mathematical literature)
- Batch size: 768 sequences of 16K tokens each (~12M tokens per batch; see the arithmetic sketched after this list), trained with a fixed learning rate.
- Result: the "Apriel-Nemotron-15B-Base" checkpoint serves as the foundation for downstream specialization.
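As referenced above, a rough consistency check of the batch geometry (assuming "16K" denotes 16,384-token sequences) recovers the stated ~12M tokens per batch and implies on the order of 8,000 optimizer steps for the ~100B-token budget:

```latex
% Tokens per optimizer step and the implied step count for the ~100B-token budget.
\[
768 \times 16{,}384 \approx 1.26\times 10^{7}\ \text{tokens/step},
\qquad
\frac{10^{11}\ \text{tokens}}{1.26\times 10^{7}\ \text{tokens/step}} \approx 8{,}000\ \text{steps}
\]
```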
Stage 2: Continual Pre-Training (CPT)
- Objective: to enhance complex reasoning and chain-of-thought (CoT) capacities.
- Composition: data are packed to 16K tokens with cross-document attention masking; split as 60% explicit reasoning examples, 25% chain-of-thought traces, and 15% generic replay from Stage 1.
- Learning schedule: 68B tokens processed with the AdamW optimizer (weight decay 0.1), a cosine learning-rate decay, and 10% linear warmup.
- Model merging: three evenly-spaced checkpoints are averaged to produce "Apriel-CPT".
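Uniform checkpoint averaging of this kind reduces to a parameter-wise mean over the selected state dicts. The sketch below is a generic illustration with placeholder file names, not the merging utility used to produce Apriel-CPT.

```python
# Minimal sketch of uniform checkpoint averaging ("model merging"); file names
# are placeholders, not the actual Apriel-CPT artifacts.
import torch


def average_checkpoints(paths):
    """Return a state dict whose tensors are the element-wise mean of the inputs."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }


merged = average_checkpoints(
    ["cpt_checkpoint_a.pt", "cpt_checkpoint_b.pt", "cpt_checkpoint_c.pt"]
)
torch.save(merged, "apriel_cpt_merged.pt")
```

The same parameter-wise averaging underlies the SFT and RL merges described in the later stages.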
Stage 3: Supervised Fine-Tuning (SFT)
- Objective: explicit training for reasoning traces, advanced mathematical problem-solving, function calling, retrieval-augmented generation (RAG), and formal coding tasks.
- Datasets:
- Balanced SFT: 1M samples over three epochs (mixed instructions, retrieval, coding, multi-turn dialogue)
- Math SFT: 200K samples over eight epochs (emphasis on advanced/olympiad math, ≥3 generated solutions per prompt)
- Small reasoning SFT: 15K samples, used primarily for comparative evaluations of CPT effectiveness
- Training: sequences of up to 32K tokens, with supervised loss computed only for tokens corresponding to the "assistant" role (see the masking sketch after this list).
- Model merging: the best checkpoints from balanced (A) and math-specialized (B) SFT are equally combined (C) for further use.
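The assistant-only supervision noted in the training bullet above is commonly implemented by setting labels outside assistant spans to the cross-entropy ignore index. The sketch below assumes such a masking scheme and a precomputed boolean mask of assistant-role tokens; it is not the exact Apriel training code.

```python
# Minimal sketch of assistant-only loss masking for SFT (assumed implementation;
# producing the assistant-token mask depends on the chat template in use).
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # labels with this value are skipped by cross_entropy


def sft_loss(logits, labels, assistant_mask):
    """Cross-entropy over next-token predictions, restricted to assistant tokens.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) token ids;
    assistant_mask: (batch, seq_len) bool, True where the token belongs to an
    assistant turn.
    """
    labels = labels.masked_fill(~assistant_mask, IGNORE_INDEX)
    # Standard causal shift: position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```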
Stage 4: Reinforcement Learning with Group Relative Policy Optimization (GRPO)
- RL Algorithm: GRPO, optimizing the reward-augmented log-likelihood penalized by KL divergence from the SFT prior (formalized in the sketch after this list).
- Rewards:
- Output formatting adherence
- Correctness in advanced math across filtered prompts (18K kept where at least one of eight samples was correct and three were wrong)
- Precise instruction following (14K synthetic, compositional prompts)
- Python/JS code: proportion of test cases passed
- Agentic tool invocation tasks (32K single-turn prompts)
- Sampling: eight candidate completions per prompt (temperature = 1.0, top-p = 0.95), with a batch size of 512 and a KL penalty coefficient of 0.001.
- Model merges: performed at multiple stages, culminating in a final model formed as a convex combination of the last three RL checkpoints.
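The GRPO objective referenced at the top of this list can be written in its standard form as a group-relative advantage combined with a clipped policy-ratio term and a KL penalty toward the SFT policy. The notation below is the common formulation (with G = 8 samples and β = 0.001 as listed above), not a verbatim reproduction of the paper's equations.

```latex
% Group-relative advantage over G sampled completions per prompt; clipped
% policy-gradient term with a KL penalty toward the SFT reference policy
% (G = 8 and beta = 0.001 in the setup described above).
\[
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
\]
\[
\mathcal{J}(\theta) =
\mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^{G}
\min\bigl(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\bigr) \right]
\;-\; \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\bigr)
\]
```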
3. Memory Footprint and Computational Efficiency
The model's central design goal is a significant reduction in resource consumption relative to 30–32B parameter baselines. Empirical measurements demonstrate:
- Apriel-15B supports inference and training on a single NVIDIA H100 (80GB) or dual consumer-grade GPUs (2 × 48GB), while comparable 32B models require at least dual 80GB H100s or advanced sharding.
- Memory usage (fp16, optimizer, activations):
- Apriel-15B: approximately 90GB peak, reduced to 60GB with ZeRO-1 or data parallelism
- 32B models: typically ~180GB, requiring specialized hardware or heavy pipeline parallelism
| Model | Parameters | Raw fp16 Weights | Total Memory (with optimizer) | Typical Inference GPU |
|---|---|---|---|---|
| O1-mini (est.) | 30 B | 60 GB | ~120 GB | 2 × 80 GB |
| QWQ-32B | 32 B | 64 GB | ~130 GB | 2 × 80 GB |
| ExaOne-Deep-32B | 32 B | 64 GB | ~130 GB | 2 × 80 GB |
| Apriel-Nemotron-15B | 15 B | 30 GB | ~60 GB | 1 × 80 GB or 2 × 48 GB |
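The raw fp16 column follows directly from two bytes per parameter; the figures below are a rough consistency check of the table (the reported totals are roughly twice the raw weights), not additional measurements from the paper.

```latex
% Raw fp16 weight footprint at 2 bytes per parameter; reported totals in the
% table are roughly twice the raw weights (approximation, not a measurement).
\[
15\times 10^{9}\ \text{params} \times 2\ \tfrac{\text{bytes}}{\text{param}} = 30\ \text{GB},
\qquad
\text{reported total} \approx 2 \times 30\ \text{GB} \approx 60\ \text{GB}
\]
```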
This resource profile enables deployments in environments where larger models are infeasible, such as on-premise or air-gapped enterprise settings.
4. Empirical Benchmarking and Performance
Apriel-Nemotron-15B-Thinker was evaluated under zero-shot settings (sampling temperature T=0.6, max 32K tokens) across a broad suite of enterprise and academic reasoning benchmarks. Key results include:
Enterprise Benchmarks
| Benchmark | Flash-3 | Nano-8B | QWQ-32B | O1-mini | ExaOne-32B | Apriel-15B |
|---|---|---|---|---|---|---|
| MBPP (pass@1) | 80.2% | 73.8% | 88.2% | 93.1% | 76.8% | 85.8% |
| BFCL-live-V2 | 77.4% | 54.2% | 79.0% | 81.0% | 75.4% | 75.4% |
| Enterprise RAG | 57.9% | 11.1% | 65.2% | 66.5% | 52.1% | 69.2% |
| MT-Bench | 8.43 | 7.43 | 8.46 | 8.38 | 8.39 | 8.57 |
| MixEval | 79.1% | 62.1% | 77.3% | 82.9% | 80.6% | 82.8% |
| IFEval | 81.6% | 69.8% | 82.8% | 79.5% | 83.1% | 84.6% |
| MultiChallenge | 24.5% | 16.1% | 37.7% | 30.8% | 38.5% | 36.6% |
Apriel-15B leads on Enterprise RAG, MT-Bench, and IFEval, places second on MixEval, and remains competitive with the 32B models on MBPP, BFCL, and MultiChallenge.
Academic Reasoning Benchmarks
| Benchmark | Flash-3 | Nano-8B | QWQ-32B | O1-mini | ExaOne-32B | Distill-QWQ-32B | Apriel-15B |
|---|---|---|---|---|---|---|---|
| GPQA-Diamond | 53.5% | 54.0% | 66.7% | 60.0% | 65.2% | 71.5% | 57.4% |
| MATH-500 | 84.0% | 89.6% | 90.8% | 90.0% | 91.6% | 97.5% | 91.6% |
| AIME’24 | 38.7% | 62.0% | 81.3% | 63.6% | 76.0% | 79.8% | 73.3% |
| AIME’25 | 25.3% | 48.7% | 68.7% | 54.8% | 64.7% | 66.8% | 60.0% |
| MMLU-Pro | 66.9% | 61.5% | 79.0% | 80.3% | 73.9% | 84.0% | 73.4% |
| AMC23 | 83.1% | 93.5% | 98.5% | 92.5% | 95.0% | 99.0% | 95.0% |
| LiveCodeBench | 44.8% | 53.2% | 65.9% | 53.8% | 62.4% | 65.9% | 54.6% |
Apriel-15B demonstrates strong performance on math-intensive tasks (MATH-500, AMC23) and is competitive on AIME challenges, while lagging the largest distillation models on code execution.
Token Efficiency
On problem sets such as AIME-24, AIME-25, GPQA-Diamond, and MATH-500, Apriel-15B uses 8–10K tokens for reasoning ("thinking") per instance, whereas comparably performing 32B models consume 13–20K tokens per instance—a 30–50% reduction in token usage.
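Taking the midpoints of the two reported ranges gives an estimate consistent with the quoted 30-50% saving (an approximation, since per-benchmark averages are not broken out here):

```latex
% Midpoint-based estimate of the relative token saving (approximation).
\[
1 - \frac{9\,\text{K}}{16.5\,\text{K}} \approx 0.45
\quad\Longrightarrow\quad \text{roughly a 45\% reduction at the range midpoints}
\]
```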
5. Ablation Studies and Training Regimen Analysis
Detailed ablation experiments delineate the effect of each stage in the training pipeline:
Upscaling vs. Training from Scratch
Depth upscaling by layer duplication, followed by pretraining on 100B tokens, delivers 5-10 percentage-point absolute gains over the 12B Mistral baseline on GSM8K, HumanEval, BBH, and MMLU.
Continual Pre-Training (CPT) Impact
| Benchmark | Before CPT | After CPT | Change |
|---|---|---|---|
| ARC | 68.5% | 65.3% | −3.2 |
| GSM8K | 78.5% | 82.0% | +3.5 |
| HumanEval | 85.3% | 83.5% | −1.8 |
| BBH | 57.5% | 57.2% | −0.3 |
| GPQA | 31.6% | 32.7% | +1.1 |
| Minerva Math | 45.3% | 49.2% | +3.9 |
| IF-Eval | 22.4% | 36.8% | +14.4 |
| Average | 57.3% | 58.1% | +0.8 |
CPT produces consistent gains on math and instruction-following tasks (GSM8K, Minerva Math, IF-Eval), at the cost of minor regressions on some knowledge and coding benchmarks (ARC, HumanEval, BBH).
SFT on CPT vs. Pre-CPT
| Benchmark | SFT before CPT | SFT after CPT | Change |
|---|---|---|---|
| GPQA Diamond | 37.2% | 46.5% | +9.3 |
| MATH-500 | 80.4% | 90.8% | +10.4 |
| AIME-24 | 16.0% | 58.0% | +42.0 |
| AIME-25 | 18.4% | 46.0% | +27.6 |
| AMC23 | 59.5% | 96.0% | +36.5 |
This demonstrates that CPT can substantially enhance the effectiveness of SFT, especially for advanced mathematical reasoning.
6. Implications, Use Cases, and Open Questions
Apriel-Nemotron-15B-Thinker’s halved resource requirements vis-à-vis mainstream 32B models enable practical deployment on a single H100 or affordable dual-GPU configurations. This reduction directly translates into decreased inference latency, lower cost, and more accessible on-premise and air-gapped deployments, with particular suitability for secure retrieval-augmented generation (RAG), automated workflow orchestration, and code generation scenarios.
Despite strong performance across reasoning and mathematical tasks, there are persistent limitations, including:
- Inferior code execution ability on benchmarks such as LiveCodeBench (54.6%) compared to distilled 32B models (>65%)
- Absence of width scaling or sparse/MoE layers, which may yield additional gains at the 15B parameter scale
- RL rewards remain hand-crafted; use of learned preference models may further improve instruction following and generative alignment
- Validation of quantized inference for reduced-precision (8-bit, 4-bit) operation remains incomplete
A plausible implication is that future research could address these limitations by exploring width scaling stability, automated reward design, and quantized inference strategies, potentially advancing the efficiency frontier established by Apriel-Nemotron-15B-Thinker.