Apriel-Nemotron-15B-Thinker
- Apriel-Nemotron-15B-Thinker is a 15-billion parameter transformer that uses depth-scaling of the Mistral-Nemo architecture to achieve robust reasoning and efficient training.
- It employs a four-stage pipeline of base-model upscaling, continual pre-training, supervised fine-tuning, and GRPO reinforcement learning to optimize advanced reasoning and automation tasks.
- The model halves memory and compute requirements relative to 32B models, enabling on-premise deployment for code generation, complex mathematics, and retrieval-augmented applications.
Apriel-Nemotron-15B-Thinker is a 15-billion parameter transformer model developed within the ServiceNow Apriel SLM series, designed to deliver enterprise-grade reasoning capabilities—spanning code, mathematics, and diverse workflow automation—while halving the memory and compute requirements typical of contemporary 32-billion parameter models. Its design, training regimen, and evaluation emphasize resource efficiency, modular upscaling, and robust performance across standard academic and industrial benchmarks (Radhakrishna et al., 13 Aug 2025).
1. Model Architecture and Upscaling
Apriel-Nemotron-15B-Thinker extends the Mistral-Nemo-Base-2407 architecture, itself a 12-billion-parameter (12B) decoder-only transformer. The main architectural parameters are:
- Hidden dimension: 5120
- Attention heads: 32
- Feed-forward inner dimension: 14336
- Attention: multi-head self-attention with rotary position embeddings
The transition from 12B to 15B parameters was achieved exclusively by depth-scaling—specifically, by duplicating selected intermediate transformer layers without altering the width (i.e., hidden size or MLP expansion). This particular choice was driven by experimental observations: width-scaling increased training instability in early trials, whereas layer duplication consistently produced low initial loss and smooth convergence. The final 15B model comprises three additional duplicated transformer layers, maintaining identical per-layer structure and dimensions across the model.
There are no further architectural modifications relative to the base—no mixture-of-experts (MoE), sparse attention, or novel sub-blocks. Retaining the vanilla Mistral block architecture was intended to streamline checkpoint merging and model averaging throughout the subsequent training pipeline.
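As an illustration of this depth-upscaling strategy, the sketch below duplicates selected decoder layers of a Hugging Face causal LM while leaving layer width untouched. The duplicated layer indices are hypothetical placeholders; the paper does not specify which layers were copied, and this is not the authors' actual upscaling code.

```python
# Minimal sketch of depth-upscaling by layer duplication (hypothetical layer
# indices; not the published Apriel upscaling recipe).
import copy

import torch
from transformers import AutoModelForCausalLM


def upscale_by_layer_duplication(model, layers_to_duplicate):
    """Duplicate selected decoder layers in place, keeping layer width unchanged."""
    new_layers = []
    for idx, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if idx in layers_to_duplicate:
            # The copy starts from the same weights but is trained independently
            # during continual pre-training.
            new_layers.append(copy.deepcopy(layer))
    for i, layer in enumerate(new_layers):
        # Re-index attention modules so KV-cache bookkeeping stays consistent.
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = i
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    return model


base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Base-2407", torch_dtype=torch.bfloat16
)
upscaled = upscale_by_layer_duplication(base, layers_to_duplicate={12, 20, 28})
```

Because each duplicate starts from the weights of its source layer, the upscaled network begins training close to the base model's loss, which matches the smooth-convergence behavior reported above.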
2. Four-Stage Training Pipeline
The Apriel-Nemotron-15B-Thinker training process comprises four major stages, each contributing distinct competencies and specialization:
Stage 1: Base Model Upscaling
- Loss: next-token prediction (causal language modeling objective)
- Dataset: ~100B tokens, sourced evenly between a Mistral-Nemo style web/text/code mixture and additional curated data (high-quality web content, StackExchange, multilingual code, technical/mathematical literature)
- Batch size: 768 sequences of 16K tokens each (~12M tokens per batch; see the arithmetic sketched after this list), trained with a fixed learning rate.
- Result: the "Apriel-Nemotron-15B-Base" checkpoint serves as the foundation for downstream specialization.
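As referenced above, a rough consistency check of the batch geometry (assuming "16K" denotes 16,384-token sequences) recovers the stated ~12M tokens per batch and implies on the order of 8,000 optimizer steps for the ~100B-token budget:

```latex
% Tokens per optimizer step and the implied step count for the ~100B-token budget.
\[
768 \times 16{,}384 \approx 1.26\times 10^{7}\ \text{tokens/step},
\qquad
\frac{10^{11}\ \text{tokens}}{1.26\times 10^{7}\ \text{tokens/step}} \approx 8{,}000\ \text{steps}
\]
```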
Stage 2: Continual Pre-Training (CPT)
- Objective: to enhance complex reasoning and chain-of-thought (CoT) capacities.
- Composition: data are packed to 16K tokens with cross-document attention masking; split as 60% explicit reasoning examples, 25% chain-of-thought traces, and 15% generic replay from Stage 1.
- Learning schedule: 68B tokens processed with the AdamW optimizer (weight decay 0.1), a cosine learning-rate decay, and 10% linear warmup.
- Model merging: three evenly-spaced checkpoints are averaged to produce "Apriel-CPT".
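Uniform checkpoint averaging of this kind reduces to a parameter-wise mean over the selected state dicts. The sketch below is a generic illustration with placeholder file names, not the merging utility used to produce Apriel-CPT.

```python
# Minimal sketch of uniform checkpoint averaging ("model merging"); file names
# are placeholders, not the actual Apriel-CPT artifacts.
import torch


def average_checkpoints(paths):
    """Return a state dict whose tensors are the element-wise mean of the inputs."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }


merged = average_checkpoints(
    ["cpt_checkpoint_a.pt", "cpt_checkpoint_b.pt", "cpt_checkpoint_c.pt"]
)
torch.save(merged, "apriel_cpt_merged.pt")
```

The same parameter-wise averaging underlies the SFT and RL merges described in the later stages.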
Stage 3: Supervised Fine-Tuning (SFT)
- Objective: explicit training for reasoning traces, advanced mathematical problem-solving, function calling, retrieval-augmented generation (RAG), and formal coding tasks.
- Datasets:
- Balanced SFT: 1M samples over three epochs (mixed instructions, retrieval, coding, multi-turn dialogue)
- Math SFT: 200K samples over eight epochs (emphasis on advanced/olympiad math, ≥3 generated solutions per prompt)
- Small reasoning SFT: 15K samples, used primarily for comparative evaluations of CPT effectiveness
- Training: sequences of up to 32K tokens, with supervised loss computed only for tokens corresponding to the "assistant" role (see the masking sketch after this list).
- Model merging: the best checkpoints from balanced (A) and math-specialized (B) SFT are equally combined (C) for further use.
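The assistant-only supervision noted in the training bullet above is commonly implemented by setting labels outside assistant spans to the cross-entropy ignore index. The sketch below assumes such a masking scheme and a precomputed boolean mask of assistant-role tokens; it is not the exact Apriel training code.

```python
# Minimal sketch of assistant-only loss masking for SFT (assumed implementation;
# producing the assistant-token mask depends on the chat template in use).
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # labels with this value are skipped by cross_entropy


def sft_loss(logits, labels, assistant_mask):
    """Cross-entropy over next-token predictions, restricted to assistant tokens.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) token ids;
    assistant_mask: (batch, seq_len) bool, True where the token belongs to an
    assistant turn.
    """
    labels = labels.masked_fill(~assistant_mask, IGNORE_INDEX)
    # Standard causal shift: position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```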
Stage 4: Reinforcement Learning with Group Relative Policy Optimization (GRPO)
- RL Algorithm: GRPO, optimizing the reward-augmented log-likelihood penalized by KL divergence from the SFT prior (formalized in the sketch after this list).
- Rewards:
- Output formatting adherence
- Correctness in advanced math across filtered prompts (18K kept where at least one of eight samples was correct and three were wrong)
- Precise instruction following (14K synthetic, compositional prompts)
- Python/JS code: proportion of test cases passed
- Agentic tool invocation tasks (32K single-turn prompts)
- Sampling: eight candidate completions per prompt (temperature = 1.0, top-p = 0.95), with a batch size of 512 and a KL penalty coefficient of 0.001.
- Model merges: performed at multiple stages, culminating in a final model formed as a convex combination of the last three RL checkpoints.
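The GRPO objective referenced at the top of this list can be written in its standard form as a group-relative advantage combined with a clipped policy-ratio term and a KL penalty toward the SFT policy. The notation below is the common formulation (with G = 8 samples and β = 0.001 as listed above), not a verbatim reproduction of the paper's equations.

```latex
% Group-relative advantage over G sampled completions per prompt; clipped
% policy-gradient term with a KL penalty toward the SFT reference policy
% (G = 8 and beta = 0.001 in the setup described above).
\[
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
\]
\[
\mathcal{J}(\theta) =
\mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^{G}
\min\bigl(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\bigr) \right]
\;-\; \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\bigr)
\]
```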
3. Memory Footprint and Computational Efficiency
The model's central design goal is a significant reduction in resource consumption relative to 30–32B parameter baselines. Empirical measurements demonstrate:
- Apriel-15B supports inference and training on a single NVIDIA H100 (80GB) or dual consumer-grade GPUs (2 × 48GB), while comparable 32B models require at least dual 80GB H100s or advanced sharding.
- Memory usage (fp16, optimizer, activations):
- Apriel-15B: approximately 90GB peak, reduced to 60GB with ZeRO-1 or data parallelism
- 32B models: typically ~180GB, requiring specialized hardware or heavy pipeline parallelism
| Model | Parameters | Raw fp16 Weights | Total Memory (with optimizer) | Typical Inference GPU |
|---|---|---|---|---|
| O1-mini (est.) | 30 B | 60 GB | ~120 GB | 2 × 80 GB |
| QWQ-32B | 32 B | 64 GB | ~130 GB | 2 × 80 GB |
| ExaOne-Deep-32B | 32 B | 64 GB | ~130 GB | 2 × 80 GB |
| Apriel-Nemotron-15B | 15 B | 30 GB | ~60 GB | 1 × 80 GB or 2 × 48 GB |
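The raw fp16 column follows directly from two bytes per parameter; the figures below are a rough consistency check of the table (the reported totals are roughly twice the raw weights), not additional measurements from the paper.

```latex
% Raw fp16 weight footprint at 2 bytes per parameter; reported totals in the
% table are roughly twice the raw weights (approximation, not a measurement).
\[
15\times 10^{9}\ \text{params} \times 2\ \tfrac{\text{bytes}}{\text{param}} = 30\ \text{GB},
\qquad
\text{reported total} \approx 2 \times 30\ \text{GB} \approx 60\ \text{GB}
\]
```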
This resource profile enables deployments in environments where larger models are infeasible, such as on-premise or air-gapped enterprise settings.
4. Empirical Benchmarking and Performance
Apriel-Nemotron-15B-Thinker was evaluated under zero-shot settings (sampling temperature T=0.6, max 32K tokens) across a broad suite of enterprise and academic reasoning benchmarks. Key results include:
Enterprise Benchmarks
| Benchmark | Flash-3 | Nano-8B | QWQ-32B | O1-mini | ExaOne-32B | Apriel-15B |
|---|---|---|---|---|---|---|
| MBPP (pass@1) | 80.2% | 73.8% | 88.2% | 93.1% | 76.8% | 85.8% |
| BFCL-live-V2 | 77.4% | 54.2% | 79.0% | 81.0% | 75.4% | 75.4% |
| Enterprise RAG | 57.9% | 11.1% | 65.2% | 66.5% | 52.1% | 69.2% |
| MT-Bench | 8.43 | 7.43 | 8.46 | 8.38 | 8.39 | 8.57 |
| MixEval | 79.1% | 62.1% | 77.3% | 82.9% | 80.6% | 82.8% |
| IFEval | 81.6% | 69.8% | 82.8% | 79.5% | 83.1% | 84.6% |
| MultiChallenge | 24.5% | 16.1% | 37.7% | 30.8% | 38.5% | 36.6% |
Apriel-15B leads on Enterprise RAG, MT-Bench, and IFEval, places second on MixEval, and remains competitive with the 32B models on MBPP, BFCL, and MultiChallenge.
Academic Reasoning Benchmarks
| Benchmark | Flash-3 | Nano-8B | QWQ-32B | O1-mini | ExaOne-32B | Distill-QWQ-32B | Apriel-15B |
|---|---|---|---|---|---|---|---|
| GPQA-Diamond | 53.5% | 54.0% | 66.7% | 60.0% | 65.2% | 71.5% | 57.4% |
| MATH-500 | 84.0% | 89.6% | 90.8% | 90.0% | 91.6% | 97.5% | 91.6% |
| AIME’24 | 38.7% | 62.0% | 81.3% | 63.6% | 76.0% | 79.8% | 73.3% |
| AIME’25 | 25.3% | 48.7% | 68.7% | 54.8% | 64.7% | 66.8% | 60.0% |
| MMLU-Pro | 66.9% | 61.5% | 79.0% | 80.3% | 73.9% | 84.0% | 73.4% |
| AMC23 | 83.1% | 93.5% | 98.5% | 92.5% | 95.0% | 99.0% | 95.0% |
| LiveCodeBench | 44.8% | 53.2% | 65.9% | 53.8% | 62.4% | 65.9% | 54.6% |
Apriel-15B demonstrates strong performance on math-intensive tasks (MATH-500, AMC23) and is competitive on AIME challenges, while lagging the largest distillation models on code execution.
Token Efficiency
On problem sets such as AIME-24, AIME-25, GPQA-Diamond, and MATH-500, Apriel-15B uses 8–10K tokens for reasoning ("thinking") per instance, whereas comparably performing 32B models consume 13–20K tokens per instance—a 30–50% reduction in token usage.
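Taking the midpoints of the two reported ranges gives an estimate consistent with the quoted 30-50% saving (an approximation, since per-benchmark averages are not broken out here):

```latex
% Midpoint-based estimate of the relative token saving (approximation).
\[
1 - \frac{9\,\text{K}}{16.5\,\text{K}} \approx 0.45
\quad\Longrightarrow\quad \text{roughly a 45\% reduction at the range midpoints}
\]
```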
5. Ablation Studies and Training Regimen Analysis
Detailed ablation experiments delineate the effect of each stage in the training pipeline:
Upscaling vs. Training from Scratch
Depth upscaling by layer duplication, followed by pretraining on 100B tokens, delivers 5-10 percentage-point absolute gains over the 12B Mistral baseline on GSM8K, HumanEval, BBH, and MMLU.
Continual Pre-Training (CPT) Impact
| Benchmark | Before CPT | After CPT | Change |
|---|---|---|---|
| ARC | 68.5% | 65.3% | −3.2 |
| GSM8K | 78.5% | 82.0% | +3.5 |
| HumanEval | 85.3% | 83.5% | −1.8 |
| BBH | 57.5% | 57.2% | −0.3 |
| GPQA | 31.6% | 32.7% | +1.1 |
| Minerva Math | 45.3% | 49.2% | +3.9 |
| IF-Eval | 22.4% | 36.8% | +14.4 |
| Average | 57.3% | 58.1% | +0.8 |
CPT produces consistent gains on math and instruction-following tasks (GSM8K, Minerva Math, IF-Eval), at the cost of minor regressions on some knowledge and coding benchmarks (ARC, HumanEval, BBH).
SFT on CPT vs. Pre-CPT
| Benchmark | SFT before CPT | SFT after CPT | Change |
|---|---|---|---|
| GPQA Diamond | 37.2% | 46.5% | +9.3 |
| MATH-500 | 80.4% | 90.8% | +10.4 |
| AIME-24 | 16.0% | 58.0% | +42.0 |
| AIME-25 | 18.4% | 46.0% | +27.6 |
| AMC23 | 59.5% | 96.0% | +36.5 |
This demonstrates that CPT can substantially enhance the effectiveness of SFT, especially for advanced mathematical reasoning.
6. Implications, Use Cases, and Open Questions
Apriel-Nemotron-15B-Thinker’s halved resource requirements vis-à-vis mainstream 32B models enable practical deployment on a single H100 or affordable dual-GPU configurations. This reduction directly translates into decreased inference latency, lower cost, and more accessible on-premise and air-gapped deployments, with particular suitability for secure retrieval-augmented generation (RAG), automated workflow orchestration, and code generation scenarios.
Despite strong performance across reasoning and mathematical tasks, there are persistent limitations, including:
- Inferior code execution ability on benchmarks such as LiveCodeBench (54.6%) compared to distilled 32B models (>65%)
- Absence of width scaling or sparse/MoE layers, which may yield additional gains at the 15B parameter scale
- RL rewards remain hand-crafted; use of learned preference models may further improve instruction following and generative alignment
- Validation of quantized inference for reduced-precision (8-bit, 4-bit) operation remains incomplete
A plausible implication is that future research could address these limitations by exploring width scaling stability, automated reward design, and quantized inference strategies, potentially advancing the efficiency frontier established by Apriel-Nemotron-15B-Thinker.