Nanbeige4-3B: 3B-Scale Transformer Model
- Nanbeige4-3B is a family of small-scale, high-performing language models based on a decoder-only Transformer architecture with approximately 3 billion parameters.
- It utilizes a Fine-Grained Warmup-Stable-Decay scheduler, multi-stage SFT, and dual preference distillation to enhance token-level and sequence-level performance.
- Empirical evaluations show that Nanbeige4-3B outperforms larger models on benchmarks like AIME2024 and GPQA-Diamond, highlighting its effective scaling through methodological innovations.
Nanbeige4-3B is a family of small-scale, high-performing LLMs based on the decoder-only Transformer architecture. Designed to extend the scaling law frontier for small LLMs, Nanbeige4-3B demonstrates that with sophisticated data curation, curriculum strategies, and targeted post-training, models of this scale (≈3B parameters) can achieve or surpass the performance of significantly larger models on a variety of challenging benchmarks. All gains over prior models are attributed to pretraining data quality, specialized fine-tuning procedures, novel distillation objectives, and reinforcement learning pipelines, rather than to architectural innovations (Yang et al., 6 Dec 2025).
1. Model Structure and Representation
Nanbeige4-3B is implemented as a decoder-only Transformer encompassing approximately 3 billion parameters. The technical report does not specify the exact architectural breakdown (number of layers, hidden dimension, or attention heads). By analogy with other models in this class, such as Qwen3-4B, a typical configuration would be 30–36 layers, a hidden size close to 4096, and 32 attention heads. Nanbeige4-3B employs Rotary Position Embeddings (RoPE) extended to a context length of 64K tokens using the Adjusting Base Frequency (ABF) technique [xiong2023effectivelongcontextscalingfoundation]. No new architectural blocks—such as alternate attention mechanisms or mixture-of-experts—are introduced; improvements are entirely methodological and data-driven.
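ABF-style context extension keeps the RoPE formulation and enlarges its base frequency so that rotation angles vary more slowly, keeping distant positions distinguishable at 64K tokens. A minimal sketch of the idea (the base values and dimensions below are illustrative assumptions, not disclosed Nanbeige4-3B hyperparameters):

```python
import numpy as np

def rope_angles(positions, head_dim, base=10_000.0):
    """Per-position rotation angles for RoPE with a given base frequency."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim/2,)
    return np.outer(positions, inv_freq)                          # (seq, head_dim/2)

def apply_rope(x, base=10_000.0):
    """Rotate query/key vectors x of shape (seq, head_dim) by their position."""
    seq, head_dim = x.shape
    ang = rope_angles(np.arange(seq), head_dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# ABF-style long-context extension: keep the same rotation scheme but raise the
# base frequency so angles grow more slowly across a 64K-token window.
# (500_000 is an illustrative value, not one reported for Nanbeige4-3B.)
short_ctx = apply_rope(np.random.randn(4096, 128), base=10_000.0)
long_ctx  = apply_rope(np.random.randn(65_536, 128), base=500_000.0)
```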
2. Pretraining Methodology
2.1 Data Collection and Filtering
The pretraining corpus integrates 23T tokens drawn from web pages, scholarly PDFs, books, source code, and synthetic data (e.g., QA, chain-of-thought, textbook-style samples). A hybrid filtering pipeline combines multi-dimensional quality tagging (20 distinct quality scores on a 0–9 scale, capturing properties such as knowledge density, reasoning depth, and fluency) and retrieval-based scoring relative to a high-quality reference set. This filtering yields 12.5T "good" tokens, with 6.5T further up-sampled (≥2×) to compose the final corpus.
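A schematic of how such a hybrid filter could gate documents on both the multi-dimensional quality tags and a retrieval score against the high-quality reference set; the aggregation rule, thresholds, and embedding setup are illustrative assumptions rather than the report's implementation:

```python
import numpy as np

def hybrid_filter(doc_quality_scores, doc_embedding, reference_embeddings,
                  quality_threshold=6.0, retrieval_threshold=0.35):
    """Keep a document only if it clears both the tag-based and retrieval-based gates.

    doc_quality_scores: 20 quality tags on a 0-9 scale
                        (knowledge density, reasoning depth, fluency, ...).
    doc_embedding / reference_embeddings: unit-normalized vectors used to score
                        the document against a high-quality reference set.
    Thresholds and the mean/max aggregation are illustrative assumptions.
    """
    tag_score = float(np.mean(doc_quality_scores))                 # aggregate 0-9 quality
    retrieval_score = float(np.max(reference_embeddings @ doc_embedding))
    return tag_score >= quality_threshold and retrieval_score >= retrieval_threshold

# Example: a document with mid-to-high quality tags and a close reference match.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 256)); ref /= np.linalg.norm(ref, axis=1, keepdims=True)
doc = ref[0] + 0.1 * rng.normal(size=256); doc /= np.linalg.norm(doc)
print(hybrid_filter(rng.uniform(5, 9, size=20), doc, ref))
```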
2.2 Fine-Grained Warmup-Stable-Decay (FG-WSD) Scheduler
Pretraining employs a Fine-Grained Warmup-Stable-Decay scheduler comprising four sequential phases:
| Phase | Tokens | Description |
|---|---|---|
| Warmup | 0.1T | LR ramps from 0 to μ_max |
| Diversity-Enriched Stable | 12.4T | Constant LR (μ_max); mixed-quality data, with the mixture shifting stagewise toward high-quality data |
| High-Quality Stable | 6.5T | Constant LR (μ_max), only top‐quality data |
| Decay | 4T | LR decays from μ_max to μ_min; only high-quality data |
The learning rate follows the piecewise schedule
$$
\mu(t) =
\begin{cases}
\mu_{\max}\,\dfrac{t}{T_{\mathrm{warm}}}, & 0 \le t < T_{\mathrm{warm}},\\
\mu_{\max}, & T_{\mathrm{warm}} \le t < T_{\mathrm{warm}} + T_{\mathrm{stable}},\\
\mu_{\min} + (\mu_{\max}-\mu_{\min})\, d\!\left(\tfrac{t - T_{\mathrm{warm}} - T_{\mathrm{stable}}}{T_{\mathrm{decay}}}\right), & \text{otherwise},
\end{cases}
$$
where $d$ is a monotone decay profile with $d(0)=1$ and $d(1)=0$, $T_{\mathrm{warm}} = 0.1$T, $T_{\mathrm{stable}} = 18.9$T ($12.4$T $+$ $6.5$T), and $T_{\mathrm{decay}} = 4$T tokens. Within the Diversity-Enriched Stable phase, the stagewise mixture weights begin with an MQ:HQ ratio of 2:1 (a high-quality fraction of $1/3$) and shift to fully high-quality data by the end of the phase (formally, the high-quality weight is $1/3$ for the initial stage and $1.0$ thereafter), with stage boundaries validated in a toy experiment and scaled appropriately for full-model pretraining.
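A minimal sketch of this schedule as a function of tokens consumed; the phase boundaries come from the table above, while the peak/minimum learning rates and the linear decay profile are illustrative assumptions:

```python
def fg_wsd_lr(tokens_seen, mu_max=3e-4, mu_min=3e-5,
              warmup=0.1e12, stable=12.4e12 + 6.5e12, decay=4e12):
    """Fine-Grained WSD learning rate as a function of tokens consumed.

    Phase boundaries follow the FG-WSD table; the peak/minimum learning
    rates and the linear decay shape are illustrative assumptions.
    """
    if tokens_seen < warmup:                                   # Warmup: 0 -> mu_max
        return mu_max * tokens_seen / warmup
    if tokens_seen < warmup + stable:                          # Both stable phases
        return mu_max
    frac = min((tokens_seen - warmup - stable) / decay, 1.0)   # Decay: mu_max -> mu_min
    return mu_max - (mu_max - mu_min) * frac

# Warmup, mid-stable, and mid-decay learning rates (token counts in raw tokens).
print(fg_wsd_lr(0.05e12), fg_wsd_lr(10e12), fg_wsd_lr(21e12))
```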
3. Supervised Fine-Tuning and Data Refinement
SFT Regime
Nanbeige4-3B-Thinking (the "capstone" model in the family) undergoes two SFT stages:
- Cold-Start SFT: 30M QA samples (50% math, 30% science, 20% code), context length 32K.
- Full SFT: Diversified instruction mixing (40% reasoning, 30% QA/writing, 20% agent-style, 10% code), context length 64K.
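The two SFT stages above can be summarized as a configuration sketch; only the sample count, mixtures, and context lengths come from the report, while the structure and key names are illustrative:

```python
# Configuration summary of the two SFT stages. Dictionary structure and keys
# are illustrative; the sample count, mixtures, and context lengths are the
# values stated in the report.
SFT_STAGES = {
    "cold_start": {
        "samples": 30_000_000,
        "context_length": 32_768,
        "mixture": {"math": 0.50, "science": 0.30, "code": 0.20},
    },
    "full": {
        "context_length": 65_536,
        "mixture": {"reasoning": 0.40, "qa_writing": 0.30,
                    "agent": 0.20, "code": 0.10},
    },
}
```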
Deliberative Generation Refinement
Each instruction is paired with a multi-dimensional checklist (criteria: correctness, completeness, consistency, executability, safety). Candidate completions are generated by the model and one or more teacher models. Each candidate is scored against the checklist by an automatic evaluator, the evaluator's feedback is appended to the prompt, and new completions are generated iteratively until improvement saturates.
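A schematic of this refinement loop; the generator/evaluator interfaces and the plateau-based stopping rule are assumptions, since the report specifies only the overall iterate-until-saturation procedure:

```python
def deliberative_refine(instruction, checklist, generate, evaluate, max_rounds=4):
    """Iterative checklist-guided refinement.

    generate(prompt) -> candidate completion (student or teacher model call)
    evaluate(candidate, checklist) -> (score, feedback string)
    The callable signatures and the plateau-based stopping rule are
    illustrative; the report only specifies the overall loop.
    """
    prompt = instruction
    best_candidate, best_score = None, float("-inf")
    for _ in range(max_rounds):
        candidate = generate(prompt)
        score, feedback = evaluate(candidate, checklist)
        if score <= best_score:            # improvement has saturated
            break
        best_candidate, best_score = candidate, score
        # Append evaluator feedback so the next round can address the gaps.
        prompt = f"{instruction}\n\nPrevious attempt feedback:\n{feedback}"
    return best_candidate
```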
Chain-of-Thought (CoT) Reconstruction
The best solution $y^{\ast}$ is further expanded by a separate chain-completion model, which produces a summary chain $c_{\mathrm{sum}}$ and a detailed CoT $c_{\mathrm{CoT}}$. The final SFT sample is
$$
y_{\mathrm{SFT}} = c_{\mathrm{sum}} \,\|\, c_{\mathrm{CoT}} \,\|\, y^{\ast},
$$
where $\|$ denotes concatenation.
4. Preference Distillation via Dual Objectives
To align Nanbeige4-3B to both token-level likelihoods and sequence-level preferences, Dual Preference Distillation (DPD) is introduced. The Nanbeige4-3B student $\pi_{\theta}$ is distilled from a teacher model $\pi_{T}$ using two loss terms:
- Token-Level Distillation: For the best sample $y^{+}$ (generated by the teacher) and a negative sample $y^{-}$ (generated by the student), the token-level distillation loss aligns the student's per-token distributions with the teacher's along $y^{+}$:
$$
\mathcal{L}_{\mathrm{token}} = \sum_{t=1}^{|y^{+}|} D_{\mathrm{KL}}\!\left(\pi_{T}(\cdot \mid x, y^{+}_{<t}) \,\middle\|\, \pi_{\theta}(\cdot \mid x, y^{+}_{<t})\right).
$$
- Sequence-Level DPO Margin Loss: With the implicit reward $r_{\theta}(x, y) = \beta \log \dfrac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$,
$$
\mathcal{L}_{\mathrm{seq}} = -\log \sigma\!\left(r_{\theta}(x, y^{+}) - r_{\theta}(x, y^{-}) - \gamma\right),
$$
where $\gamma$ is a margin hyperparameter.
The joint DPD objective is
$$
\mathcal{L}_{\mathrm{DPD}} = \mathcal{L}_{\mathrm{token}} + \lambda\,\mathcal{L}_{\mathrm{seq}},
$$
with $\lambda$ weighting the sequence-level term.
This procedure enables the student to match the teacher both locally (token-wise distributions) and globally (ranked sequence preference), as in DPO [rafailov2023dpo].
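A compact sketch of how the two DPD terms could be combined in practice; the KL direction, margin, and weighting values below are assumptions rather than reported hyperparameters:

```python
import torch
import torch.nn.functional as F

def dpd_loss(student_logits_pos, teacher_logits_pos,
             student_logp_pos, student_logp_neg,
             ref_logp_pos, ref_logp_neg,
             beta=0.1, margin=0.5, lam=1.0):
    """Dual Preference Distillation sketch.

    student_logits_pos / teacher_logits_pos: (T, V) token logits on the
        teacher's best sample y+.
    *_logp_pos / *_logp_neg: summed sequence log-probs of y+ and y- under the
        student and a frozen reference policy (scalar tensors).
    beta, margin, and lam are illustrative hyperparameters.
    """
    # Token-level distillation: KL(teacher || student) averaged over y+ tokens.
    token_loss = F.kl_div(
        F.log_softmax(student_logits_pos, dim=-1),
        F.log_softmax(teacher_logits_pos, dim=-1),
        log_target=True, reduction="batchmean",
    )
    # Sequence-level DPO margin loss on the (y+, y-) preference pair.
    pos_reward = beta * (student_logp_pos - ref_logp_pos)
    neg_reward = beta * (student_logp_neg - ref_logp_neg)
    seq_loss = -F.logsigmoid(pos_reward - neg_reward - margin)
    return token_loss + lam * seq_loss
```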
5. Reinforcement Learning Specialization
Nanbeige4-3B utilizes a three-stage, on-policy reinforcement learning framework based on GRPO [shao2024deepseekmathpushinglimitsmathematical], with policy truncation masks as in DAPO [yu2025dapoopensourcellmreinforcement], omitting KL regularization.
Pre-stage Filtering: Before each RL stage, the current policy is rolled out on the candidate prompts (16 samples per prompt); only prompts whose avg@16 pass rate falls in [10%, 90%] are retained.
RL Stages:
- STEM RL: Math and science tasks, with binary reward via Python-based programmatic verifiers.
- Coding RL: Multi-language programming; reward 1 if the generated code passes all private test cases, 0 otherwise.
- Human Preference RL: Creative writing/dialogue, reward via a pairwise comparison model trained to match human judgments.
The RL objective is the clipped group-relative surrogate
$$
\mathcal{J}(\theta) = \mathbb{E}_{q,\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\Big)\right],
$$
with importance ratios $r_{i,t}(\theta) = \dfrac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$ and group-normalized advantages $\hat{A}_{i,t} = \dfrac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$; no KL penalty term is applied.
Learning rates and batch sizes are held constant throughout.
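The following sketch combines the pre-stage pass-rate filter with a GRPO-style, KL-free surrogate using group-normalized advantages and DAPO-style asymmetric clipping; the clip ranges and tensor shapes are illustrative assumptions:

```python
import torch

def keep_prompt(pass_results, low=0.10, high=0.90):
    """Pre-stage filtering: keep a prompt only if its avg@16 pass rate
    under the current policy falls in [10%, 90%]."""
    rate = sum(pass_results) / len(pass_results)   # e.g. 16 binary outcomes
    return low <= rate <= high

def grpo_loss(logp_new, logp_old, rewards, clip_low=0.2, clip_high=0.28):
    """GRPO-style objective without a KL penalty.

    logp_new / logp_old: (G, T) per-token log-probs of G sampled rollouts
        under the current and behavior policies.
    rewards: (G,) scalar outcome rewards (e.g. binary verifier results).
    The asymmetric clip range imitates DAPO-style clipping and is illustrative.
    """
    # Group-normalized advantages, broadcast to every token of each rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv[:, None].expand_as(logp_new)
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    # Maximize the clipped surrogate -> minimize its negation.
    return -torch.mean(torch.min(ratio * adv, clipped * adv))
```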
6. Empirical Performance and Analysis
6.1 Benchmark Evaluation
Nanbeige4-3B is evaluated against the Qwen3 (4B–32B) series across mathematics, science, coding, and human preference alignment tasks:
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3-14B | Qwen3-30B-A3B | Qwen3-32B | Nanbeige4-3B |
|---|---|---|---|---|---|---|
| AIME2025 | 81.3 | 67.3 | 70.4 | 85.0 | 72.9 | 85.6 |
| AIME2024 | 83.3 | 76.0 | 79.3 | 89.2 | 81.4 | 90.4 |
| GPQA-Diamond | 67.2 | 62.0 | 64.0 | 73.4 | 68.7 | 82.2 |
| SuperGPQA | 46.7 | 39.1 | 46.8 | 56.8 | 54.1 | 53.2 |
| BFCL-V4 | 44.9 | 42.2 | 45.4 | 48.6 | 47.9 | 53.8 |
| FullstackBench | 47.1 | 51.5 | 55.7 | 54.4 | 58.2 | 48.0 |
| ArenaHard-V2 | 40.5 | 26.4 | 39.9 | 60.0 | 48.4 | 60.0 |
| Multi-Challenge | 41.8 | 35.8 | 36.4 | 49.4 | 39.2 | 41.2 |
In mathematical and scientific reasoning tasks, Nanbeige4-3B generally surpasses all comparably sized models and equals or exceeds much larger configurations.
6.2 Ablation Effects
- FG-WSD outperforms vanilla WSD by 5–7 points on hard reasoning benchmarks (e.g., GSM8K +7.2, CMATH +5.0, BBH +2.3 on a 1B-parameter model at 1T tokens).
- SFT with deliberative refinement and CoT reconstruction improves Arena-Hard-V2 by 16% absolute.
- DPD provides relative gains: AIME24/25 +8%, GPQA +10%, BFCL-V4 +30%.
- RL stages yield further domain-specific improvements: STEM RL (+2–3 AIME), Coding RL (+4 points pass@1), Preference RL (+5% Arena-Hard).
7. Mechanistic Insights and Open Questions
The chief performance gains are attributable to (a) FG-WSD data curriculum (+5–7 points over vanilla WSD for complex reasoning), (b) iterative SFT refinement (+16% Arena-Hard), (c) dual-level distillation strategies (+8–30% across tasks), and (d) targeted RL specialization (+3–5 points per domain). No novel model structures contribute to these improvements.
Limitations include nondisclosure of detailed architectural hyperparameters, significant reliance on large teacher models and intricate SFT criteria (potentially restricting low-resource adaptation), and substantial computational overhead in filtering and multi-stage training. A plausible implication is that further progress may require streamlining the data- and compute-intensive pipeline or integrating more sample-efficient adaptation techniques. Future work is envisioned to push small-model capabilities further into autonomous software engineering, research agent tasks, and sophisticated multi-tool environments.
Further details, model checkpoints, and benchmark breakdowns are available at https://huggingface.co/Nanbeige and in the full technical report (Yang et al., 6 Dec 2025).