
Nanbeige4-3B: 3B-Scale Transformer Model

Updated 13 December 2025
  • Nanbeige4-3B is a family of small-scale, high-performing language models based on a decoder-only Transformer architecture with approximately 3 billion parameters.
  • It utilizes a Fine-Grained Warmup-Stable-Decay scheduler, multi-stage SFT, and dual preference distillation to enhance token-level and sequence-level performance.
  • Empirical evaluations show that Nanbeige4-3B outperforms larger models on benchmarks like AIME2024 and GPQA-Diamond, highlighting its effective scaling through methodological innovations.

Nanbeige4-3B is a family of small-scale, high-performing LLMs based on the decoder-only Transformer architecture. Designed to extend the scaling law frontier for small LLMs, Nanbeige4-3B demonstrates that with sophisticated data curation, curriculum strategies, and targeted post-training, models of this scale (≈3B parameters) can achieve or surpass the performance of significantly larger models on a variety of challenging benchmarks. All gains over prior models are attributed to pretraining data quality, specialized fine-tuning procedures, novel distillation objectives, and reinforcement learning pipelines, rather than to architectural innovations (Yang et al., 6 Dec 2025).

1. Model Structure and Representation

Nanbeige4-3B is implemented as a decoder-only Transformer encompassing approximately 3 billion parameters. The technical report does not specify the exact architectural breakdown (number of layers, hidden dimension, or attention heads). By analogy with other models in this class, such as Qwen3-4B, a typical configuration would be 30–36 layers, a hidden size close to 4096, and 32 attention heads. Nanbeige4-3B employs Rotary Position Embeddings (RoPE) extended to a context length of 64K tokens using the Adjusting Base Frequency (ABF) technique [xiong2023effectivelongcontextscalingfoundation]. No new architectural blocks—such as alternate attention mechanisms or mixture-of-experts—are introduced; improvements are entirely methodological and data-driven.
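The report cites ABF only by reference, so the sketch below illustrates the general idea of adjusting the RoPE base frequency to support a 64K context. The base value of 500,000 and the NumPy helper names are illustrative assumptions, not details disclosed for Nanbeige4-3B.

```python
# Illustrative sketch of RoPE with an Adjusted Base Frequency (ABF) for
# long-context extension. The base of 500_000 is an assumption; the report
# does not disclose the value used.
import numpy as np

def rope_frequencies(head_dim: int, base: float = 500_000.0) -> np.ndarray:
    """Per-pair inverse frequencies theta_i = base^(-2i/d) used by RoPE."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 500_000.0) -> np.ndarray:
    """Rotate query/key vectors x of shape (seq, head_dim) by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = rope_frequencies(d, base)               # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Raising the base slows the rotation at large positions, which is the essence
# of ABF: positions up to 64K tokens remain distinguishable without changing
# the embedding scheme itself.
q = np.random.randn(8, 64)
q_rot = apply_rope(q, np.arange(8))
```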

2. Pretraining Methodology

2.1 Data Collection and Filtering

The pretraining corpus integrates 23T tokens drawn from web pages, scholarly PDFs, books, source code, and synthetic data (e.g., QA, chain-of-thought, textbook-style samples). A hybrid filtering pipeline combines multi-dimensional quality tagging (20 distinct quality scores on a 0–9 scale, capturing properties such as knowledge density, reasoning depth, and fluency) and retrieval-based scoring relative to a high-quality reference set. This filtering yields 12.5T "good" tokens, with 6.5T further up-sampled (≥2×) to compose the final corpus.
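A minimal sketch of how such a hybrid filter might be wired together is shown below; the thresholds, field names, and the two-tier up-sampling rule are assumptions for illustration, not parameters from the report.

```python
# Sketch of the hybrid filtering idea: combine multi-dimensional quality tags
# (0-9 scales) with a retrieval-based similarity score against a high-quality
# reference set. All thresholds and weights below are assumed, not reported.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    quality_tags: dict      # e.g. {"knowledge_density": 7, "reasoning_depth": 5, ...}
    retrieval_score: float  # similarity to the high-quality reference set, in [0, 1]

def _avg_tag(doc: Document) -> float:
    return sum(doc.quality_tags.values()) / max(len(doc.quality_tags), 1)

def keep_document(doc: Document, min_avg_tag: float = 5.0, min_retrieval: float = 0.3) -> bool:
    """Retain a document only if both signals clear their (assumed) thresholds."""
    return _avg_tag(doc) >= min_avg_tag and doc.retrieval_score >= min_retrieval

def upsample_factor(doc: Document, top_avg_tag: float = 8.0) -> int:
    """Top-quality documents are repeated (>=2x) when composing the final mixture."""
    return 2 if _avg_tag(doc) >= top_avg_tag else 1
```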

2.2 Fine-Grained Warmup-Stable-Decay (FG-WSD) Scheduler

Pretraining employs a Fine-Grained Warmup-Stable-Decay scheduler comprising four sequential phases:

| Phase | Tokens | Description |
|---|---|---|
| Warmup | 0.1T | LR ramps from $0$ to $\mu_{\max}$ |
| Diversity-Enriched Stable | 12.4T | Constant LR ($\mu_{\max}$), mixed-quality data; quality weighting shifts |
| High-Quality Stable | 6.5T | Constant LR ($\mu_{\max}$), only top-quality data |
| Decay | 4T | LR decays from $\mu_{\max}$ to $\mu_{\min}$; only high-quality data |

The learning rate $\mu(t)$ follows

$$\mu(t) = \begin{cases} \dfrac{t}{T_{\rm warm}}\,\mu_{\max}, & 0 \le t < T_{\rm warm} \\ \mu_{\max}, & T_{\rm warm} \le t < T_{\rm warm} + T_{\rm div} + T_{\rm HQ} \\ \mu_{\max}\left(1 - \dfrac{t - (T_{\rm warm} + T_{\rm div} + T_{\rm HQ})}{T_{\rm decay}}\right), & \text{otherwise} \end{cases}$$

with $\mu_{\max} = 4.5 \times 10^{-4}$ and $\mu_{\min} = 1.5 \times 10^{-6}$. Within the Diversity-Enriched Stable phase, stagewise mixture weights begin at an MQ:HQ ratio of 2:1 and shift to fully high-quality data by the end of the phase (formally, $w_{\mathrm{HQ}}(s) = \alpha$ for the initial stage, then $1.0$ thereafter, with $\alpha = 1/3$ in the toy experiment and appropriately scaled for full-model pretraining).
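The schedule can be expressed directly as a function of the token count, as in the sketch below; clamping at $\mu_{\min}$ is an assumption used to reconcile the linear decay formula with the stated floor.

```python
# Learning-rate schedule for the four FG-WSD phases, following the piecewise
# formula above. Phase lengths are in tokens; mapping tokens to optimizer steps
# is an implementation detail omitted here.
MU_MAX, MU_MIN = 4.5e-4, 1.5e-6
T_WARM, T_DIV, T_HQ, T_DECAY = 0.1e12, 12.4e12, 6.5e12, 4e12

def fg_wsd_lr(t: float) -> float:
    """Learning rate after t training tokens under the FG-WSD schedule."""
    stable_end = T_WARM + T_DIV + T_HQ
    if t < T_WARM:                        # warmup: linear ramp to mu_max
        return (t / T_WARM) * MU_MAX
    if t < stable_end:                    # both stable phases: constant mu_max
        return MU_MAX
    frac = (t - stable_end) / T_DECAY     # decay: linear anneal toward the floor
    return max(MU_MAX * (1.0 - frac), MU_MIN)   # clamp at mu_min (assumption)

# Example: learning rate midway through the decay phase.
print(fg_wsd_lr(T_WARM + T_DIV + T_HQ + 0.5 * T_DECAY))  # 2.25e-04
```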

3. Supervised Fine-Tuning and Data Refinement

SFT Regime

Nanbeige4-3B-Thinking (the "capstone" model in the family) undergoes two SFT stages (a sampling sketch follows the list):

  • Cold-Start SFT: 30M QA samples (50% math, 30% science, 20% code), context length 32K.
  • Full SFT: Diversified instruction mixing (40% reasoning, 30% QA/writing, 20% agent-style, 10% code), context length 64K.
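The two stage mixtures above can be written as a small sampling configuration. Only the percentages and context lengths come from the report; the category keys and the sampler itself are illustrative.

```python
# Sketch of the two SFT stage mixtures as sampling configurations.
import random

SFT_STAGES = {
    "cold_start": {"context_length": 32_768,
                   "mix": {"math": 0.5, "science": 0.3, "code": 0.2}},
    "full":       {"context_length": 65_536,
                   "mix": {"reasoning": 0.4, "qa_writing": 0.3,
                           "agent": 0.2, "code": 0.1}},
}

def sample_domain(stage: str, rng: random.Random) -> str:
    """Draw a data domain for one training example according to the stage mixture."""
    mix = SFT_STAGES[stage]["mix"]
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]

rng = random.Random(0)
print(sample_domain("cold_start", rng))
```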

Deliberative Generation Refinement

Each instruction $I$ is paired with a multi-dimensional checklist $C_I$ (criteria: correctness, completeness, consistency, executability, safety). Candidate completions $\{y_i\}$ are generated by the model and one or more teachers. Each sample is evaluated against $C_I$ by an automatic evaluator, the feedback $F_i$ is appended to the prompt, and new completions are generated iteratively until improvement saturates.
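A sketch of this refinement loop is given below; the `generate` and `evaluate` callables stand in for the student/teacher models and the automatic evaluator, and the saturation criterion is an assumed simplification.

```python
# Sketch of the deliberative refinement loop: candidates are scored against the
# instruction's checklist, evaluator feedback is appended to the prompt, and
# generation repeats until the best score stops improving.
from typing import Callable, Sequence

def refine(instruction: str,
           checklist: Sequence[str],
           generate: Callable[[str], list[str]],                       # model + teacher completions
           evaluate: Callable[[str, str, Sequence[str]], tuple[float, str]],
           max_rounds: int = 4) -> str:
    prompt, best, best_score = instruction, None, float("-inf")
    for _ in range(max_rounds):
        improved = False
        for candidate in generate(prompt):
            score, feedback = evaluate(instruction, candidate, checklist)
            if score > best_score:
                # Keep the best completion and carry its feedback into the next prompt.
                best, best_score, improved = candidate, score, True
                prompt = f"{instruction}\n\n[Evaluator feedback]\n{feedback}"
        if not improved:        # improvement has saturated
            break
    return best
```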

Chain-of-Thought (CoT) Reconstruction

The best solution $\hat{y}$ is further expanded by a separate chain-completion model, which produces a summary chain $s_1$ and a detailed CoT $s_2$. The final SFT sample is

$$(I,\; s_1 \parallel s_2 \parallel \hat{y})$$

where "$\parallel$" denotes concatenation.

4. Preference Distillation via Dual Objectives

To align Nanbeige4-3B to both token-level likelihoods and sequence-level preferences, Dual Preference Distillation (DPD) is introduced. The Nanbeige4-3B student ($S_\theta$) is distilled from a teacher model ($T$) using two loss terms:

  • Token-Level Distillation: For the best sample $y^+$ (from the teacher) and a negative sample $y^-$ (from the student), the token distillation loss is $\mathcal{L}_{\rm KD}(y) = \sum_{t=1}^{|y|} \mathrm{KL}\big(p_T(y_t \mid y_{<t}) \,\|\, p_S(y_t \mid y_{<t})\big)$
  • Sequence-Level DPO Margin Loss: With $r_S(y) = \log p_S(y)$,

$$\mathcal{L}_{\rm DPO}(y^+, y^-) = -\log \sigma\big(r_S(y^+) - r_S(y^-) - \delta\big)$$

where $\delta$ is a margin hyperparameter.

The joint DPD objective is

$$\mathcal{L}_{\rm DPD} = \mathcal{L}_{\rm KD}(y^+) + \beta \, \mathcal{L}_{\rm KD}(y^-) + \lambda \, \mathcal{L}_{\rm DPO}(y^+, y^-)$$

This procedure enables the student to match the teacher both locally (token-wise distributions) and globally (ranked sequence preference), as in DPO [rafailov2023dpo].
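A minimal PyTorch sketch of the DPD objective is shown below, assuming single-sequence tensors and illustrative values for $\beta$, $\lambda$, and $\delta$; the report specifies the loss structure but not these hyperparameters or tensor layouts.

```python
# Sketch of the Dual Preference Distillation objective.
import torch
import torch.nn.functional as F

def token_kd_loss(teacher_logits: torch.Tensor, student_logits: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """sum_t KL(p_T(. | y_<t) || p_S(. | y_<t)) over non-padded positions.
    logits: (seq, vocab); mask: (seq,) with 1 for real tokens."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    kl_per_token = (p_t * (log_p_t - log_p_s)).sum(-1)   # (seq,)
    return (kl_per_token * mask).sum()

def seq_logprob(student_logits: torch.Tensor, tokens: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
    """r_S(y) = log p_S(y): sum of per-token log-probs of the realized tokens."""
    log_p = F.log_softmax(student_logits, dim=-1)
    tok_lp = log_p.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return (tok_lp * mask).sum()

def dpd_loss(t_pos, s_pos, y_pos, m_pos,            # teacher/student logits, tokens, mask for y+
             t_neg, s_neg, y_neg, m_neg,            # same for y-
             beta: float = 0.1, lam: float = 1.0, delta: float = 0.5) -> torch.Tensor:
    kd_pos = token_kd_loss(t_pos, s_pos, m_pos)
    kd_neg = token_kd_loss(t_neg, s_neg, m_neg)
    dpo = -F.logsigmoid(seq_logprob(s_pos, y_pos, m_pos)
                        - seq_logprob(s_neg, y_neg, m_neg) - delta)
    return kd_pos + beta * kd_neg + lam * dpo
```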

5. Reinforcement Learning Specialization

Nanbeige4-3B utilizes a three-stage, on-policy reinforcement learning framework based on GRPO [shao2024deepseekmathpushinglimitsmathematical], with policy truncation masks as in DAPO [yu2025dapoopensourcellmreinforcement], omitting KL regularization.

Pre-stage Filtering: Before each RL stage, the current policy $\pi_{\text{old}}$ is rolled out on the candidate data; only prompts whose avg@16 pass rate lies in [10%, 90%] are retained.
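A sketch of this filtering rule follows, with placeholder `rollout` and `verify` callables standing in for the policy and the programmatic verifiers.

```python
# Sketch of pre-stage filtering: roll out the current policy 16 times per prompt
# and keep only prompts whose average pass rate lies in [0.1, 0.9], i.e. neither
# trivially easy nor currently unsolvable for the policy.
from typing import Callable

def keep_prompt(prompt: str,
                rollout: Callable[[str], str],
                verify: Callable[[str, str], bool],
                n: int = 16, low: float = 0.1, high: float = 0.9) -> bool:
    passes = sum(verify(prompt, rollout(prompt)) for _ in range(n))
    return low <= passes / n <= high
```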

RL Stages:

  • STEM RL: Math and science tasks, with binary reward via Python-based programmatic verifiers.
  • Coding RL: Multi-language programming; reward $= 1$ if the generated code passes all private test cases, otherwise $0$.
  • Human Preference RL: Creative writing/dialogue; reward from a pairwise comparison model $f_{\text{pair}}(y_{\text{ref}}, y_{\text{gen}}) \in [0, 1]$ trained to match human judgments.

The RL objective is

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid I)}\big[R(I, y)\big]$$

Learning rates and batch sizes are held constant throughout.
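A hedged sketch of a GRPO-style, KL-free update with a truncation mask is shown below; the group normalization, clip range, and token-level averaging are generic choices, not values confirmed by the report.

```python
# Sketch of a GRPO-style update without a KL term: rewards within a group of
# rollouts for the same prompt are normalized to group-relative advantages, and
# a clipped policy-gradient surrogate is applied only where the mask is 1
# (e.g., masking out truncated responses, as in DAPO).
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group,) scalar rewards for rollouts of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, mask: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """logp_*: (group, seq) per-token log-probs; advantages: (group,); mask: (group, seq)."""
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(-1)                      # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)      # negate to maximize the surrogate
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```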

6. Empirical Performance and Analysis

6.1 Benchmark Evaluation

Nanbeige4-3B is evaluated against the Qwen3 (4B–32B) series across mathematics, science, coding, and human preference alignment tasks:

| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3-14B | Qwen3-30B-A3B | Qwen3-32B | Nanbeige4-3B |
|---|---|---|---|---|---|---|
| AIME2025 | 81.3 | 67.3 | 70.4 | 85.0 | 72.9 | 85.6 |
| AIME2024 | 83.3 | 76.0 | 79.3 | 89.2 | 81.4 | 90.4 |
| GPQA-Diamond | 67.2 | 62.0 | 64.0 | 73.4 | 68.7 | 82.2 |
| SuperGPQA | 46.7 | 39.1 | 46.8 | 56.8 | 54.1 | 53.2 |
| BFCL-V4 | 44.9 | 42.2 | 45.4 | 48.6 | 47.9 | 53.8 |
| FullstackBench | 47.1 | 51.5 | 55.7 | 54.4 | 58.2 | 48.0 |
| ArenaHard-V2 | 40.5 | 26.4 | 39.9 | 60.0 | 48.4 | 60.0 |
| Multi-Challenge | 41.8 | 35.8 | 36.4 | 49.4 | 39.2 | 41.2 |

In mathematical and scientific reasoning tasks, Nanbeige4-3B generally surpasses all comparably sized models and equals or exceeds much larger configurations.

6.2 Ablation Effects

  • FG-WSD outperforms vanilla WSD by 5–7 points on hard reasoning benchmarks (e.g., GSM8K +7.2, CMATH +5.0, BBH +2.3 on a 1B-parameter model at 1T tokens).
  • SFT with deliberative refinement and CoT reconstruction improves Arena-Hard-V2 by 16% absolute.
  • DPD provides relative gains: AIME24/25 +8%, GPQA +10%, BFCL-V4 +30%.
  • RL stages yield further domain-specific improvements: STEM RL (+2–3 AIME), Coding RL (+4 points pass@1), Preference RL (+5% Arena-Hard).

7. Mechanistic Insights and Open Questions

The chief performance gains are attributable to (a) FG-WSD data curriculum (+5–7 points over vanilla WSD for complex reasoning), (b) iterative SFT refinement (+16% Arena-Hard), (c) dual-level distillation strategies (+8–30% across tasks), and (d) targeted RL specialization (+3–5 points per domain). No novel model structures contribute to these improvements.

Limitations include nondisclosure of detailed architectural hyperparameters, significant reliance on large teacher models and intricate SFT criteria (potentially restricting low-resource adaptation), and substantial computational overhead in filtering and multi-stage training. A plausible implication is that further progress may require streamlining the data- and compute-intensive pipeline or integrating more sample-efficient adaptation techniques. Future work is envisioned to push small-model capabilities further into autonomous software engineering, research agent tasks, and sophisticated multi-tool environments.


Further details, model checkpoints, and benchmark breakdowns are available at https://huggingface.co/Nanbeige and in the full technical report (Yang et al., 6 Dec 2025).
