
Nanbeige4-3B: 3B-Scale Transformer Model

Updated 13 December 2025
  • Nanbeige4-3B is a family of small-scale, high-performing language models based on a decoder-only Transformer architecture with approximately 3 billion parameters.
  • It utilizes a Fine-Grained Warmup-Stable-Decay scheduler, multi-stage SFT, and dual preference distillation to enhance token-level and sequence-level performance.
  • Empirical evaluations show that Nanbeige4-3B outperforms larger models on benchmarks like AIME2024 and GPQA-Diamond, highlighting its effective scaling through methodological innovations.

Nanbeige4-3B is a family of small-scale, high-performing LLMs based on the decoder-only Transformer architecture. Designed to extend the scaling law frontier for small LLMs, Nanbeige4-3B demonstrates that with sophisticated data curation, curriculum strategies, and targeted post-training, models of this scale (≈3B parameters) can achieve or surpass the performance of significantly larger models on a variety of challenging benchmarks. All gains over prior models are attributed to pretraining data quality, specialized fine-tuning procedures, novel distillation objectives, and reinforcement learning pipelines, rather than to architectural innovations (Yang et al., 6 Dec 2025).

1. Model Structure and Representation

Nanbeige4-3B is implemented as a decoder-only Transformer encompassing approximately 3 billion parameters. The technical report does not specify the exact architectural breakdown (number of layers, hidden dimension, or attention heads). By analogy with other models in this class, such as Qwen3-4B, a typical configuration would be 30–36 layers, a hidden size close to 4096, and 32 attention heads. Nanbeige4-3B employs Rotary Position Embeddings (RoPE) extended to a context length of 64K tokens using the Adjusting Base Frequency (ABF) technique [xiong2023effectivelongcontextscalingfoundation]. No new architectural blocks—such as alternate attention mechanisms or mixture-of-experts—are introduced; improvements are entirely methodological and data-driven.
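The report cites ABF only by reference, so the sketch below illustrates the general idea of adjusting the RoPE base frequency to support a 64K context. The base value of 500,000 and the NumPy helper names are illustrative assumptions, not details disclosed for Nanbeige4-3B.

```python
# Illustrative sketch of RoPE with an Adjusted Base Frequency (ABF) for
# long-context extension. The base of 500_000 is an assumption; the report
# does not disclose the value used.
import numpy as np

def rope_frequencies(head_dim: int, base: float = 500_000.0) -> np.ndarray:
    """Per-pair inverse frequencies theta_i = base^(-2i/d) used by RoPE."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 500_000.0) -> np.ndarray:
    """Rotate query/key vectors x of shape (seq, head_dim) by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = rope_frequencies(d, base)               # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Raising the base slows the rotation at large positions, which is the essence
# of ABF: positions up to 64K tokens remain distinguishable without changing
# the embedding scheme itself.
q = np.random.randn(8, 64)
q_rot = apply_rope(q, np.arange(8))
```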

2. Pretraining Methodology

2.1 Data Collection and Filtering

The pretraining corpus integrates 23T tokens drawn from web pages, scholarly PDFs, books, source code, and synthetic data (e.g., QA, chain-of-thought, textbook-style samples). A hybrid filtering pipeline combines multi-dimensional quality tagging (20 distinct quality scores on a 0–9 scale, capturing properties such as knowledge density, reasoning depth, and fluency) and retrieval-based scoring relative to a high-quality reference set. This filtering yields 12.5T "good" tokens, with 6.5T further up-sampled (≥2×) to compose the final corpus.
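A minimal sketch of how such a hybrid filter might be wired together is shown below; the thresholds, field names, and the two-tier up-sampling rule are assumptions for illustration, not parameters from the report.

```python
# Sketch of the hybrid filtering idea: combine multi-dimensional quality tags
# (0-9 scales) with a retrieval-based similarity score against a high-quality
# reference set. All thresholds and weights below are assumed, not reported.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    quality_tags: dict      # e.g. {"knowledge_density": 7, "reasoning_depth": 5, ...}
    retrieval_score: float  # similarity to the high-quality reference set, in [0, 1]

def _avg_tag(doc: Document) -> float:
    return sum(doc.quality_tags.values()) / max(len(doc.quality_tags), 1)

def keep_document(doc: Document, min_avg_tag: float = 5.0, min_retrieval: float = 0.3) -> bool:
    """Retain a document only if both signals clear their (assumed) thresholds."""
    return _avg_tag(doc) >= min_avg_tag and doc.retrieval_score >= min_retrieval

def upsample_factor(doc: Document, top_avg_tag: float = 8.0) -> int:
    """Top-quality documents are repeated (>=2x) when composing the final mixture."""
    return 2 if _avg_tag(doc) >= top_avg_tag else 1
```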

2.2 Fine-Grained Warmup-Stable-Decay (FG-WSD) Scheduler

Pretraining employs a Fine-Grained Warmup-Stable-Decay scheduler comprising four sequential phases:

| Phase | Tokens | Description |
|---|---|---|
| Warmup | 0.1T | LR ramps from $0$ to $\mu_{\max}$ |
| Diversity-Enriched Stable | 12.4T | Constant LR ($\mu_{\max}$), mixed-quality data; quality weighting shifts |
| High-Quality Stable | 6.5T | Constant LR ($\mu_{\max}$), only top-quality data |
| Decay | 4T | LR decays from $\mu_{\max}$ to $\mu_{\min}$; only high-quality data |

The learning rate $\mu(t)$ follows

$$\mu(t) = \begin{cases} \dfrac{t}{T_{\rm warm}}\,\mu_{\max}, & 0 \le t < T_{\rm warm} \\ \mu_{\max}, & T_{\rm warm} \le t < T_{\rm warm} + T_{\rm div} + T_{\rm HQ} \\ \mu_{\max}\left(1 - \dfrac{t - (T_{\rm warm} + T_{\rm div} + T_{\rm HQ})}{T_{\rm decay}}\right), & \text{otherwise} \end{cases}$$

with $\mu_{\max} = 4.5 \times 10^{-4}$ and $\mu_{\min} = 1.5 \times 10^{-6}$. Within the Diversity-Enriched Stable phase, stagewise mixture weights begin at an MQ:HQ ratio of 2:1 and shift to fully high-quality data by the end of the phase (formally, $w_{\mathrm{HQ}}(s) = \alpha$ for the initial stage, then $1.0$ thereafter, with $\alpha = 1/3$ in the toy experiment and appropriately scaled for full-model pretraining).
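The schedule can be expressed directly as a function of the token count, as in the sketch below; clamping at $\mu_{\min}$ is an assumption used to reconcile the linear decay formula with the stated floor.

```python
# Learning-rate schedule for the four FG-WSD phases, following the piecewise
# formula above. Phase lengths are in tokens; mapping tokens to optimizer steps
# is an implementation detail omitted here.
MU_MAX, MU_MIN = 4.5e-4, 1.5e-6
T_WARM, T_DIV, T_HQ, T_DECAY = 0.1e12, 12.4e12, 6.5e12, 4e12

def fg_wsd_lr(t: float) -> float:
    """Learning rate after t training tokens under the FG-WSD schedule."""
    stable_end = T_WARM + T_DIV + T_HQ
    if t < T_WARM:                        # warmup: linear ramp to mu_max
        return (t / T_WARM) * MU_MAX
    if t < stable_end:                    # both stable phases: constant mu_max
        return MU_MAX
    frac = (t - stable_end) / T_DECAY     # decay: linear anneal toward the floor
    return max(MU_MAX * (1.0 - frac), MU_MIN)   # clamp at mu_min (assumption)

# Example: learning rate midway through the decay phase.
print(fg_wsd_lr(T_WARM + T_DIV + T_HQ + 0.5 * T_DECAY))  # 2.25e-04
```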

3. Supervised Fine-Tuning and Data Refinement

SFT Regime

Nanbeige4-3B-Thinking (the "capstone" model in the family) undergoes two SFT stages (a sampling sketch follows the list):

  • Cold-Start SFT: 30M QA samples (50% math, 30% science, 20% code), context length 32K.
  • Full SFT: Diversified instruction mixing (40% reasoning, 30% QA/writing, 20% agent-style, 10% code), context length 64K.
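The two stage mixtures above can be written as a small sampling configuration. Only the percentages and context lengths come from the report; the category keys and the sampler itself are illustrative.

```python
# Sketch of the two SFT stage mixtures as sampling configurations.
import random

SFT_STAGES = {
    "cold_start": {"context_length": 32_768,
                   "mix": {"math": 0.5, "science": 0.3, "code": 0.2}},
    "full":       {"context_length": 65_536,
                   "mix": {"reasoning": 0.4, "qa_writing": 0.3,
                           "agent": 0.2, "code": 0.1}},
}

def sample_domain(stage: str, rng: random.Random) -> str:
    """Draw a data domain for one training example according to the stage mixture."""
    mix = SFT_STAGES[stage]["mix"]
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]

rng = random.Random(0)
print(sample_domain("cold_start", rng))
```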

Deliberative Generation Refinement

Each instruction $I$ is paired with a multi-dimensional checklist $C_I$ (criteria: correctness, completeness, consistency, executability, safety). Candidate completions $\{y_i\}$ are generated by the model and one or more teachers. Each sample is evaluated against $C_I$ by an automatic evaluator, the feedback $F_i$ is appended to the prompt, and new completions are generated iteratively until improvement saturates.
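A sketch of this refinement loop is given below; the `generate` and `evaluate` callables stand in for the student/teacher models and the automatic evaluator, and the saturation criterion is an assumed simplification.

```python
# Sketch of the deliberative refinement loop: candidates are scored against the
# instruction's checklist, evaluator feedback is appended to the prompt, and
# generation repeats until the best score stops improving.
from typing import Callable, Sequence

def refine(instruction: str,
           checklist: Sequence[str],
           generate: Callable[[str], list[str]],                       # model + teacher completions
           evaluate: Callable[[str, str, Sequence[str]], tuple[float, str]],
           max_rounds: int = 4) -> str:
    prompt, best, best_score = instruction, None, float("-inf")
    for _ in range(max_rounds):
        improved = False
        for candidate in generate(prompt):
            score, feedback = evaluate(instruction, candidate, checklist)
            if score > best_score:
                # Keep the best completion and carry its feedback into the next prompt.
                best, best_score, improved = candidate, score, True
                prompt = f"{instruction}\n\n[Evaluator feedback]\n{feedback}"
        if not improved:        # improvement has saturated
            break
    return best
```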

Chain-of-Thought (CoT) Reconstruction

The best solution $\hat{y}$ is further expanded by a separate chain-completion model, which produces a summary chain $s_1$ and a detailed CoT $s_2$. The final SFT sample is

$$(I,\; s_1 \parallel s_2 \parallel \hat{y})$$

where "$\parallel$" denotes concatenation.

4. Preference Distillation via Dual Objectives

To align Nanbeige4-3B to both token-level likelihoods and sequence-level preferences, Dual Preference Distillation (DPD) is introduced. The Nanbeige4-3B student ($S_\theta$) is distilled from a teacher model ($T$) using two loss terms:

  • Token-Level Distillation: For the best sample $y^+$ (from the teacher) and a negative sample $y^-$ (from the student), the token distillation loss is $\mathcal{L}_{\rm KD}(y) = \sum_{t=1}^{|y|} \mathrm{KL}\big(p_T(y_t \mid y_{<t}) \,\|\, p_S(y_t \mid y_{<t})\big)$
  • Sequence-Level DPO Margin Loss: With $r_S(y) = \log p_S(y)$,

$$\mathcal{L}_{\rm DPO}(y^+, y^-) = -\log \sigma\big(r_S(y^+) - r_S(y^-) - \delta\big)$$

where $\delta$ is a margin hyperparameter.

The joint DPD objective is

$$\mathcal{L}_{\rm DPD} = \mathcal{L}_{\rm KD}(y^+) + \beta \, \mathcal{L}_{\rm KD}(y^-) + \lambda \, \mathcal{L}_{\rm DPO}(y^+, y^-)$$

This procedure enables the student to match the teacher both locally (token-wise distributions) and globally (ranked sequence preference), as in DPO [rafailov2023dpo].
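A minimal PyTorch sketch of the DPD objective is shown below, assuming single-sequence tensors and illustrative values for $\beta$, $\lambda$, and $\delta$; the report specifies the loss structure but not these hyperparameters or tensor layouts.

```python
# Sketch of the Dual Preference Distillation objective.
import torch
import torch.nn.functional as F

def token_kd_loss(teacher_logits: torch.Tensor, student_logits: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """sum_t KL(p_T(. | y_<t) || p_S(. | y_<t)) over non-padded positions.
    logits: (seq, vocab); mask: (seq,) with 1 for real tokens."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    kl_per_token = (p_t * (log_p_t - log_p_s)).sum(-1)   # (seq,)
    return (kl_per_token * mask).sum()

def seq_logprob(student_logits: torch.Tensor, tokens: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
    """r_S(y) = log p_S(y): sum of per-token log-probs of the realized tokens."""
    log_p = F.log_softmax(student_logits, dim=-1)
    tok_lp = log_p.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return (tok_lp * mask).sum()

def dpd_loss(t_pos, s_pos, y_pos, m_pos,            # teacher/student logits, tokens, mask for y+
             t_neg, s_neg, y_neg, m_neg,            # same for y-
             beta: float = 0.1, lam: float = 1.0, delta: float = 0.5) -> torch.Tensor:
    kd_pos = token_kd_loss(t_pos, s_pos, m_pos)
    kd_neg = token_kd_loss(t_neg, s_neg, m_neg)
    dpo = -F.logsigmoid(seq_logprob(s_pos, y_pos, m_pos)
                        - seq_logprob(s_neg, y_neg, m_neg) - delta)
    return kd_pos + beta * kd_neg + lam * dpo
```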

5. Reinforcement Learning Specialization

Nanbeige4-3B utilizes a three-stage, on-policy reinforcement learning framework based on GRPO [shao2024deepseekmathpushinglimitsmathematical], with policy truncation masks as in DAPO [yu2025dapoopensourcellmreinforcement], omitting KL regularization.

Pre-stage Filtering: Before each RL stage, the current policy $\pi_{\text{old}}$ is rolled out on the candidate data; only prompts whose avg@16 pass rate lies in [10%, 90%] are retained.
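A sketch of this filtering rule follows, with placeholder `rollout` and `verify` callables standing in for the policy and the programmatic verifiers.

```python
# Sketch of pre-stage filtering: roll out the current policy 16 times per prompt
# and keep only prompts whose average pass rate lies in [0.1, 0.9], i.e. neither
# trivially easy nor currently unsolvable for the policy.
from typing import Callable

def keep_prompt(prompt: str,
                rollout: Callable[[str], str],
                verify: Callable[[str, str], bool],
                n: int = 16, low: float = 0.1, high: float = 0.9) -> bool:
    passes = sum(verify(prompt, rollout(prompt)) for _ in range(n))
    return low <= passes / n <= high
```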

RL Stages:

  • STEM RL: Math and science tasks, with binary reward via Python-based programmatic verifiers.
  • Coding RL: Multi-language programming; reward $= 1$ if the generated code passes all private test cases, otherwise $0$.
  • Human Preference RL: Creative writing/dialogue; reward from a pairwise comparison model $f_{\text{pair}}(y_{\text{ref}}, y_{\text{gen}}) \in [0, 1]$ trained to match human judgments.

The RL objective is

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid I)}\big[R(I, y)\big]$$

Learning rates and batch sizes are held constant throughout.
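A hedged sketch of a GRPO-style, KL-free update with a truncation mask is shown below; the group normalization, clip range, and token-level averaging are generic choices, not values confirmed by the report.

```python
# Sketch of a GRPO-style update without a KL term: rewards within a group of
# rollouts for the same prompt are normalized to group-relative advantages, and
# a clipped policy-gradient surrogate is applied only where the mask is 1
# (e.g., masking out truncated responses, as in DAPO).
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group,) scalar rewards for rollouts of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, mask: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """logp_*: (group, seq) per-token log-probs; advantages: (group,); mask: (group, seq)."""
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(-1)                      # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)      # negate to maximize the surrogate
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```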

6. Empirical Performance and Analysis

6.1 Benchmark Evaluation

Nanbeige4-3B is evaluated against the Qwen3 (4B–32B) series across mathematics, science, coding, and human preference alignment tasks:

| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3-14B | Qwen3-30B-A3B | Qwen3-32B | Nanbeige4-3B |
|---|---|---|---|---|---|---|
| AIME2025 | 81.3 | 67.3 | 70.4 | 85.0 | 72.9 | 85.6 |
| AIME2024 | 83.3 | 76.0 | 79.3 | 89.2 | 81.4 | 90.4 |
| GPQA-Diamond | 67.2 | 62.0 | 64.0 | 73.4 | 68.7 | 82.2 |
| SuperGPQA | 46.7 | 39.1 | 46.8 | 56.8 | 54.1 | 53.2 |
| BFCL-V4 | 44.9 | 42.2 | 45.4 | 48.6 | 47.9 | 53.8 |
| FullstackBench | 47.1 | 51.5 | 55.7 | 54.4 | 58.2 | 48.0 |
| ArenaHard-V2 | 40.5 | 26.4 | 39.9 | 60.0 | 48.4 | 60.0 |
| Multi-Challenge | 41.8 | 35.8 | 36.4 | 49.4 | 39.2 | 41.2 |

In mathematical and scientific reasoning tasks, Nanbeige4-3B generally surpasses all comparably sized models and equals or exceeds much larger configurations.

6.2 Ablation Effects

  • FG-WSD outperforms vanilla WSD by 5–7 points on hard reasoning benchmarks (e.g., GSM8K +7.2, CMATH +5.0, BBH +2.3 on a 1B-parameter model at 1T tokens).
  • SFT with deliberative refinement and CoT reconstruction improves Arena-Hard-V2 by 16% absolute.
  • DPD provides relative gains: AIME24/25 +8%, GPQA +10%, BFCL-V4 +30%.
  • RL stages yield further domain-specific improvements: STEM RL (+2–3 AIME), Coding RL (+4 points pass@1), Preference RL (+5% Arena-Hard).

7. Mechanistic Insights and Open Questions

The chief performance gains are attributable to (a) FG-WSD data curriculum (+5–7 points over vanilla WSD for complex reasoning), (b) iterative SFT refinement (+16% Arena-Hard), (c) dual-level distillation strategies (+8–30% across tasks), and (d) targeted RL specialization (+3–5 points per domain). No novel model structures contribute to these improvements.

Limitations include nondisclosure of detailed architectural hyperparameters, significant reliance on large teacher models and intricate SFT criteria (potentially restricting low-resource adaptation), and substantial computational overhead in filtering and multi-stage training. A plausible implication is that further progress may require streamlining the data- and compute-intensive pipeline or integrating more sample-efficient adaptation techniques. Future work is envisioned to push small-model capabilities further into autonomous software engineering, research agent tasks, and sophisticated multi-tool environments.


Further details, model checkpoints, and benchmark breakdowns are available at https://huggingface.co/Nanbeige and in the full technical report (Yang et al., 6 Dec 2025).
