Ling 2.0: Scalable Sparse MoE Models
- Ling 2.0 is a series of reasoning-focused models that employ a sparse Mixture-of-Experts architecture to activate only a small subset of experts per token.
- The design integrates a transformer backbone, grouped-query attention, and a specialized tokenizer to enhance multi-modal and mathematical reasoning.
- Empirical scaling laws demonstrate up to 7× compute efficiency, setting reproducible standards for scalable, advanced reasoning in large-scale models.
Ling 2.0 designates a series of reasoning-oriented LLMs centered on scalable efficiency and sparse activation, unified under the Mixture-of-Experts (MoE) paradigm, and combined with architectural, pretraining, post-training, and infrastructure innovations. The Ling 2.0 family is engineered to maximize reasoning capability per compute unit ("every activation boosts reasoning capability"), spanning models from tens of billions to one trillion parameters, and establishing empirical, reproducible scaling laws for both computational leverage and alignment with reasoning-centric objectives (Ling-Team et al., 25 Oct 2025).
1. Historical Motivation and Conceptual Foundations
Ling 2.0 builds upon empirical findings that dense LLM scaling yields sublinear improvements in reasoning skill per additional parameter or compute unit. The series reframes model scaling through sparse activation: only a small, intelligently selected subset of the model ("experts") is actively computed per token, which, when aligned with reasoning-oriented data and objectives, produces up to 7-fold active-compute efficiency versus dense counterparts. This efficiency leverage is quantitatively modeled, validated at the trillion-parameter scale, and consistently realized across the family.
The central principle—every activation boosts reasoning capability—demands that the model's active computation at each inference step is directly engineered to maximize advanced logical, stepwise, and multi-modal reasoning rather than mere next-token prediction.
2. Core Architectural Innovations
Ling 2.0 models universally employ a transformer-based high-sparsity MoE design. Each layer contains 256 routed experts plus a single shared (global) expert. For each input token:
- Activation: Only 8 routed experts ($E_{\text{routed}}$) and one shared expert ($E_{\text{share}}$) are activated (9/257 ≈ 3.5% per layer).
- Routing: Token representations are passed through a learnable router network $R$, which yields softmax selection scores and a top-k expert assignment. The layer output is aggregated as
$$y = E_{\text{share}}(x) + \sum_{i \in \mathrm{TopK}(R(x))} \mathrm{softmax}(R(x))_i \, E_i(x),$$
where the shared expert is always applied (a minimal routing sketch appears at the end of this section).
- Additional Features:
- Multi-Token Prediction (MTP) Head: An auxiliary head for MTP loss improves math and code reasoning fidelity.
- Aux-Loss-Free Load Balancing: Expert utilization is balanced implicitly, without an explicit auxiliary balancing loss.
- Dense Layer Initialization: Early layers are dense to stabilize routing and learning.
- Grouped-Query Attention (GQA): Improves attention efficiency and reduces KV-cache memory.
The byte-based BPE tokenizer features a 156k vocabulary covering multilingual, mathematical, and code text. SwiGLU activation, RMSNorm (pre-norm), and QKNorm are employed for stability, particularly in FP8 training.
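The routing behavior above can be made concrete with a minimal sketch. The PyTorch module below is an illustrative assumption, not Ling 2.0's implementation: the class and argument names (`SparseMoELayer`, `num_routed`, `top_k`) are invented for exposition, and the per-token loop trades efficiency for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k routed MoE layer with one always-active shared expert."""

    def __init__(self, d_model: int, d_ff: int, num_routed: int = 256, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_routed, bias=False)  # learnable router R
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(ffn() for _ in range(num_routed))
        self.shared = ffn()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)       # softmax selection scores
        weights, idx = scores.topk(self.top_k, dim=-1)   # top-k expert assignment per token
        outputs = []
        for t in range(x.size(0)):                       # token-by-token for clarity, not speed
            y = self.shared(x[t])                        # shared expert is always applied
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.routed[int(e)](x[t])    # weighted sum over the activated routed experts
            outputs.append(y)
        return torch.stack(outputs)
```

With `num_routed=256` and `top_k=8`, only 9 of 257 expert FFNs run for any given token, matching the ≈3.5% per-layer activation quoted above.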
3. Empirical Scaling Laws and Efficiency Leverage
Derived from extensive experiments and formalized in the scaling-law literature (Tian et al., 23 Jul 2025), Ling 2.0 efficiency leverage (EL) is defined as the ratio of FLOPs a dense model needs to reach the same loss/performance as the MoE model:
$$\mathrm{EL} = \frac{C_{\text{dense}}(\mathcal{L})}{C_{\text{MoE}}(\mathcal{L})}.$$
The key scaling law (a joint law, validated across model sizes) can be written schematically as
$$\mathrm{EL}(A, G, C) \approx f(A) \cdot G^{\gamma_G} \cdot C^{\gamma_C},$$
with $f(A)$ a saturating function of the activation ratio $A$, $G$ the granularity (experts per FFN), $C$ the total compute, and $\gamma_G$, $\gamma_C$ empirically fitted exponents. For the Ling 2.0 configuration, this yields roughly 7× efficiency leverage at scale: the model matches or surpasses dense counterparts at about 1/7th the compute.
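As a concrete reading of the definition, the sketch below simply divides the compute a dense model would need by the compute the sparse model needs at the same loss; the FLOP figures in the example are illustrative placeholders, not measured values.

```python
def efficiency_leverage(dense_flops_to_loss: float, moe_flops_to_loss: float) -> float:
    """EL: FLOPs a dense model needs to reach a given loss, divided by the FLOPs
    the MoE model needs to reach the same loss."""
    return dense_flops_to_loss / moe_flops_to_loss

# Illustrative numbers only: if a dense model needs 7e23 FLOPs to match a loss the
# sparse model reaches with 1e23 FLOPs, the leverage is the ~7x cited above.
print(efficiency_leverage(7.0e23, 1.0e23))  # -> 7.0
```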
The "Wind Tunnel" evaluation methodology permits low-cost cross-scale validation: architectural/design features must demonstrate scalable benefit in smaller models before 1T deployment.
4. Coordinated Pipeline: Data, Training, and Reinforcement Alignment
Pre-training
- Reasoning-centric data: Dedicated math and code corpora ("Ling Math", "Ling Code") have their share progressively increased from 32% to 46% of the pre-training mixture. Benchmarks and ablations confirm that these corpora outperform open alternatives.
- CoT Mid-training: Chain-of-thought data is introduced mid-training to prime inner reasoning circuits, improving SFT/RL effectiveness.
- WSM Scheduler: Checkpoint merging replaces traditional LR decay; the two are shown to be theoretically equivalent, but merging delivers a 1–2% empirical accuracy gain (a minimal merge sketch follows this list).
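A minimal sketch of the checkpoint-merging idea behind the WSM scheduler, assuming a plain uniform average over the last few stable-LR checkpoints; the file names and the choice of a uniform average are illustrative assumptions.

```python
import torch

def merge_checkpoints(state_dicts: list[dict]) -> dict:
    """Average parameters across several late-training checkpoints into a single model,
    standing in for an explicit learning-rate decay phase (uniform weights assumed)."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Hypothetical usage: merge the last three checkpoints saved during the stable-LR phase.
# ckpts = [torch.load(f"ckpt_{step}.pt") for step in (90_000, 95_000, 100_000)]
# model.load_state_dict(merge_checkpoints(ckpts))
```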
Post-training
- Decoupled Fine-Tuning (DFT): Differential prompts enable rapid switching between instant-response and in-depth reasoning modes, strengthening base diversity for RL.
- Evolutionary CoT Reinforcement (Evo-CoT): The RL phase dynamically increases reasoning depth via a reward that balances accuracy against response length relative to question complexity; sentence-level optimization (Linguistic-unit Policy Optimization, LPO) improves granularity and stability (a toy reward sketch follows this list).
- Group Arena Reward (GAR) and RubriX: Implement arena-style intra-group evaluation with rubric-based criteria.
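The accuracy-versus-length trade-off in the Evo-CoT bullet can be illustrated with a toy reward; the functional form, the difficulty-scaled token budget, and all constants below are assumptions for exposition, not the reward used in training.

```python
def toy_evo_cot_reward(correct: bool, response_tokens: int, difficulty: float,
                       tokens_per_unit_difficulty: int = 512,
                       length_penalty: float = 0.2) -> float:
    """Toy reward: full credit for a correct answer, with a capped penalty when the
    chain of thought far exceeds a difficulty-scaled token budget (constants illustrative)."""
    budget = max(1, int(difficulty * tokens_per_unit_difficulty))
    overshoot = max(0.0, (response_tokens - budget) / budget)
    return (1.0 if correct else 0.0) - length_penalty * min(overshoot, 1.0)

# A correct but verbose answer to an easy question earns less than a concise one.
print(toy_evo_cot_reward(correct=True, response_tokens=2000, difficulty=0.5))  # 0.8
print(toy_evo_cot_reward(correct=True, response_tokens=200,  difficulty=0.5))  # 1.0
```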
Engineering
- End-to-End FP8 Training: The entire series trains in FP8 using block-wise quantization, lowering memory usage and boosting hardware utilization; empirically, the accuracy delta on Ling-1T versus BF16 is below 0.25% (a block-wise quantization sketch follows this list).
- Heterogeneous Fine-Grained Pipeline: Fine-tuned partitioning and scheduling for MoE+MTP bottlenecks; throughput gains of up to 40%.
- Distributed Training: DeepEP intra-node optimization, fused kernels, robust checkpointing, and enforced cross-platform numerical alignment (the 4C principle).
- Long-context: Sequence length extended to 16K–64K; models demonstrate nearly perfect retrieval in "needle-in-a-haystack" assessments.
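A minimal sketch of block-wise quantization with one scale per block, illustrating the memory-saving idea behind the FP8 bullet; it simulates 8-bit rounding with int8 in plain PyTorch rather than using real FP8 dtypes or kernels, and the block size of 128 is an arbitrary choice.

```python
import math
import torch
import torch.nn.functional as F

def blockwise_quantize(x: torch.Tensor, block_size: int = 128):
    """Split a tensor into fixed-size blocks, keep one scale per block, and round each
    block to an 8-bit grid (int8 simulation; real FP8 training uses hardware dtypes)."""
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    blocks = F.pad(flat, (0, pad)).view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    q = torch.round(blocks / scales).clamp(-127, 127).to(torch.int8)
    return q, scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor, shape: torch.Size) -> torch.Tensor:
    numel = math.prod(shape)
    return (q.float() * scales).flatten()[:numel].view(shape)

w = torch.randn(300, 300)
q, s = blockwise_quantize(w)
print((w - blockwise_dequantize(q, s, w.shape)).abs().max())  # small per-block rounding error
```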
5. Trillion-Parameter Model Results: Ling-1T on the Pareto Frontier
The flagship Ling-1T model achieves a new Pareto frontier of reasoning accuracy versus computational cost. On benchmarks such as AIME-25 (mathematics), LiveCodeBench (code), and ZebraLogic (logic), Ling-1T's curve lies ahead of all comparable models. Empirical validation of scaling laws confirms that Ling-1T (with 51B active parameters per forward pass, 1T total) matches or surpasses dense models requiring 300B+ active parameters, while controlling both accuracy and reasoning depth. Results consistently show smaller variants (Ling-mini-2.0, Ling-flash-2.0) at or above dense model performance for their respective sizes.
6. Implications and Future Directions
Ling 2.0 is architected as an open, reproducible foundation for efficient reasoning and future agentic ("thinking") models. Its theoretical and empirical underpinnings in scaling laws, reasoning-centric design, and robust distributed engineering set reproducible standards for the community and establish a bridge to advanced, autonomously agentic models (the Ring series). The suite democratizes trillion-parameter model development by providing efficiency, accuracy, and validated design roadmaps rather than relying on brute-force dense scaling.
7. Representative Model Table
| Model | Total Parameters | Activated Parameters | Routed Experts | Experts Activated/Token (routed + shared) | Typical Use |
|---|---|---|---|---|---|
| Ling-mini-2.0 | 16B | 1.4B | 256 | 8+1 | Efficient instruct |
| Ling-flash-2.0 | 103B | 6.1B | 256 | 8+1 | Medium-scale reasoning |
| Ling-1T | 1T | 51B | 256 | 8+1 | Trillion-scale general reasoner |
Conclusion
Ling 2.0 establishes, both empirically and theoretically, that highly sparse, reasoning-oriented, Mixture-of-Experts models can be reproducibly scaled to the trillion-parameter regime, setting new baselines for reasoning accuracy and efficiency. The coordinated deployment of architectural sparsity, reasoning-aligned objectives, robust pipeline engineering, and formal scaling laws redefines practical and scientific boundaries for large-scale language modeling and logic-centric AI. The ongoing evolution into deep agentic models positions Ling 2.0 as an industry and research foundation for scalable, efficient, and interpretable artificial general reasoning systems (Ling-Team et al., 25 Oct 2025, Tian et al., 23 Jul 2025, Team et al., 7 Mar 2025, Codefuse et al., 22 Mar 2025).