
Ling-1T: Trillion-Param MoE LLM & Ling Adder

Updated 29 October 2025
  • Ling-1T refers to two distinct efforts: a trillion-parameter MoE language model with a ~3.5% per-token expert activation ratio, and a high-speed hardware Ling adder; both target low latency and high computational efficiency.
  • The MoE LLM uses 256 experts per layer, FP8 quantization, and multi-stage training techniques to attain over 7× active-compute efficiency and state-of-the-art results on math, code, and logic benchmarks.
  • The Ling adder relies on fewer logic levels and reduced ripple propagation, yielding faster binary addition and illustrating hardware design for low-latency digital circuits.

Ling-1T denotes two distinct lines of high-efficiency, large-scale AI systems: (1) Mixture-of-Experts (MoE) LLMs exemplified by Ling-Lite, Ling-Plus, and Ling-1T, and (2) circuit-level binary adders, specifically the high-speed “Ling Adder.” Both branches share a focus on minimizing latency and maximizing computational efficiency through architectural innovation; however, their domains and technical mechanisms are entirely orthogonal—LLMs vs. hardware adders.

1. Mixture-of-Experts Ling-1T LLMs: Definition and Architecture

Ling-1T, as introduced in "Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation" (Ling-Team et al., 25 Oct 2025), is a trillion-parameter reasoning LLM built on the Mixture-of-Experts (MoE) paradigm. The core architectural components are:

  • Experts per Layer: 256 routed experts per layer; only 8 (plus 1 shared expert) are active per token, yielding a ~3.5% activation ratio.
  • Activated Parameters: Of 1 trillion parameters, only 51B are used per inference token, enabling very high compute sparsity.
  • Layer Configuration: Initial 4 dense layers for stability and improved routing, followed by MoE layers.
  • Attention Block: Grouped-query attention (GQA), partial rotary position embeddings (first 64 dims), SwiGLU activations, pre-layer RMSNorm, QKNorm normalizations.
  • Tokenization: 156K BBPE byte-level vocabulary supporting extensive multilingual alignment.
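
A compact way to see how these pieces fit together is a configuration object. The sketch below simply collects the figures stated above; field names are illustrative and are not those of the released code.

```python
from dataclasses import dataclass

@dataclass
class LingMoEConfig:
    """Illustrative hyperparameter summary of Ling-1T (values from the list above)."""
    total_params: int = 1_000_000_000_000   # 1T parameters in total
    active_params: int = 51_000_000_000     # ~51B activated per token
    n_routed_experts: int = 256             # routed experts per MoE layer
    n_active_experts: int = 8               # routed experts selected per token
    n_shared_experts: int = 1               # always-on shared expert
    n_dense_layers: int = 4                 # initial dense layers before MoE blocks
    rotary_dims: int = 64                   # partial RoPE on the first 64 dimensions
    vocab_size: int = 156_000               # BBPE byte-level vocabulary
    max_context: int = 131_072              # 128K-token context after mid-training

cfg = LingMoEConfig()
ratio = (cfg.n_active_experts + cfg.n_shared_experts) / cfg.n_routed_experts
print(f"per-token expert activation ratio ≈ {ratio:.1%}")  # ≈ 3.5%, matching the stated ratio
```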

MoE operation is expressed as:

$$\mathbf{p}_t = \mathrm{Softmax}(\mathrm{R}(\mathbf{h}_t)), \qquad \mathbf{o}_t = \sum_{i \in \mathrm{Topk}(\mathbf{p}_t)} \mathbf{p}_{t,i}\,\mathrm{E}_i(\mathbf{h}_t)$$
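
For concreteness, a minimal NumPy sketch of this top-k routing follows; the toy dimensions and random linear experts are assumptions for illustration, not the released implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def moe_forward(h, router_w, experts, k=8):
    """Route a single token h through the top-k experts, per the formula above.

    h:        (d,) hidden state for one token
    router_w: (n_experts, d) router projection R
    experts:  list of callables, each mapping (d,) -> (d,)
    """
    p = softmax(router_w @ h)               # p_t = Softmax(R(h_t))
    topk = np.argsort(p)[-k:]               # Topk(p_t)
    # o_t = sum over selected experts of p_{t,i} * E_i(h_t)
    return sum(p[i] * experts[i](h) for i in topk)

# Toy usage: 256 random linear "experts", 8 active per token.
d, n_experts = 16, 256
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
out = moe_forward(rng.normal(size=d), router_w, experts, k=8)
print(out.shape)  # (16,)
```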

Ling models (Ling-Lite, Ling-Plus) scale from 16.8B to 290B parameters, retaining strong efficiency and competitive accuracy through sparse routing and fine-grained expert specialization (Team et al., 7 Mar 2025).

2. Efficiency, Scaling Laws, and Training Pipeline

Sparsity-Driven Efficiency Leverage

The Ling-1T model empirically achieves >7× active-compute efficiency compared to dense architectures. The efficiency leverage (EL) scaling law for MoE is formalized as:

$$\mathrm{EL}(A, G, C) = \hat{A}^{\,\alpha + \gamma(\log G)^2 + \beta \log G}$$

where A is the activation ratio, G is the expert granularity, and C is the compute budget in FLOPs. With Ling-1T’s settings (A ≈ 3.5%, G = 8), this law predicts the empirically validated efficiency advantage.
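
As a worked illustration of the scaling law, the small function below evaluates the exponent form directly; the coefficients are placeholders, not the fitted values from the paper, and in the paper the fit itself depends on the compute budget C.

```python
import math

def efficiency_leverage(A, G, alpha, beta, gamma):
    """EL(A, G) = A ** (alpha + gamma*log(G)**2 + beta*log(G)).

    A: activation ratio (e.g. 0.035 for Ling-1T); G: expert granularity.
    alpha, beta, gamma are fitted scaling-law coefficients; the values used
    below are illustrative placeholders only.
    """
    logG = math.log(G)
    return A ** (alpha + gamma * logG ** 2 + beta * logG)

# With a negative overall exponent, a small activation ratio A yields EL > 1,
# i.e. a leverage over dense compute (made-up coefficients, for intuition only).
print(efficiency_leverage(A=0.035, G=8, alpha=-0.6, beta=0.02, gamma=0.01))
```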

Data Mixtures and Reasoning Optimization

  • Pre-training Data: Heavy infusion of task-specialized mathematical and code reasoning data (Ling Math, Ling Code datasets), which increase from 32% to 46% of the corpus during training.
  • Multi-stage Curriculum: General pre-training (20T tokens), then mid-training with extended contexts (up to 128K tokens) and explicit Chain-of-Thought (CoT) samples.
  • MTP (Multi-Token Prediction): Auxiliary head/loss for predicting multiple tokens, boosting reasoning accuracy (loss weight 0.1).
  • DFT/Evo-CoT: Decoupled fine-tuning and evolutionary RL (Evo-CoT), dual-mode supervised fine-tuning for instant responses and in-depth reasoning.
  • FP8 Deep Quantization: All activations and gradients in FP8 (per-channel statistics), trading a <0.25% loss in accuracy for a 15% gain in hardware throughput and reduced memory.
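
The per-channel scheme can be pictured with a crude simulation: scale each channel by its absolute maximum into the E4M3 range, then round to 3 mantissa bits. The sketch below is illustrative only; NumPy has no FP8 dtype, the rounding ignores subnormals and exact IEEE behaviour, and it is not the training system's actual FP8 kernels.

```python
import numpy as np

def fp8_e4m3_round(x):
    """Crude simulation of FP8 E4M3 rounding: 3 mantissa bits, clamp to ±448."""
    x = np.clip(x, -448.0, 448.0)
    mag = np.abs(x)
    out = np.zeros_like(x)
    nz = mag > 0
    exp = np.floor(np.log2(mag[nz]))
    step = 2.0 ** (exp - 3)                  # spacing for 3 explicit mantissa bits
    out[nz] = np.sign(x[nz]) * np.round(mag[nz] / step) * step
    return out

def quantize_per_channel(w):
    """Row-wise absmax scaling into the E4M3 range, standing in for per-channel statistics."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 448.0
    q = fp8_e4m3_round(w / scale)
    return q, scale                          # dequantize with q * scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, s = quantize_per_channel(w)
print(np.abs(w - q * s).max())               # small per-channel quantization error
```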

3. System and Infrastructure Co-Design

Distributed Training and Elastic Resource Utilization

  • EDiT: Elastic distributed local SGD with time-based synchronization and gradient penalty to eliminate stragglers and anomalies, delivering up to 66.1% speedup over standard synchronous training (Team et al., 7 Mar 2025).
  • Custom File Caching (PCache): All-flash distributed caching, user-space FUSE, distributed checkpoint writing; linear scaling of throughput with accelerator count.
  • Cross-Cluster Sync (Babel): Aggressive metadata prefetching and sampled CRC verification, reducing large-scale initialization overheads (e.g., 190M files from >6 hours to ≈10 minutes).
  • XPUTimer: Ultra-light runtime tracer for anomaly/bottleneck detection, O(1) memory per step.

Knowledge Graph Data for Tool Use

  • Synthetic Data: 14 knowledge graph subgraph patterns, first-order logic expansions, >30K instruction templates including real and synthetic API tasks.
  • Tool Use Training: Reasoned chaining, multi-hop tool selection, argument generation in realistic agent scenarios; Ling-Plus achieves benchmark-leading scores on function-calling, chaining, and external API benchmarks.

4. Empirical Results and Benchmarks

Ling-1T attains state-of-the-art accuracy per FLOP across math, code, logic, and knowledge benchmarks:

| Model | MATH | HumanEval | ToolBench | Long-Context Retrieval |
|---|---|---|---|---|
| Ling-Lite | 73 | 83 | Best/Near | 64K tokens |
| Ling-Plus | 79 | Best | Best | 64K tokens |
| Ling-1T | SOTA | SOTA | SOTA | 128K tokens |

  • Reasoning: Leaderboards on MATH, CollegeMath, MinervaMath, HumanEval, MultiPL-E, OptiBench, AIME24/25, ARC-e/c.
  • Efficiency: Matches or exceeds models with equivalent activated FLOPs that require dense 1T-parameter computation.
  • Safety: Balanced helpfulness/harmlessness; competitive scores in refusal and safety tuning.

5. Hardware and Cost Implications

  • Commodity Accelerators: Designed and validated for training on lower-spec accelerators (≈120–370 TFLOPS, 64–96 GB memory) rather than premium H100/H800 GPUs, democratizing trillion-parameter scaling.
  • Cost Savings: Training Ling-Plus (290B) on lower-spec hardware reduces costs by ≈20% (e.g., from 6.35M RMB to 5.08M RMB per 1T tokens).

6. The Ling-1T Adder: Hardware Perspective

In computer arithmetic, Ling-1T also refers to a hardware architecture for binary addition—specifically, the high-speed binary Ling adder (Gupta, 2019). Principal features:

  • Carry Computation: Employs per-bit signals from adjacent bits, g_i = a_i·b_i, p_i = a_i+b_i, d_i = a_i⊕b_i; the Ling adder introduces “half-sum” (pseudo-carry) bits for logic minimization (a bit-level sketch follows this list).
  • Logic Depth: Reduces logic levels for 4-bit addition (Ling: 4; CLA: 5), improving propagation speed.
  • Ripple Reduction: Minimizes dependency on previous carries, thus lowering cumulative delay for wide adders.
  • Complexity: Circuit complexity grows for large n; practical for moderate-bit-width, less favorable for very high-bit VLSI where other adders may dominate.
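
A bit-level simulation of the Ling recurrence helps make the carry computation concrete: the pseudo-carry is H_i = g_i ∨ (p_{i-1} ∧ H_{i-1}), the true carry is recovered as c_i = p_i ∧ H_i, and the sum bit is s_i = d_i ⊕ c_{i-1}. The sketch below is plain Python rather than a gate netlist, using the signal definitions from the list above.

```python
def ling_add(a, b, width=8):
    """Bit-level simulation of Ling-style addition (illustrative; not a gate netlist)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate: a_i AND b_i
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(width)]  # propagate: a_i OR b_i
    d = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]  # half-sum: a_i XOR b_i

    H, c, s = [0] * width, [0] * width, [0] * width
    for i in range(width):
        H[i] = g[i] | ((p[i - 1] & H[i - 1]) if i > 0 else 0)    # Ling pseudo-carry
        c[i] = p[i] & H[i]                                       # true carry
        s[i] = d[i] ^ (c[i - 1] if i > 0 else 0)                 # sum bit

    return sum(bit << i for i, bit in enumerate(s)) | (c[width - 1] << width)

# Exhaustive check against ordinary integer addition for 6-bit operands.
assert all(ling_add(a, b) == a + b for a in range(64) for b in range(64))
```

Note how the first pseudo-carry terms simplify: H_1 = g_1 + p_0·g_0 = g_1 + g_0, since g_0 implies p_0. This literal-count saving is what lets the Ling formulation shave a logic level relative to a conventional CLA.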

7. Summary Table: Ling-1T Key Specifications

| Attribute | Value |
|---|---|
| Total Params (LLM) | 1T |
| Active Params per Forward Pass | 51B |
| MoE Experts per Layer | 256 routed (8 active + 1 shared per token) |
| Efficiency Leverage | >7× vs. dense |
| Precision | FP8 end-to-end |
| Context Window (max) | 128K tokens |
| Hardware Cost Advantage | ≈20% savings (sub-premium accelerators) |
| Benchmark Leadership | Math, Code, Logic, Tool Use |
| Ling Adder Logic Levels | 4 (vs. 5 for CLA) |

8. Impact, Implications, and Future Directions

Ling-1T (LLM) establishes a new Pareto frontier, representing the highest known reasoning accuracy per computational cost at the trillion-parameter scale. The open foundation, co-designed scalable architecture, and tooling for democratized training underpin a robust blueprint for next-generation “thinking” AI (e.g., future Ring series).

The Ling adder, by contrast, contributes to hardware arithmetic design, particularly in contexts demanding minimized addition latency and where gate-count increases can be accepted for speed.

Both lines demonstrate scalable efficiency gains through architectural, routing, and data innovation, whether in LLM design or digital logic. The Ling-1T family serves as a reproducible base for high-efficiency language reasoning and for high-speed hardware addition circuits.
