Ling-1T: Trillion-Param MoE LLM & Ling Adder
- The name Ling-1T covers two unrelated designs: a trillion-parameter MoE language model with a ~3.5% token activation ratio, and a high-speed hardware Ling adder; both prioritize computational efficiency and low latency.
- The MoE LLM employs 256 experts per layer, FP8 quantization, and multi-stage training to attain over 7× active-compute efficiency and state-of-the-art results on math, code, and logic benchmarks.
- The Ling adder uses fewer logic levels and reduced ripple propagation, yielding faster binary addition and exemplifying low-latency digital circuit design.
Ling-1T denotes two distinct lines of high-efficiency systems: (1) Mixture-of-Experts (MoE) LLMs exemplified by Ling-Lite, Ling-Plus, and Ling-1T, and (2) circuit-level binary adders, specifically the high-speed “Ling Adder.” Both branches share a focus on minimizing latency and maximizing computational efficiency through architectural innovation; however, their domains and technical mechanisms are entirely orthogonal: language models on one side, hardware adders on the other.
1. Mixture-of-Experts Ling-1T LLMs: Definition and Architecture
Ling-1T, as introduced in "Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation" (Ling-Team et al., 25 Oct 2025), is a trillion-parameter reasoning LLM built on the Mixture-of-Experts (MoE) paradigm. The core architectural components are:
- Experts per Layer: 256 routed experts per layer; only 8 (plus 1 shared expert) are active per token, yielding a ~3.5% activation ratio.
- Activated Parameters: Of 1 trillion parameters, only 51B are used per inference token, enabling very high compute sparsity.
- Layer Configuration: Initial 4 dense layers for stability and improved routing, followed by MoE layers.
- Attention Block: Grouped-query attention (GQA), partial rotary position embeddings (first 64 dimensions), SwiGLU activations, pre-layer RMSNorm, and QK normalization (QKNorm).
- Tokenization: 156K-entry byte-level BPE (BBPE) vocabulary supporting extensive multilingual coverage.
MoE operation is expressed as:

$$y = E_{\text{shared}}(x) + \sum_{i \in \text{Top-8}} g_i(x)\, E_i(x),$$

where $E_i$ are the routed experts, $E_{\text{shared}}$ is the shared expert, and $g_i(x)$ are the normalized router gate weights over the 256 routed experts (nonzero only for the top-8 selection).
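A minimal NumPy sketch of this routing equation, assuming a softmax router and single-matrix linear experts with toy sizes; Ling-1T's real experts are gated MLPs and its routing details may differ:

```python
import numpy as np

NUM_EXPERTS, TOP_K, D = 256, 8, 16
rng = np.random.default_rng(0)
router_w = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)
experts = rng.standard_normal((NUM_EXPERTS, D, D)) / np.sqrt(D)  # routed E_i
shared = rng.standard_normal((D, D)) / np.sqrt(D)                # shared expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                 # top-8 routed experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                              # renormalized g_i(x)
    y = x @ shared                                    # shared expert, always on
    for g, i in zip(gates, top):
        y = y + g * (x @ experts[i])                  # += g_i(x) * E_i(x)
    return y

print(moe_forward(rng.standard_normal(D)).shape)      # (16,)
```

Only 9 of 257 expert networks run per token, which is the source of the activation ratio quoted above.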
Sibling Ling models (Ling-Lite, Ling-Plus) scale from 16.8B to 290B total parameters, retaining strong efficiency and competitive accuracy through sparse routing and fine-grained expert specialization (Team et al., 7 Mar 2025).
2. Efficiency, Scaling Laws, and Training Pipeline
Sparsity-Driven Efficiency Leverage
The Ling-1T model empirically achieves >7× active-compute efficiency compared to dense architectures. The efficiency leverage (EL) scaling law for MoE is formalized as a function

$$\mathrm{EL} = f(A, G, C),$$

where $A$ is the activation ratio, $G$ is the expert granularity, and $C$ is the compute budget in FLOPs. With Ling-1T's configuration ($A \approx 3.5\%$, $G = 8$), this law predicts the empirically validated efficiency advantage.
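Back-of-envelope arithmetic on the figures quoted above (illustrative only, not the paper's fitted EL law):

```python
# Sanity-check the sparsity figures quoted in this article.
total_params, active_params = 1_000e9, 51e9
print(f"active weight fraction: {active_params / total_params:.1%}")  # 5.1%

# Expert-level activation ratio A: 8 routed experts of 256 gives 3.1%,
# ~3.5% once the always-on shared expert is counted.
print(f"A (routed only): {8/256:.1%}, A (incl. shared): {9/257:.1%}")
```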
Data Mixtures and Reasoning Optimization
- Pre-training Data: Heavy infusion of task-specialized mathematical and code reasoning data (Ling Math, Ling Code datasets), whose share of the corpus grows from 32% to 46% over the course of training.
- Multi-stage Curriculum: General pre-training (20T tokens), then mid-training with extended contexts (up to 128K tokens) and explicit Chain-of-Thought (CoT) samples.
- MTP (Multi-Token Prediction): Auxiliary head/loss for predicting multiple tokens, boosting reasoning accuracy (loss weight 0.1).
- DFT/Evo-CoT: Decoupled fine-tuning (DFT) with dual-mode supervision for instant responses and in-depth reasoning, followed by evolutionary chain-of-thought reinforcement learning (Evo-CoT).
- FP8 Deep Quantization: All activations and gradients kept in FP8 with per-channel scaling statistics, trading an accuracy loss of <0.25% for ~15% higher hardware throughput and reduced memory use.
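A hedged sketch of per-channel FP8 fake-quantization (E4M3-style: 4-bit exponent, 3-bit mantissa, max normal 448). Real FP8 training relies on hardware tensor-core kernels; this simulation only illustrates the rounding error behind the <0.25% accuracy trade:

```python
import numpy as np

FP8_MAX = 448.0  # largest normal E4M3 value

def fake_quant_fp8(x: np.ndarray, axis: int = 0) -> np.ndarray:
    scale = np.abs(x).max(axis=axis, keepdims=True) / FP8_MAX + 1e-30
    scaled = np.clip(x / scale, -FP8_MAX, FP8_MAX)     # map into FP8 range
    exp = np.floor(np.log2(np.abs(scaled) + 1e-12))    # per-value exponent
    step = 2.0 ** (exp - 3)                            # 3-bit mantissa quantum
    return np.round(scaled / step) * step * scale      # snap, then rescale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
print(f"mean abs error: {np.abs(w - fake_quant_fp8(w)).mean():.4f}")
```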
3. System and Infrastructure Co-Design
Distributed Training and Elastic Resource Utilization
- EDiT: Elastic distributed local SGD with time-based synchronization and gradient penalty to eliminate stragglers and anomalies, delivering up to 66.1% speedup over standard synchronous training (Team et al., 7 Mar 2025); a toy sketch of the local-SGD core follows this list.
- Custom File Caching (PCache): All-flash distributed caching, user-space FUSE, distributed checkpoint writing; linear scaling of throughput with accelerator count.
- Cross-Cluster Sync (Babel): Aggressive metadata prefetching and sampled CRC verification, reducing large-scale initialization overheads (e.g., 190M files from >6 hours to ≈10 minutes).
- XPUTimer: Ultra-light runtime tracer for anomaly/bottleneck detection, O(1) memory per step.
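As a toy illustration of the local-SGD idea underlying EDiT (time-based synchronization, gradient penalties, and elasticity all omitted), workers take several local steps and then average parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
workers = [rng.standard_normal(4) for _ in range(4)]  # per-worker parameters

def local_steps(w: np.ndarray, k: int = 5, lr: float = 0.1) -> np.ndarray:
    for _ in range(k):
        w = w - lr * (2 * w)  # gradient of the stand-in loss ||w||^2
    return w

for _ in range(3):  # communication rounds
    workers = [local_steps(w) for w in workers]
    avg = np.mean(workers, axis=0)          # periodic synchronization
    workers = [avg.copy() for _ in workers]

print(avg)  # parameters converge toward the shared optimum (zero)
```

Synchronizing every k steps rather than every step is what removes the per-step straggler bottleneck.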
Knowledge Graph Data for Tool Use
- Synthetic Data: 14 knowledge graph subgraph patterns, first-order logic expansions, >30K instruction templates including real and synthetic API tasks.
- Tool Use Training: Reasoned chaining, multi-hop tool selection, and argument generation in realistic agent scenarios; Ling-Plus achieves benchmark-leading scores on function calling, tool chaining, and external-API benchmarks.
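A hypothetical shape for one synthetic multi-hop function-calling sample of this kind; the field names, tools, and values are illustrative, not Ling's actual schema:

```python
sample = {
    "instruction": "What is the population of the capital of France?",
    "tool_chain": [
        {"tool": "get_capital",    "args": {"country": "France"}, "returns": "Paris"},
        {"tool": "get_population", "args": {"city": "Paris"},     "returns": "2.1M"},
    ],
    "answer": "Paris, the capital of France, has about 2.1 million residents.",
}

# A trainer would serialize such samples into chat-formatted trajectories.
print(len(sample["tool_chain"]), "hops")
```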
4. Empirical Results and Benchmarks
Ling-1T attains state-of-the-art accuracy per FLOP across math, code, logic, and knowledge benchmarks:
| Model | MATH (%) | HumanEval (%) | ToolBench | Long-Context Retrieval (max window) |
|---|---|---|---|---|
| Ling-Lite | 73 | 83 | Best/near-best | 64K tokens |
| Ling-Plus | 79 | Best | Best | 64K tokens |
| Ling-1T | SOTA | SOTA | SOTA | 128K tokens |
- Reasoning: Leaderboard-competitive results on MATH, CollegeMath, MinervaMath, HumanEval, MultiPL-E, OptiBench, AIME24/25, and ARC-e/c.
- Efficiency: Matches or exceeds dense models that would require full 1T-parameter computation, while activating only ~51B parameters per token.
- Safety: Balanced helpfulness/harmlessness; competitive scores in refusal and safety tuning.
5. Hardware and Cost Implications
- Commodity Accelerators: Designed and validated for training on accelerators in the 120–370 TFLOPS, 64–96 GB memory class rather than premium H100/H800 GPUs, democratizing trillion-parameter scaling.
- Cost Savings: Training Ling-Plus (290B) on lower-spec hardware reduces costs by ≈20% (e.g., from 6.35M RMB to 5.08M RMB per 1T tokens).
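The quoted figures check out with simple arithmetic:

```python
# Sanity check of the quoted training-cost figures (pure arithmetic).
baseline_rmb, reduced_rmb = 6.35e6, 5.08e6  # cost per 1T training tokens
print(f"savings: {1 - reduced_rmb / baseline_rmb:.1%}")  # 20.0%
```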
6. The Ling-1T Adder: Hardware Perspective
In computer arithmetic, Ling-1T also refers to a hardware architecture for binary addition—specifically, the high-speed binary Ling adder (Gupta, 2019). Principal features:
- Carry Computation: Uses generate and transmit signals over adjacent bits, $g_i = a_i b_i$ and $t_i = a_i + b_i$, with the Ling pseudo-carry $H_i = g_i + t_{i-1} H_{i-1}$; the true carry is recovered as $c_i = t_i H_i$. The Ling adder also introduces “half-sum” bits ($p_i = a_i \oplus b_i$) for logic minimization (see the sketch after this list).
- Logic Depth: Reduces logic levels for 4-bit addition (Ling: 4; CLA: 5), improving propagation speed.
- Ripple Reduction: Minimizes dependency on previous carries, thus lowering cumulative delay for wide adders.
- Complexity: Circuit complexity grows with operand width $n$; practical for moderate bit widths, less favorable for very wide VLSI adders where other designs may dominate.
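A minimal Python reference model of a 4-bit Ling adder following the recurrences above; it checks functional correctness only and says nothing about gate-level delay:

```python
def ling_add(a: int, b: int, width: int = 4) -> int:
    abits = [(a >> i) & 1 for i in range(width)]
    bbits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(abits, bbits)]      # generate g_i = a_i b_i
    t = [x | y for x, y in zip(abits, bbits)]      # transmit t_i = a_i + b_i
    p = [x ^ y for x, y in zip(abits, bbits)]      # half-sum p_i = a_i XOR b_i
    H, c = [], []
    for i in range(width):
        prev = (t[i - 1] & H[i - 1]) if i > 0 else 0   # t_{i-1} H_{i-1}
        H.append(g[i] | prev)                          # Ling pseudo-carry H_i
        c.append(t[i] & H[i])                          # true carry c_i = t_i H_i
    s = [p[i] ^ (c[i - 1] if i > 0 else 0) for i in range(width)]
    return sum(bit << i for i, bit in enumerate(s)) | (c[-1] << width)

# Exhaustive check over all 4-bit operand pairs.
assert all(ling_add(a, b) == a + b for a in range(16) for b in range(16))
```

The pseudo-carry recurrence is simpler than the conventional $c_i = g_i + t_i c_{i-1}$ because $H_i$ depends only on $g_i$ and stage $i-1$ signals, which is where the one-level logic saving comes from.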
7. Summary Table: Ling-1T Key Specifications
| Attribute | Value |
|---|---|
| Total Params (LLM) | 1T |
| Active Params per Token | 51B |
| MoE Experts per Layer | 256 routed (8 active + 1 shared) |
| Efficiency Leverage | >7× dense |
| Precision | FP8 end-to-end |
| Context Window (max) | 128K tokens |
| Hardware Cost Advantage | ≈20% savings on sub-premium accelerators |
| Benchmark Leadership | Math, Code, Logic, Tool Use |
| Ling Adder Logic Levels | 4 (vs. 5 for CLA) |
8. Impact, Implications, and Future Directions
Ling-1T (LLM) establishes a new Pareto frontier, representing the highest known reasoning accuracy per computational cost at the trillion-parameter scale. The open foundation, co-designed scalable architecture, and tooling for democratized training underpin a robust blueprint for next-generation “thinking” AI (e.g., future Ring series).
The Ling adder, by contrast, contributes to hardware arithmetic design, particularly in contexts demanding minimized addition latency and where gate-count increases can be accepted for speed.
Both lines demonstrate scalable efficiency gains through architectural, routing, and data innovation, whether in digital logic or in LLM design. The Ling-1T family thus serves as a reproducible base for high-efficiency language reasoning, and the Ling adder as a reference design for high-speed hardware addition.