Ling-1T: Trillion-Param MoE LLM & Ling Adder
- The name Ling-1T covers two unrelated designs: a trillion-parameter MoE language model with a ~3.5% token activation ratio, and a high-speed hardware Ling adder; both prioritize computational efficiency and low latency.
- The MoE LLM employs 256 experts per layer, FP8 quantization, and multi-stage training to attain over 7× active-compute efficiency and state-of-the-art results on math, code, and logic benchmarks.
- The Ling adder uses fewer logic levels and reduced ripple propagation, yielding faster binary addition and exemplifying low-latency digital circuit design.
Ling-1T denotes two distinct lines of high-efficiency systems: (1) Mixture-of-Experts (MoE) LLMs exemplified by Ling-Lite, Ling-Plus, and Ling-1T, and (2) circuit-level binary adders, specifically the high-speed “Ling Adder.” Both branches share a focus on minimizing latency and maximizing computational efficiency through architectural innovation; however, their domains and technical mechanisms are entirely orthogonal: language models on one side, hardware adders on the other.
1. Mixture-of-Experts Ling-1T LLMs: Definition and Architecture
Ling-1T, as introduced in "Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation" (Ling-Team et al., 25 Oct 2025), is a trillion-parameter reasoning LLM built on the Mixture-of-Experts (MoE) paradigm. The core architectural components are:
- Experts per Layer: 256 routed experts per layer; only 8 (plus 1 shared expert) are active per token, yielding a ~3.5% activation ratio.
- Activated Parameters: Of 1 trillion parameters, only 51B are used per inference token, enabling very high compute sparsity.
- Layer Configuration: Initial 4 dense layers for stability and improved routing, followed by MoE layers.
- Attention Block: Grouped-query attention (GQA), partial rotary position embeddings (first 64 dimensions), SwiGLU activations, pre-layer RMSNorm, and QK normalization (QKNorm).
- Tokenization: 156K-entry byte-level BPE (BBPE) vocabulary supporting extensive multilingual coverage.
MoE operation is expressed as:

$$y = E_{\text{shared}}(x) + \sum_{i \in \text{Top-8}} g_i(x)\, E_i(x),$$

where $E_i$ are the routed experts, $E_{\text{shared}}$ is the shared expert, and $g_i(x)$ are the normalized router gate weights over the 256 routed experts (nonzero only for the top-8 selection).
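A minimal NumPy sketch of this routing equation, assuming a softmax router and single-matrix linear experts with toy sizes; Ling-1T's real experts are gated MLPs and its routing details may differ:

```python
import numpy as np

NUM_EXPERTS, TOP_K, D = 256, 8, 16
rng = np.random.default_rng(0)
router_w = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)
experts = rng.standard_normal((NUM_EXPERTS, D, D)) / np.sqrt(D)  # routed E_i
shared = rng.standard_normal((D, D)) / np.sqrt(D)                # shared expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                 # top-8 routed experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                              # renormalized g_i(x)
    y = x @ shared                                    # shared expert, always on
    for g, i in zip(gates, top):
        y = y + g * (x @ experts[i])                  # += g_i(x) * E_i(x)
    return y

print(moe_forward(rng.standard_normal(D)).shape)      # (16,)
```

Only 9 of 257 expert networks run per token, which is the source of the activation ratio quoted above.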
Sibling Ling models (Ling-Lite, Ling-Plus) scale from 16.8B to 290B total parameters, retaining strong efficiency and competitive accuracy through sparse routing and fine-grained expert specialization (Team et al., 7 Mar 2025).
2. Efficiency, Scaling Laws, and Training Pipeline
Sparsity-Driven Efficiency Leverage
The Ling-1T model empirically achieves >7× active-compute efficiency compared to dense architectures. The efficiency leverage (EL) scaling law for MoE is formalized as a function

$$\mathrm{EL} = f(A, G, C),$$

where $A$ is the activation ratio, $G$ is the expert granularity, and $C$ is the compute budget in FLOPs. With Ling-1T's configuration ($A \approx 3.5\%$, $G = 8$), this law predicts the empirically validated efficiency advantage.
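Back-of-envelope arithmetic on the figures quoted above (illustrative only, not the paper's fitted EL law):

```python
# Sanity-check the sparsity figures quoted in this article.
total_params, active_params = 1_000e9, 51e9
print(f"active weight fraction: {active_params / total_params:.1%}")  # 5.1%

# Expert-level activation ratio A: 8 routed experts of 256 gives 3.1%,
# ~3.5% once the always-on shared expert is counted.
print(f"A (routed only): {8/256:.1%}, A (incl. shared): {9/257:.1%}")
```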
Data Mixtures and Reasoning Optimization
- Pre-training Data: Heavy infusion of task-specialized mathematical and code reasoning data (Ling Math, Ling Code datasets), whose share of the corpus grows from 32% to 46% over the course of training.
- Multi-stage Curriculum: General pre-training (20T tokens), then mid-training with extended contexts (up to 128K tokens) and explicit Chain-of-Thought (CoT) samples.
- MTP (Multi-Token Prediction): Auxiliary head/loss for predicting multiple tokens, boosting reasoning accuracy (loss weight 0.1).
- DFT/Evo-CoT: Decoupled fine-tuning (DFT) with dual-mode supervision for instant responses and in-depth reasoning, followed by evolutionary chain-of-thought reinforcement learning (Evo-CoT).
- FP8 Deep Quantization: All activations and gradients kept in FP8 with per-channel scaling statistics, trading an accuracy loss of <0.25% for ~15% higher hardware throughput and reduced memory use.
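A hedged sketch of per-channel FP8 fake-quantization (E4M3-style: 4-bit exponent, 3-bit mantissa, max normal 448). Real FP8 training relies on hardware tensor-core kernels; this simulation only illustrates the rounding error behind the <0.25% accuracy trade:

```python
import numpy as np

FP8_MAX = 448.0  # largest normal E4M3 value

def fake_quant_fp8(x: np.ndarray, axis: int = 0) -> np.ndarray:
    scale = np.abs(x).max(axis=axis, keepdims=True) / FP8_MAX + 1e-30
    scaled = np.clip(x / scale, -FP8_MAX, FP8_MAX)     # map into FP8 range
    exp = np.floor(np.log2(np.abs(scaled) + 1e-12))    # per-value exponent
    step = 2.0 ** (exp - 3)                            # 3-bit mantissa quantum
    return np.round(scaled / step) * step * scale      # snap, then rescale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
print(f"mean abs error: {np.abs(w - fake_quant_fp8(w)).mean():.4f}")
```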
3. System and Infrastructure Co-Design
Distributed Training and Elastic Resource Utilization
- EDiT: Elastic distributed local SGD with time-based synchronization and gradient penalty to eliminate stragglers and anomalies, delivering up to 66.1% speedup over standard synchronous training (Team et al., 7 Mar 2025); a toy sketch of the local-SGD core follows this list.
- Custom File Caching (PCache): All-flash distributed caching, user-space FUSE, distributed checkpoint writing; linear scaling of throughput with accelerator count.
- Cross-Cluster Sync (Babel): Aggressive metadata prefetching and sampled CRC verification, reducing large-scale initialization overheads (e.g., 190M files from >6 hours to ≈10 minutes).
- XPUTimer: Ultra-light runtime tracer for anomaly/bottleneck detection, O(1) memory per step.
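As a toy illustration of the local-SGD idea underlying EDiT (time-based synchronization, gradient penalties, and elasticity all omitted), workers take several local steps and then average parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
workers = [rng.standard_normal(4) for _ in range(4)]  # per-worker parameters

def local_steps(w: np.ndarray, k: int = 5, lr: float = 0.1) -> np.ndarray:
    for _ in range(k):
        w = w - lr * (2 * w)  # gradient of the stand-in loss ||w||^2
    return w

for _ in range(3):  # communication rounds
    workers = [local_steps(w) for w in workers]
    avg = np.mean(workers, axis=0)          # periodic synchronization
    workers = [avg.copy() for _ in workers]

print(avg)  # parameters converge toward the shared optimum (zero)
```

Synchronizing every k steps rather than every step is what removes the per-step straggler bottleneck.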
Knowledge Graph Data for Tool Use
- Synthetic Data: 14 knowledge graph subgraph patterns, first-order logic expansions, >30K instruction templates including real and synthetic API tasks.
- Tool Use Training: Reasoned chaining, multi-hop tool selection, and argument generation in realistic agent scenarios; Ling-Plus achieves benchmark-leading scores on function calling, tool chaining, and external-API benchmarks.
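A hypothetical shape for one synthetic multi-hop function-calling sample of this kind; the field names, tools, and values are illustrative, not Ling's actual schema:

```python
sample = {
    "instruction": "What is the population of the capital of France?",
    "tool_chain": [
        {"tool": "get_capital",    "args": {"country": "France"}, "returns": "Paris"},
        {"tool": "get_population", "args": {"city": "Paris"},     "returns": "2.1M"},
    ],
    "answer": "Paris, the capital of France, has about 2.1 million residents.",
}

# A trainer would serialize such samples into chat-formatted trajectories.
print(len(sample["tool_chain"]), "hops")
```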
4. Empirical Results and Benchmarks
Ling-1T attains state-of-the-art accuracy per FLOP across math, code, logic, and knowledge benchmarks:
| Model | MATH (%) | HumanEval (%) | ToolBench | Long-Context Retrieval (max window) |
|---|---|---|---|---|
| Ling-Lite | 73 | 83 | Best/near-best | 64K tokens |
| Ling-Plus | 79 | Best | Best | 64K tokens |
| Ling-1T | SOTA | SOTA | SOTA | 128K tokens |
- Reasoning: Leaderboard-competitive results on MATH, CollegeMath, MinervaMath, HumanEval, MultiPL-E, OptiBench, AIME24/25, and ARC-e/c.
- Efficiency: Matches or exceeds dense models that would require full 1T-parameter computation, while activating only ~51B parameters per token.
- Safety: Balanced helpfulness/harmlessness; competitive scores in refusal and safety tuning.
5. Hardware and Cost Implications
- Commodity Accelerators: Designed and validated for training on accelerators in the 120–370 TFLOPS, 64–96 GB memory class rather than premium H100/H800 GPUs, democratizing trillion-parameter scaling.
- Cost Savings: Training Ling-Plus (290B) on lower-spec hardware reduces costs by ≈20% (e.g., from 6.35M RMB to 5.08M RMB per 1T tokens).
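The quoted figures check out with simple arithmetic:

```python
# Sanity check of the quoted training-cost figures (pure arithmetic).
baseline_rmb, reduced_rmb = 6.35e6, 5.08e6  # cost per 1T training tokens
print(f"savings: {1 - reduced_rmb / baseline_rmb:.1%}")  # 20.0%
```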
6. The Ling-1T Adder: Hardware Perspective
In computer arithmetic, Ling-1T also refers to a hardware architecture for binary addition—specifically, the high-speed binary Ling adder (Gupta, 2019). Principal features:
- Carry Computation: Uses generate and transmit signals over adjacent bits, $g_i = a_i b_i$ and $t_i = a_i + b_i$, with the Ling pseudo-carry $H_i = g_i + t_{i-1} H_{i-1}$; the true carry is recovered as $c_i = t_i H_i$. The Ling adder also introduces “half-sum” bits ($p_i = a_i \oplus b_i$) for logic minimization (see the sketch after this list).
- Logic Depth: Reduces logic levels for 4-bit addition (Ling: 4; CLA: 5), improving propagation speed.
- Ripple Reduction: Minimizes dependency on previous carries, thus lowering cumulative delay for wide adders.
- Complexity: Circuit complexity grows with operand width $n$; practical for moderate bit widths, less favorable for very wide VLSI adders where other designs may dominate.
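A minimal Python reference model of a 4-bit Ling adder following the recurrences above; it checks functional correctness only and says nothing about gate-level delay:

```python
def ling_add(a: int, b: int, width: int = 4) -> int:
    abits = [(a >> i) & 1 for i in range(width)]
    bbits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(abits, bbits)]      # generate g_i = a_i b_i
    t = [x | y for x, y in zip(abits, bbits)]      # transmit t_i = a_i + b_i
    p = [x ^ y for x, y in zip(abits, bbits)]      # half-sum p_i = a_i XOR b_i
    H, c = [], []
    for i in range(width):
        prev = (t[i - 1] & H[i - 1]) if i > 0 else 0   # t_{i-1} H_{i-1}
        H.append(g[i] | prev)                          # Ling pseudo-carry H_i
        c.append(t[i] & H[i])                          # true carry c_i = t_i H_i
    s = [p[i] ^ (c[i - 1] if i > 0 else 0) for i in range(width)]
    return sum(bit << i for i, bit in enumerate(s)) | (c[-1] << width)

# Exhaustive check over all 4-bit operand pairs.
assert all(ling_add(a, b) == a + b for a in range(16) for b in range(16))
```

The pseudo-carry recurrence is simpler than the conventional $c_i = g_i + t_i c_{i-1}$ because $H_i$ depends only on $g_i$ and stage $i-1$ signals, which is where the one-level logic saving comes from.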
7. Summary Table: Ling-1T Key Specifications
| Attribute | Value |
|---|---|
| Total Params (LLM) | 1T |
| Active Params per Token | 51B |
| MoE Experts per Layer | 256 routed (8 active + 1 shared) |
| Efficiency Leverage | >7× dense |
| Precision | FP8 end-to-end |
| Context Window (max) | 128K tokens |
| Hardware Cost Advantage | ≈20% savings on sub-premium accelerators |
| Benchmark Leadership | Math, Code, Logic, Tool Use |
| Ling Adder Logic Levels | 4 (vs. 5 for CLA) |
8. Impact, Implications, and Future Directions
Ling-1T (LLM) establishes a new Pareto frontier, representing the highest known reasoning accuracy per computational cost at the trillion-parameter scale. The open foundation, co-designed scalable architecture, and tooling for democratized training underpin a robust blueprint for next-generation “thinking” AI (e.g., future Ring series).
The Ling adder, by contrast, contributes to hardware arithmetic design, particularly in contexts demanding minimized addition latency and where gate-count increases can be accepted for speed.
Both lines demonstrate scalable efficiency gains through architectural, routing, and data innovation, whether in digital logic or in LLM design. The Ling-1T family thus serves as a reproducible base for high-efficiency language reasoning, and the Ling adder as a reference design for high-speed hardware addition.