Ling-1T: Trillion-Param MoE LLM & Ling Adder

Updated 29 October 2025
  • Ling-1T covers two distinct artifacts: a trillion-parameter MoE language model with a ~3.5% per-token activation rate, and a high-speed hardware Ling adder, both aimed at minimizing compute cost and latency.
  • The MoE LLM employs 256 routed experts per layer, FP8 quantization, and multi-stage training to attain over 7× active-compute efficiency and state-of-the-art results on math, code, and logic benchmarks.
  • The Ling adder uses fewer logic levels and reduced carry-ripple dependence to achieve faster binary addition, illustrating hardware design for low-latency digital circuits.

Ling-1T denotes two distinct lines of high-efficiency, large-scale system design: (1) Mixture-of-Experts (MoE) LLMs exemplified by Ling-Lite, Ling-Plus, and Ling-1T, and (2) circuit-level binary adders, specifically the high-speed "Ling adder." Both branches focus on minimizing latency and maximizing computational efficiency through architectural innovation, but their domains and technical mechanisms are entirely separate: language models on one side, hardware arithmetic on the other.

1. Mixture-of-Experts Ling-1T LLMs: Definition and Architecture

Ling-1T, as introduced in "Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation" (Ling-Team et al., 25 Oct 2025), is a trillion-parameter reasoning LLM built on the Mixture-of-Experts (MoE) paradigm. The core architectural components are:

  • Experts per Layer: 256 routed experts per layer; only 8 (plus 1 shared expert) are active per token, yielding a ~3.5% activation ratio.
  • Activated Parameters: Of 1 trillion parameters, only 51B are used per inference token, enabling very high compute sparsity.
  • Layer Configuration: Initial 4 dense layers for stability and improved routing, followed by MoE layers.
  • Attention Block: Grouped-query attention (GQA), partial rotary position embeddings (applied to the first 64 dimensions), SwiGLU activations, pre-layer RMSNorm, and QK normalization; a minimal GQA sketch follows this list.
  • Tokenization: 156K BBPE byte-level vocabulary supporting extensive multilingual alignment.
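
To illustrate the grouped-query attention mentioned in the attention bullet, here is a minimal NumPy sketch in which several query heads share each key/value head. It is illustrative only: it omits Ling-1T specifics such as partial RoPE, QK normalization, and causal masking, and the weight shapes and head counts are assumptions of this sketch.

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, Wo, n_q_heads=8, n_kv_heads=2):
    """Minimal grouped-query attention: several query heads share one KV head.
    Omits RoPE, QK normalization, and causal masking. Assumed shapes:
    x (T, d_model), Wq/Wo (d_model, d_model), Wk/Wv (d_model, n_kv_heads*d_head)."""
    T, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads            # query heads per shared KV head

    q = (x @ Wq).reshape(T, n_q_heads, d_head)
    k = (x @ Wk).reshape(T, n_kv_heads, d_head)
    v = (x @ Wv).reshape(T, n_kv_heads, d_head)

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # map query head -> its KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        out[:, h] = weights @ v[:, kv]
    return out.reshape(T, d_model) @ Wo
```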

MoE operation is expressed as:

$$\mathbf{p}_t = \mathrm{Softmax}(\mathrm{R}(\mathbf{h}_t)), \qquad \mathbf{o}_t = \sum_{i \in \mathrm{Topk}(\mathbf{p}_t)} \mathbf{p}_{t,i}\,\mathrm{E}_i(\mathbf{h}_t)$$
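
A minimal NumPy sketch of this routing rule for a single token follows: the router produces a softmax distribution over the routed experts, the top-k experts are evaluated and combined with their routing weights, and an optional always-on shared expert is added, as in Ling-1T. The `router_W` matrix, `experts` list, and `shared_expert` callable are placeholders of this sketch, not the model's actual interfaces.

```python
import numpy as np

def moe_forward(h, router_W, experts, k=8, shared_expert=None):
    """Sparse MoE forward for one token hidden state h, following
    p_t = Softmax(R(h_t)) and o_t = sum_{i in Topk(p_t)} p_{t,i} * E_i(h_t).
    `router_W` (num_experts x d_model) and `experts` (list of callables) are
    placeholders for this sketch; `shared_expert` mimics the always-on expert."""
    logits = router_W @ h                          # R(h_t): one logit per routed expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax routing distribution p_t
    top = np.argpartition(probs, -k)[-k:]          # indices of the k largest p_{t,i}
    out = sum(probs[i] * experts[i](h) for i in top)
    if shared_expert is not None:                  # Ling-1T also adds one shared expert
        out = out + shared_expert(h)
    return out
```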

Ling models (Ling-Lite, Ling-Plus) scale from 16.8B to 290B parameters, retaining strong efficiency and competitive accuracy by sparse routing and fine-grained expert specialization (Team et al., 7 Mar 2025).

2. Efficiency, Scaling Laws, and Training Pipeline

Sparsity-Driven Efficiency Leverage

The Ling-1T model empirically achieves >7× active-compute efficiency compared to dense architectures. The efficiency leverage (EL) scaling law for MoE is formalized as:

$$\text{EL}(A, G, C) = \hat{A}^{\,\alpha + \gamma(\log G)^2 + \beta \log G}$$

where $A$ is the activation ratio, $G$ the expert granularity, and $C$ the compute budget in FLOPs. With Ling-1T's configuration ($A \approx 3.5\%$, $G = 8$), this law predicts the empirically validated efficiency advantage.
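
The scaling law can be evaluated directly as written; in the sketch below the fitted coefficients α, β, γ are left as caller-supplied arguments (they are not reproduced here), and $\hat{A}$ is treated simply as the activation ratio.

```python
import math

def efficiency_leverage(A_hat, G, alpha, beta, gamma):
    """Evaluate EL = A_hat ** (alpha + gamma * (log G)**2 + beta * log G).
    alpha, beta, gamma are the fitted scaling-law coefficients (not reproduced
    here); A_hat is the activation ratio, G the expert granularity."""
    return A_hat ** (alpha + gamma * math.log(G) ** 2 + beta * math.log(G))
```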

Data Mixtures and Reasoning Optimization

  • Pre-training Data: Heavy infusion of task-specialized mathematical and code reasoning data (the Ling Math and Ling Code datasets), whose share of the corpus grows from 32% to 46% over the course of training.
  • Multi-stage Curriculum: General pre-training (20T tokens), then mid-training with extended contexts (up to 128K tokens) and explicit Chain-of-Thought (CoT) samples.
  • MTP (Multi-Token Prediction): Auxiliary head/loss for predicting multiple tokens, boosting reasoning accuracy (loss weight 0.1).
  • DFT/Evo-CoT: Decoupled fine-tuning (DFT), a dual-mode supervised stage covering both instant responses and in-depth reasoning, combined with evolutionary chain-of-thought reinforcement learning (Evo-CoT).
  • FP8 Deep Quantization: All activations and gradients are kept in FP8 with per-channel statistics, trading a <0.25% accuracy loss for roughly 15% higher hardware throughput and reduced memory use (see the quantization sketch after this list).
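
The per-channel FP8 recipe above can be illustrated with a schematic NumPy simulation: compute per-channel absolute maxima, scale each channel into the E4M3 range (maximum ≈ 448), round to roughly four significand bits, and rescale. This is a coarse fake-quantization sketch, not the actual FP8 training kernel, and it ignores exponent limits and denormals.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in the common FP8 E4M3 format

def _round_significand(x, bits=4):
    """Round to ~`bits` significand bits (1 implicit + 3 stored for E4M3).
    Coarse simulation only: exponent range and denormals are ignored."""
    m, e = np.frexp(x)                        # x = m * 2**e with |m| in [0.5, 1)
    m = np.round(m * 2.0 ** bits) / 2.0 ** bits
    return np.ldexp(m, e)

def fake_fp8_per_channel(w):
    """Simulated per-channel FP8 quantize/dequantize of a matrix `w`
    shaped (channels, features): per-channel amax -> scale into the E4M3
    range, round, then rescale back."""
    amax = np.abs(w).max(axis=1, keepdims=True) + 1e-12   # per-channel statistics
    scale = E4M3_MAX / amax
    q = _round_significand(np.clip(w * scale, -E4M3_MAX, E4M3_MAX))
    return q / scale                                       # dequantized approximation
```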

3. System and Infrastructure Co-Design

Distributed Training and Elastic Resource Utilization

  • EDiT: Elastic distributed local SGD with time-based synchronization and a gradient penalty to mitigate stragglers and anomalies, delivering up to a 66.1% speedup over standard synchronous training (Team et al., 7 Mar 2025); a generic local-SGD sketch follows this list.
  • Custom File Caching (PCache): All-flash distributed caching, user-space FUSE, distributed checkpoint writing; linear scaling of throughput with accelerator count.
  • Cross-Cluster Sync (Babel): Aggressive metadata prefetching and sampled CRC verification, reducing large-scale initialization overheads (e.g., 190M files from >6 hours to ≈10 minutes).
  • XPUTimer: Ultra-light runtime tracer for anomaly/bottleneck detection, O(1) memory per step.
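
To make the EDiT synchronization pattern concrete, the sketch below shows a generic local-SGD outer loop on a toy least-squares problem: each worker takes several gradient steps on its own data shard before all workers average parameters. Only the infrequent-sync structure is illustrated; EDiT's elastic scheduling, time-based synchronization, and gradient-penalty terms are not modeled.

```python
import numpy as np

def local_sgd(shards, w0, local_steps=16, rounds=10, lr=0.1):
    """Generic local-SGD outer loop on a toy least-squares objective:
    each worker runs `local_steps` gradient steps on its own (X, y) shard,
    then all workers average parameters (the infrequent global sync)."""
    w = [w0.copy() for _ in shards]
    for _ in range(rounds):
        for i, (X, y) in enumerate(shards):               # independent local training
            for _ in range(local_steps):
                grad = 2.0 * X.T @ (X @ w[i] - y) / len(y)
                w[i] -= lr * grad
        w_avg = np.mean(w, axis=0)                        # parameter averaging (sync point)
        w = [w_avg.copy() for _ in shards]
    return w[0]
```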

Knowledge Graph Data for Tool Use

  • Synthetic Data: 14 knowledge graph subgraph patterns, first-order logic expansions, >30K instruction templates including real and synthetic API tasks.
  • Tool Use Training: Reasoned chaining, multi-hop tool selection, argument generation in realistic agent scenarios; Ling-Plus achieves benchmark-leading scores on function-calling, chaining, and external API benchmarks.

4. Empirical Results and Benchmarks

Ling-1T attains state-of-the-art accuracy per FLOP across math, code, logic, and knowledge benchmarks:

Model     | MATH | HumanEval | ToolBench | Long-Context Retrieval
Ling-Lite | 73   | 83        | Best/Near | 64K tokens
Ling-Plus | 79   | Best      | Best      | 64K tokens
Ling-1T   | SOTA | SOTA      | SOTA      | 128K tokens
  • Reasoning: Strong results across MATH, CollegeMath, MinervaMath, HumanEval, MultiPL-E, OptiBench, AIME24/25, and ARC-e/c leaderboards.
  • Efficiency: Matches or exceeds the accuracy of models requiring dense 1T-parameter computation while activating only a fraction of the FLOPs per token.
  • Safety: Balanced helpfulness and harmlessness, with competitive scores on refusal and safety evaluations.

5. Hardware and Cost Implications

  • Commodity Accelerators: Designed and validated for training on lower-spec hardware (roughly 120–370 TFLOPS and 64–96 GB of memory per accelerator) rather than premium H100/H800 GPUs, democratizing trillion-parameter scaling.
  • Cost Savings: Training Ling-Plus (290B) on lower-spec hardware reduces costs by ≈20% (e.g., from 6.35M RMB to 5.08M RMB per 1T tokens).

6. The Ling-1T Adder: Hardware Perspective

In computer arithmetic, Ling-1T also refers to a hardware architecture for binary addition—specifically, the high-speed binary Ling adder (Gupta, 2019). Principal features:

  • Carry Computation: Uses per-bit signals $g_i = a_i \cdot b_i$ (generate), $p_i = a_i + b_i$ (propagate), and $d_i = a_i \oplus b_i$ (half-sum) over adjacent bit positions; the Ling formulation introduces pseudo-carry bits for logic minimization (see the behavioral sketch after this list).
  • Logic Depth: Reduces logic levels for 4-bit addition (Ling: 4; CLA: 5), improving propagation speed.
  • Ripple Reduction: Minimizes dependency on previous carries, thus lowering cumulative delay for wide adders.
  • Complexity: Circuit complexity grows for large n; practical for moderate-bit-width, less favorable for very high-bit VLSI where other adders may dominate.
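
To make the carry recurrence concrete, the following is a behavioral Python model of an n-bit Ling-style adder: it forms pseudo-carries $H_i = g_i \lor (p_{i-1} \land H_{i-1})$, recovers conventional carries as $c_i = p_i \land H_i$, and computes sum bits $s_i = d_i \oplus c_{i-1}$. This checks the logic functionally and says nothing about gate-level structure or delay.

```python
def ling_add(a: int, b: int, n: int = 4, cin: int = 0) -> int:
    """Behavioral model of an n-bit Ling-style adder.
    Per-bit signals: g_i = a_i AND b_i, p_i = a_i OR b_i, d_i = a_i XOR b_i.
    Pseudo-carry: H_i = g_i OR (p_{i-1} AND H_{i-1}); true carry c_i = p_i AND H_i."""
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(abits, bbits)]   # generate
    p = [x | y for x, y in zip(abits, bbits)]   # propagate (inclusive-OR form)
    d = [x ^ y for x, y in zip(abits, bbits)]   # half-sum

    H = [0] * n
    s = [0] * n
    carry = cin
    for i in range(n):
        if i == 0:
            H[i] = g[0] | cin
        else:
            H[i] = g[i] | (p[i - 1] & H[i - 1])
        s[i] = d[i] ^ carry          # sum bit uses the previous true carry
        carry = p[i] & H[i]          # c_i = p_i AND H_i
    return sum(bit << i for i, bit in enumerate(s)) + (carry << n)

# Exhaustive functional check for 4-bit operands.
assert all(ling_add(x, y) == x + y for x in range(16) for y in range(16))
```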

7. Summary Table: Ling-1T Key Specifications

Attribute                 | Value
Total Params (LLM)        | 1T
Active Params per Forward | ~51B
MoE Experts per Layer     | 256 routed (8 routed + 1 shared active per token)
Efficiency Leverage       | >7× vs. dense
Precision                 | FP8 end-to-end
Context Window (max)      | 128K tokens
Hardware Cost Advantage   | ≈20% savings on sub-premium accelerators
Benchmark Leadership      | Math, Code, Logic, Tool Use
Ling Adder Logic Levels   | 4 (vs. 5 for CLA)

8. Impact, Implications, and Future Directions

Ling-1T (LLM) establishes a new Pareto frontier, representing the highest known reasoning accuracy per computational cost at the trillion-parameter scale. The open foundation, co-designed scalable architecture, and tooling for democratized training underpin a robust blueprint for next-generation “thinking” AI (e.g., future Ring series).

The Ling adder, by contrast, contributes to hardware arithmetic design, particularly in contexts demanding minimized addition latency and where gate-count increases can be accepted for speed.

Both lines demonstrate scalable efficiency gains through architectural innovation, whether in routing and data design for LLMs or in carry logic for digital circuits. The Ling-1T family thus serves as a reproducible base both for high-efficiency language reasoning and for high-speed hardware addition.
