
TeleChat3-MoE: Scalable MoE LLM

Updated 3 January 2026
  • TeleChat3-MoE is a series of large language models employing a Mixture-of-Experts architecture that activates a subset of hundreds of experts per token for efficient computation.
  • The system integrates advanced parallelism strategies, including data, tensor, pipeline, and expert parallelism, to achieve near-linear scaling and significant throughput improvements.
  • Extensive training on Huawei Ascend NPU clusters with systematic accuracy protocols and operator fusion leads to effective scaling from 105 billion to over a trillion parameters.

TeleChat3-MoE is a series of LLMs employing a Mixture-of-Experts (MoE) architecture, with parameter scales ranging from 105 billion to over one trillion. The models are trained end-to-end on Huawei’s Ascend NPU clusters and are underpinned by a systematized training infrastructure. The technical advancements facilitate reliable scaling to the frontier sizes required for state-of-the-art LLMs. TeleChat3-MoE integrates architectural innovations, comprehensive accuracy verification protocols, a suite of performance optimizations, and advanced parallelization methodologies to achieve near-linear scaling efficiency on clusters with thousands of devices (Liu et al., 30 Dec 2025).

1. Model Architecture

The core of TeleChat3-MoE is an MoE backbone within an encoder–decoder or decoder-only stack comprising $L \approx 45$–$60$ layers (depending on scale). Each layer incorporates Multi-Latent Attention (MLA) blocks, feed-forward networks (FFN), and a dedicated MoE block. MoE layers instantiate $E$ experts per layer (typically in the hundreds), but only the top $k = 4$–$8$ experts are activated per token. A fallback shared expert is always available.

The expert routing employs a lightweight linear gating network $W_g \in \mathbb{R}^{E \times d}$, which computes logits for expert selection from the input representation $x \in \mathbb{R}^d$:

$$P_i(x) = \operatorname{softmax}_i(W_g x) = \frac{\exp(w_{g,i}^{\top} x)}{\sum_{j=1}^{E} \exp(w_{g,j}^{\top} x)}.$$
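As a concrete sketch of this gating step, the snippet below computes the softmax routing probabilities and applies top-$k$ sparsification. All sizes ($E=8$, $d=16$, $k=2$) are made-up toy values, and renormalizing over the selected experts is one common convention rather than a detail confirmed by the source.

```python
# Toy top-k expert gating; sizes are illustrative, not TeleChat3-MoE's.
import numpy as np

def top_k_gating(x, W_g, k):
    """Softmax routing probabilities with all but the top-k zeroed out."""
    logits = W_g @ x                       # shape (E,)
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    mask = np.zeros_like(probs)
    mask[np.argsort(probs)[-k:]] = 1.0     # keep the k largest probabilities
    sparse = probs * mask
    return sparse / sparse.sum()           # renormalize over selected experts

rng = np.random.default_rng(0)
E, d, k = 8, 16, 2
x = rng.normal(size=d)
W_g = rng.normal(size=(E, d))
weights = top_k_gating(x, W_g, k)
print(np.count_nonzero(weights))           # only k experts receive weight
```

Each token's hidden state would then be sent only to the experts with nonzero weight, which is what makes the per-token compute sparse.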

Top-$k$ routing enforces sparsity by zeroing out all but the $k$ highest routing probabilities. Expert parameters in a layer are estimated as:

$$P_{\text{layer}}^{\mathrm{MoE}} \approx E \cdot (d\,h_{ff} + h_{ff}\,d),$$

where $d$ is the hidden size and $h_{ff}$ is the expert intermediate dimension. For non-MoE models, the corresponding dense FFN parameter count would be $2d\,d_{ff}$. The total model parameter count $P$ is

$$P \approx L \cdot [P_{\mathrm{MLA}}(d) + P_{\mathrm{MoE}}(d, h_{ff}, E)] + P_{\mathrm{embed}} + P_{\mathrm{head}},$$

with $P_{\mathrm{MoE}} = E \cdot (2d\,h_{ff})$, and $P_{\mathrm{MLA}} \approx 4d^2$ describing the attention projections.
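The formulas above can be instantiated as a back-of-envelope calculator. The values used here (d = 5120, h_ff = 4d, E = 64 experts, L = 45 layers, a 128K vocabulary, untied output head) are illustrative assumptions, not published TeleChat3-MoE hyperparameters.

```python
# Parameter-count estimate from P ≈ L·[P_MLA + P_MoE] + P_embed + P_head,
# with P_MoE = E·2·d·h_ff and P_MLA ≈ 4d². All values are illustrative.
def moe_param_count(L, d, h_ff, E, vocab=128_000):
    p_mla = 4 * d * d            # approximate attention projections
    p_moe = E * 2 * d * h_ff     # up- and down-projection per expert
    p_embed = vocab * d          # token embedding table
    p_head = vocab * d           # output head (assumed untied)
    return L * (p_mla + p_moe) + p_embed + p_head

total = moe_param_count(L=45, d=5120, h_ff=4 * 5120, E=64)
print(f"{total / 1e9:.1f}B parameters")  # lands inside the 105B–1T range
```

Note how the expert term $E \cdot 2d\,h_{ff}$ dominates: at these settings the experts account for over 99% of each layer's parameters, which is why only activating $k$ of them per token saves so much compute.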

2. Model Scale and Memory Footprint

TeleChat3-MoE variants scale from $1.05 \times 10^{11}$ to over $1 \times 10^{12}$ parameters by adjusting:

  • Hidden size $d$: 5,000 to 20,000
  • Layers $L$: 45 to 60+
  • Experts per layer $E$: hundreds
  • Expert intermediate size $h_{ff}$: approximately 3–4 times $d$

Memory usage per training run is decomposed as follows:

$$\begin{align*} M_{\text{params}} &= P \cdot S_p \\ M_{\text{optim}} &= k_{\mathrm{opt}} \cdot P \cdot S_p \\ M_{\text{activ}} &= B \cdot S \cdot d \cdot S_p \\ M_{\text{total}} &\approx (1 + k_{\mathrm{opt}}) P S_p + B S d S_p \end{align*}$$

where $S_p$ is the number of bytes per parameter (2 for FP16, 4 for FP32), $k_{\mathrm{opt}}$ is the optimizer state multiplier (e.g., 2 for Adam), $B$ is the batch size, and $S$ is the sequence length.
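A minimal sketch of this decomposition, using made-up numbers (a 105B-parameter model evenly sharded over 64 devices, FP16 weights, Adam with two extra optimizer states); the single activation term is the same crude approximation used in the formula above, not a full per-layer activation model.

```python
# Per-device training memory estimate from M_total ≈ (1+k_opt)·P·S_p + B·S·d·S_p.
# Sharding factor and all sizes are illustrative assumptions.
def training_memory_gb(P, S_p, k_opt, B, S, d):
    m_params = P * S_p                 # model weights
    m_optim = k_opt * P * S_p          # optimizer states (e.g., Adam moments)
    m_activ = B * S * d * S_p          # crude activation term from the text
    return (m_params + m_optim + m_activ) / 2**30

per_device_params = 105e9 / 64         # assume even sharding over 64 devices
gb = training_memory_gb(P=per_device_params, S_p=2, k_opt=2, B=1, S=8192, d=5120)
print(round(gb, 1), "GB per device")
```

The dominant term is $(1 + k_{\mathrm{opt}}) P S_p$: weights plus optimizer states, which is what parallelism strategies in Section 4 must spread across devices.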

3. Training Infrastructure and Accuracy Protocols

TeleChat3-MoE is trained on Ascend NPU clusters, leveraging both Atlas 900 and Ascend 910B hardware orchestrated via the MindSpore framework. Operator-level numerical verification is enforced by comparing against a golden CPU baseline in FP32/FP64. The tolerance $\Delta$ is set based on the accumulation count $N_{\text{acc}}$, with representative thresholds:

  • $N_{\text{acc}} < 2{,}000$: $\Delta \approx 4 \times 10^{-3}$
  • $N_{\text{acc}} < 1 \times 10^6$: $\Delta \approx 0.1$

Inputs are clipped to avoid division instability. End-to-end alignment involves systematic comparison of forward losses and backward gradients under cross-hardware and cross-parallelism settings, starting from single-device, tiny-model runs and scaling up; discrepancies are isolated by tensor dumping at specified checkpoints. Verification processes are outlined in Figure 1 and Table 3 of the source (Liu et al., 30 Dec 2025).
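The operator-level check can be sketched as follows: a low-precision matmul (FP32 accumulation with an FP16 result, standing in for the device path) is compared against an FP64 golden baseline, with the tolerance keyed on the accumulation count. The thresholds mirror the representative values above; shapes, data, and the max-normalized error metric are illustrative assumptions.

```python
# Sketch of golden-baseline verification with accumulation-based tolerances.
import numpy as np

def tolerance(n_acc):
    """Representative thresholds from the text, keyed on accumulation count."""
    if n_acc < 2_000:
        return 4e-3
    if n_acc < 1_000_000:
        return 0.1
    raise ValueError("no threshold defined for this accumulation count")

rng = np.random.default_rng(1)
a = rng.normal(size=(32, 256)).astype(np.float16)
b = rng.normal(size=(256, 32)).astype(np.float16)

# Golden baseline: identical inputs, accumulated in FP64 on the CPU.
golden = a.astype(np.float64) @ b.astype(np.float64)
# Low-precision path: FP32 accumulation, FP16 output (stand-in for the NPU op).
device = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)

# Max-normalized error vs. the tolerance for N_acc = 256 accumulations.
err = np.max(np.abs(device.astype(np.float64) - golden)) / np.max(np.abs(golden))
assert err < tolerance(a.shape[1])
print(f"max normalized error {err:.2e}")
```

In practice the same comparison would be run per operator and per checkpoint, with tensor dumps used to localize the first layer at which the device path drifts outside $\Delta$.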

4. Parallelism and Work Scheduling Strategies

TeleChat3-MoE employs multi-dimensional parallelism, combining data (DP), tensor (TP), pipeline (PP), and expert (EP) parallelism.

Pipeline scheduling is optimized via interleaved layer allocation and a 1F1B execution schedule (one-forward/one-backward), which overlaps communication and computation. This arrangement yields a +10% throughput improvement over naive pipeline scheduling.
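The per-stage operation order under 1F1B can be sketched as below: a warm-up run of forwards fills the pipeline, then the stage alternates one forward with one backward until it drains. The stage/microbatch counts are made-up, and this shows only the ordering, not the interleaved layer allocation.

```python
# 1F1B operation order for one pipeline stage (illustrative sketch).
def one_f_one_b(stage, n_stages, n_microbatches):
    warmup = n_stages - 1 - stage           # forwards before steady state
    ops = [("F", m) for m in range(warmup)] # warm-up forwards
    f, b = warmup, 0
    while b < n_microbatches:
        if f < n_microbatches:
            ops.append(("F", f)); f += 1    # one forward ...
        ops.append(("B", b)); b += 1        # ... then one backward
    return ops

sched = one_f_one_b(stage=0, n_stages=4, n_microbatches=6)
print("".join(op for op, _ in sched))       # e.g. FFF then alternating F/B
```

The steady-state alternation is what bounds the number of in-flight activations per stage, which is how 1F1B keeps memory flat while overlapping communication with compute.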

For very long sequences (up to 128,000 tokens), an attention-aware data scheduling approach sorts and shards samples by document length cost, balancing the sparse-attention computational load across devices.
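One simple way to realize such length-aware scheduling is a greedy longest-first assignment to the least-loaded device. The quadratic cost model and all numbers below are illustrative assumptions; the paper's exact cost function for sparse attention is not specified here.

```python
# Greedy length-balanced sharding: sort documents by an assumed O(n^2)
# attention cost, assign each to the currently least-loaded device.
import heapq

def balance_by_length(doc_lengths, n_devices):
    cost = lambda n: n * n                      # assumed attention cost model
    heap = [(0, dev, []) for dev in range(n_devices)]
    heapq.heapify(heap)
    for length in sorted(doc_lengths, reverse=True):
        load, dev, docs = heapq.heappop(heap)   # least-loaded device
        docs.append(length)
        heapq.heappush(heap, (load + cost(length), dev, docs))
    return sorted(heap)                         # (load, device, docs) triples

shards = balance_by_length([128_000, 64_000, 32_000, 32_000, 16_000, 8_000], 2)
for load, dev, docs in shards:
    print(dev, load, docs)
```

With a quadratic cost, one 128K-token document outweighs all shorter ones combined, which is exactly why naive round-robin sharding would leave most devices idle.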

Expert parallelism utilizes a hierarchical communication structure: inter-node AllGather, followed by local filtering and then intra-node All-to-All. This yields approximately 15% throughput gain at EP degree 16 compared to single global All-to-All. Communication overlap is achieved by partitioning batch and sequence dimensions to permit concurrent EP communication (RoCE AllGather + HCCS All-to-All) and FFN computation, reducing EP comm time from 30% to 5% of total.

Pipeline layout optimization is posed as an Integer Linear Programming (ILP) problem. Given a parallelism strategy (DP, TP, PP, EP, MB), ILP variables $x_{\text{layer},\text{stage}}$ and recompute flags $r_i$ are selected to minimize:

$$T_{\text{total}} = T_{\text{compute}} + T_{\text{bubble}}(x) + T_{\text{comm}}(x)$$

subject to memory constraints Mpeak(x)MdevicemaxM_{\text{peak}}(x) \leq M_{\text{device}}^{\max}. This reduces manual configuration from approximately 7 days to 0.5 days and is found to match or outperform expert-tuned baselines (39.97 ms vs. 40.08 ms step time on 4096 devices).
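The objective can be illustrated on a toy instance small enough to solve by brute force instead of an ILP solver: pick contiguous layer-to-stage cuts minimizing the slowest stage's time, subject to a per-device memory cap. Layer costs, memory weights, and the cap are all made-up numbers, and recompute flags are omitted for brevity.

```python
# Brute-force stand-in for the ILP: choose pipeline cut points minimizing
# the bottleneck stage time under a memory constraint (toy instance).
from itertools import combinations

layer_time = [1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0]  # per-layer compute cost
layer_mem  = [1,   1,   2,   2,   2,   1,   1,   1  ]  # per-layer memory cost
MEM_CAP = 7                                            # M_device^max analogue

def best_split(n_stages=2):
    n = len(layer_time)
    best = None
    for cuts in combinations(range(1, n), n_stages - 1):
        bounds = [0, *cuts, n]                         # contiguous stages
        stages = [range(bounds[i], bounds[i + 1]) for i in range(n_stages)]
        if max(sum(layer_mem[l] for l in s) for s in stages) > MEM_CAP:
            continue                                   # violates memory cap
        # Step time is dominated by the slowest stage; imbalance -> bubbles.
        cost = max(sum(layer_time[l] for l in s) for s in stages)
        if best is None or cost < best[0]:
            best = (cost, bounds)
    return best

print(best_split())
```

A real ILP formulation additionally encodes the recompute flags $r_i$ and explicit bubble/communication terms, but the structure is the same: a discrete assignment minimized under a peak-memory constraint, which is what lets the tool replace roughly a week of manual tuning.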

5. Performance Engineering and Optimizations

Hierarchical and overlapped expert parallelism communication provides up to 15% throughput improvement and cuts EP communication cost to 5% of total comm time.

DVM-based operator fusion consolidates "Cube" operators (GroupedMatMul) and "Vector" operators (such as Reshape and Cast) into fused composite operators. For instance, fusing a $[20 \times 5120 \times 3072]$ MatMul–Reshape–Cast sequence yields an 85% speedup.

Cluster-level optimizations include:

  • Host-bound regime: CPU affinity and kernel isolation decrease per-node variance by 38% and increase mean throughput up to 10–15% for large configurations.
  • Device-bound regime: Raising the firmware threshold for NPU idle mode increases throughput by 25–30% on 4096 NPUs.
  • IOMMU passthrough: Enables additional 3–5% performance improvement through enhanced monitoring paths.

A summary of throughput improvements is given below:

Optimization              Improvement
Pipeline + 1F1B           +10% throughput
Hierarchical EP           +15% throughput
EP overlap                EP comm 30% → 5%
Operator fusion           +85% (single-op)
Cluster firmware tweaks   +25–30% throughput
Host isolation            +10–15% throughput
IOMMU passthrough         +3–5% throughput

6. Empirical Results and Scaling

TeleChat3-MoE demonstrates near-linear scaling from hundreds up to approximately 8192 NPUs, including training runs of over one trillion parameters. On a 4096-device run (438B model, $B_{\text{global}} = 16{,}384$), example step times are:

  • Expert-1: 40.08 ms
  • Expert-2: 40.15 ms
  • Tool-generated (ILP-optimized): 39.97 ms

End-to-end improvements against unoptimized baselines include:

  • Pipeline+1F1B: +10% throughput
  • Hierarchical EP: +15%
  • EP overlap: EP comm reduced from 30% to 5% of total
  • Operator fusion: +85% (single-op)
  • Cluster firmware optimizations: +25–30%
  • Host isolation: +10–15%
  • IOMMU passthrough: +3–5%

These outcomes demonstrate the effectiveness of cohesive architectural, infrastructural, and parallelization design for MoE LLMs at trillion-parameter scale on specialized NPU clusters (Liu et al., 30 Dec 2025).
