TeleChat3-MoE: Scalable MoE LLM
- TeleChat3-MoE is a series of large language models employing a Mixture-of-Experts architecture that activates only a small subset of its hundreds of experts per token for efficient computation.
- The system integrates advanced parallelism strategies, including data, tensor, pipeline, and expert parallelism, to achieve near-linear scaling and significant throughput improvements.
- Extensive training on Huawei Ascend NPU clusters with systematic accuracy protocols and operator fusion leads to effective scaling from 105 billion to over a trillion parameters.
TeleChat3-MoE is a series of LLMs employing a Mixture-of-Experts (MoE) architecture, with parameter scales ranging from 105 billion to over one trillion. The models are trained end-to-end on Huawei’s Ascend NPU clusters and are underpinned by a systematized training infrastructure. The technical advancements facilitate reliable scaling to the frontier sizes required for state-of-the-art LLMs. TeleChat3-MoE integrates architectural innovations, comprehensive accuracy verification protocols, a suite of performance optimizations, and advanced parallelization methodologies to achieve near-linear scaling efficiency on clusters with thousands of devices (Liu et al., 30 Dec 2025).
1. Model Architecture
The core of TeleChat3-MoE is an MoE backbone within an encoder–decoder or decoder-only stack comprising $45$–$60$ layers (depending on scale). Each layer incorporates Multi-Latent Attention (MLA) blocks, feed-forward networks (FFN), and a dedicated MoE block. MoE layers instantiate $E$ experts per layer (typically in the hundreds), but only the top-$k$ experts (up to $8$) are activated per token. A fallback shared expert is always available.
The expert routing employs a lightweight linear gating network with weights $W_g$, which calculates logits for expert selection from the input representation $x$:

$$p = \mathrm{softmax}(W_g x) \in \mathbb{R}^{E}$$

Top-$k$ routing enforces sparsity by zeroing out all but the $k$ highest routing probabilities (renormalizing the survivors). Expert parameters in a layer are estimated as

$$P_{\text{MoE}} \approx E \cdot 2\, d\, d_e,$$

where $d$ is the hidden size and $d_e$ is the expert intermediate dimension (two projection matrices per expert). For non-MoE models, the corresponding dense FFN parameter count would be $2\, d\, d_{\text{ff}}$. The total model parameter count is

$$P_{\text{total}} \approx L\,\big(P_{\text{attn}} + P_{\text{MoE}}\big) + P_{\text{embed}},$$

with $P_{\text{attn}} \approx 4 d^2$ describing attention projections.
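The top-$k$ routing step can be sketched as follows; the names `topk_route` and `W_g` are illustrative, and the gating parametrization is a standard linear-softmax convention rather than a confirmed detail of the paper's implementation:

```python
import numpy as np

def topk_route(x, W_g, k=8):
    """Route one token: softmax over expert logits, keep the top-k
    probabilities, zero the rest, and renormalize the survivors.

    x   : (d,)   token representation
    W_g : (E, d) gating weights (hypothetical parametrization)
    """
    logits = W_g @ x                           # one logit per expert
    z = logits - logits.max()                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()        # softmax routing probabilities
    top = np.argsort(probs)[-k:]               # indices of the k largest
    gates = np.zeros_like(probs)
    gates[top] = probs[top] / probs[top].sum() # sparse, renormalized gates
    return top, gates

rng = np.random.default_rng(0)
E, d = 256, 64                                 # hundreds of experts, demo dim
x = rng.standard_normal(d)
W_g = rng.standard_normal((E, d))
experts, gates = topk_route(x, W_g, k=8)
```

Only the selected experts' FFNs run for this token; their outputs are combined with the `gates` weights, which is what makes per-token compute independent of the total expert count.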
2. Model Scale and Memory Footprint
TeleChat3-MoE variants scale from $105$B to over $1$T parameters by adjusting:
- Hidden size $d$: $5{,}000$ to $20{,}000$
- Layers $L$: $45$ to $60+$
- Experts per layer $E$: hundreds
- Expert intermediate size $d_e$: approximately $3$–$4\times$ the hidden size
Memory usage per training run is decomposed as

$$M \approx \underbrace{b\,P}_{\text{weights}} + \underbrace{b\,P}_{\text{gradients}} + \underbrace{m\,b\,P}_{\text{optimizer states}} + M_{\text{act}}(B, S),$$

where $b$ is bytes per parameter (2 for FP16, 4 for FP32), $m$ is the optimizer state multiplier (e.g. $2$ for Adam), $B$ is batch size, and $S$ is sequence length (the activation term $M_{\text{act}}$ grows with both).
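The parameter and memory accounting can be sketched numerically. The conventions here (two matrices per expert, $4d^2$ attention projections, embeddings omitted) and the example shapes are assumptions for illustration, not the paper's published configuration:

```python
def estimate_params(L, d, E, d_e):
    """Total parameters: L layers of attention (~4*d^2 projections)
    plus E experts of two d-by-d_e matrices each. Embeddings omitted."""
    p_attn = 4 * d * d
    p_moe = E * 2 * d * d_e
    return L * (p_attn + p_moe)

def estimate_memory_bytes(P, bytes_per_param=2, opt_mult=2):
    """Static training state: FP16 weights + FP16 gradients + FP32
    optimizer moments (opt_mult of them). Activations excluded."""
    return bytes_per_param * P + bytes_per_param * P + opt_mult * 4 * P

# Illustrative shapes only -- not the paper's exact configuration
P = estimate_params(L=60, d=8192, E=256, d_e=2048)
mem_gb = estimate_memory_bytes(P) / 1e9
```

Even this rough accounting makes the scaling pressure visible: static optimizer state alone runs to multiple terabytes at these scales, which is why optimizer-state and expert parallelism (Section 4) are required.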
3. Training Infrastructure and Accuracy Protocols
TeleChat3-MoE is trained on Ascend NPU clusters, leveraging both Atlas 900 and Ascend 910B hardware orchestrated via the MindSpore framework. Operator-level numerical verification is enforced by comparing each operator's output to a golden CPU baseline computed in FP32/FP64. The error tolerance is set based on the operator's accumulation count $N$, with looser thresholds permitted as $N$ grows.
Inputs are clipped to avoid division instability. End-to-end alignment involves systematic comparison of forward losses and backward gradients under cross-hardware and cross-parallelism settings, starting from single-device, tiny-model runs and scaling up; discrepancies are isolated by tensor dumping at specified checkpoints. Verification processes are outlined in Figure 1 and Table 3 of the source (Liu et al., 30 Dec 2025).
4. Parallelism and Work Scheduling Strategies
TeleChat3-MoE employs multi-dimensional parallelism configurations:
- Data Parallelism (DP)
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- Sequence Parallelism (SP)
- Expert Parallelism (EP)
- Optimizer-state Parallelism (OP)
Pipeline scheduling is optimized via interleaved layer allocation and a 1F1B execution schedule (one-forward/one-backward), which overlaps communication and computation. This arrangement yields a +10% throughput improvement over naive pipeline scheduling.
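The 1F1B ordering for a single stage can be sketched as below; this is a schematic of the schedule only (warmup forwards, steady-state alternation, backward drain), with interleaving and communication overlap not modeled:

```python
def one_f_one_b(n_stages, n_microbatches, stage):
    """Operation order for one pipeline stage under 1F1B.

    Warmup: (n_stages - stage - 1) forwards, then strictly alternate
    forward/backward, then drain the remaining backwards. Keeping at
    most ~n_stages microbatches in flight bounds activation memory.
    """
    warmup = min(n_stages - stage - 1, n_microbatches)
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while f < n_microbatches:       # steady state: one forward, one backward
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    while b < n_microbatches:       # cooldown: drain backwards
        ops.append(("B", b)); b += 1
    return ops

sched = one_f_one_b(n_stages=4, n_microbatches=6, stage=0)
```

Compared with running all forwards before any backward, the bounded number of in-flight microbatches is what lets activation memory stay flat as the microbatch count grows.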
For very long sequences (up to 128,000 tokens), an attention-aware data scheduling approach sorts and shards samples by document length cost, balancing the sparse-attention computational load across devices.
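The length-balancing idea can be sketched with a greedy longest-processing-time assignment; the quadratic `len**2` cost model and the function name are assumptions standing in for the paper's exact attention-cost function and sharding granularity:

```python
import heapq

def balance_by_attention_cost(doc_lengths, n_devices):
    """Greedy LPT assignment of documents to devices.

    Cost model: attention FLOPs scale roughly quadratically with
    document length, so sum(len^2) is balanced rather than raw token
    counts. Illustrative sketch of attention-aware scheduling only.
    """
    docs = sorted(enumerate(doc_lengths), key=lambda t: -t[1])
    heap = [(0, dev, []) for dev in range(n_devices)]   # (load, id, docs)
    heapq.heapify(heap)
    for idx, length in docs:
        load, dev, assigned = heapq.heappop(heap)       # least-loaded device
        assigned.append(idx)
        heapq.heappush(heap, (load + length * length, dev, assigned))
    return sorted(heap)

shards = balance_by_attention_cost(
    [128000, 4000, 4000, 64000, 32000, 96000], n_devices=2)
loads = [load for load, _, _ in shards]
```

Balancing token counts alone would pair the 128k document with far too little work elsewhere; balancing the quadratic cost keeps the per-device attention load within a few percent of even.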
Expert parallelism utilizes a hierarchical communication structure: inter-node AllGather, followed by local filtering and then intra-node All-to-All. This yields approximately 15% throughput gain at EP degree 16 compared to a single global All-to-All. Communication overlap is achieved by partitioning batch and sequence dimensions to permit concurrent EP communication (RoCE AllGather + HCCS All-to-All) and FFN computation, reducing EP comm time from 30% to 5% of total communication time.
Pipeline layout optimization is posed as an Integer Linear Programming (ILP) problem. Given a parallelism strategy (DP, TP, PP, EP, MB), ILP variables $x$ (layer-to-stage placement) and recompute flags $r$ are selected to minimize the modeled step time:

$$\min_{x,\, r}\; T_{\text{step}}(x, r)$$

subject to per-device memory constraints $M_{\text{peak}}(x, r) \le M_{\text{device}}$. This reduces manual configuration from approximately 7 days to 0.5 days and is found to match or outperform expert-tuned baselines (39.97 ms vs. 40.08 ms step time on 4096 devices).
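The shape of this optimization can be illustrated with a tiny brute-force stand-in for the ILP; all cost-model constants (recompute memory saving, recompute time overhead) and the function name are assumptions, and a real solver would handle far larger search spaces:

```python
from itertools import product

def search_layout(n_layers, n_stages, mem_per_layer, mem_cap,
                  t_layer, recompute_overhead=0.3, recompute_saving=0.6):
    """Brute-force sketch of the ILP: choose how many layers each
    pipeline stage holds and a per-stage recompute flag, minimizing
    the bottleneck stage time under a per-stage memory cap."""
    def splits(remaining, stages):
        # all ways to split `remaining` layers across `stages` stages
        if stages == 1:
            yield (remaining,)
            return
        for k in range(1, remaining - stages + 2):
            for rest in splits(remaining - k, stages - 1):
                yield (k,) + rest

    best = None
    for layout in splits(n_layers, n_stages):
        for flags in product([0, 1], repeat=n_stages):
            mem = [k * mem_per_layer * (1 - recompute_saving * f)
                   for k, f in zip(layout, flags)]
            if max(mem) > mem_cap:
                continue                      # violates memory constraint
            time = max(k * t_layer * (1 + recompute_overhead * f)
                       for k, f in zip(layout, flags))
            if best is None or time < best[0]:
                best = (time, layout, flags)
    return best

best = search_layout(n_layers=8, n_stages=4, mem_per_layer=10,
                     mem_cap=25, t_layer=1.0)
```

Even this toy version shows the trade the solver navigates: recomputation buys memory headroom at the cost of step time, so it is only selected where the memory constraint would otherwise bind.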
5. Performance Engineering and Optimizations
Hierarchical and overlapped expert parallelism communication provides up to 15% throughput improvement and cuts EP communication cost to 5% of total comm time.
DVM-based operator fusion consolidates "Cube" operators (GroupedMatMul) and "Vector" operators (such as Reshape, Cast) into fused composite operators. For instance, fusing a MatMul–Reshape–Cast sequence results in an 85% speedup.
Cluster-level optimizations include:
- Host-bound regime: CPU affinity and kernel isolation decrease per-node variance by 38% and increase mean throughput up to 10–15% for large configurations.
- Device-bound regime: Raising the firmware threshold for NPU idle mode increases throughput by 25–30% on 4096 NPUs.
- IOMMU passthrough: Enables an additional 3–5% performance improvement through enhanced monitoring paths.
A summary of throughput improvements is given below:
| Optimization | Improvement |
|---|---|
| Pipeline+1F1B | +10% throughput |
| Hierarchical EP | +15% throughput |
| EP overlap | EP comm 30%→5% |
| Operator fusion | +85% (single-op) |
| Cluster firmware tweaks | +25–30% throughput |
| Host isolation | +10–15% throughput |
| IOMMU passthrough | +3–5% throughput |
6. Empirical Results and Scaling
TeleChat3-MoE demonstrates near-linear scaling from hundreds up to approximately 8192 NPUs, including training runs of over one trillion parameters. On a 4096-device run of a 438B-parameter model, example step times are:
- Expert-1: 40.08 ms
- Expert-2: 40.15 ms
- Tool-generated (ILP-optimized): 39.97 ms
End-to-end improvements against unoptimized baselines match the per-optimization gains summarized in the table in Section 5.
These outcomes demonstrate the effectiveness of cohesive architectural, infrastructural, and parallelization design for MoE LLMs at trillion-parameter scale on specialized NPU clusters (Liu et al., 30 Dec 2025).