TeleChat3-MoE: Scalable MoE LLM
- TeleChat3-MoE is a series of large language models employing a Mixture-of-Experts architecture that activates only a small subset of its hundreds of experts per token for efficient computation.
- The system integrates advanced parallelism strategies, including data, tensor, pipeline, and expert parallelism, to achieve near-linear scaling and significant throughput improvements.
- Extensive training on Huawei Ascend NPU clusters with systematic accuracy protocols and operator fusion leads to effective scaling from 105 billion to over a trillion parameters.
TeleChat3-MoE is a series of LLMs employing a Mixture-of-Experts (MoE) architecture, with parameter scales ranging from 105 billion to over one trillion. The models are trained end-to-end on Huawei’s Ascend NPU clusters and are underpinned by a systematized training infrastructure. The technical advancements facilitate reliable scaling to the frontier sizes required for state-of-the-art LLMs. TeleChat3-MoE integrates architectural innovations, comprehensive accuracy verification protocols, a suite of performance optimizations, and advanced parallelization methodologies to achieve near-linear scaling efficiency on clusters with thousands of devices (Liu et al., 30 Dec 2025).
1. Model Architecture
The core of TeleChat3-MoE is an MoE backbone within an encoder–decoder or decoder-only stack comprising $45$–$60$ layers (depending on scale). Each layer incorporates Multi-Latent Attention (MLA) blocks, feed-forward networks (FFN), and a dedicated MoE block. MoE layers instantiate $E$ experts per layer (typically in the hundreds), but only the top-$k$ experts (up to $8$) are activated per token. A fallback shared expert is always available.
The expert routing employs a lightweight linear gating network with weights $W_g$, which calculates logits for expert selection from the input representation $x$:

$$p = \mathrm{softmax}(W_g x) \in \mathbb{R}^{E}$$

Top-$k$ routing enforces sparsity by zeroing out all but the $k$ highest routing probabilities (renormalizing the survivors). Expert parameters in a layer are estimated as

$$P_{\text{MoE}} \approx E \cdot 2\, d\, d_e,$$

where $d$ is the hidden size and $d_e$ is the expert intermediate dimension (two projection matrices per expert). For non-MoE models, the corresponding dense FFN parameter count would be $2\, d\, d_{\text{ff}}$. The total model parameter count is

$$P_{\text{total}} \approx L\,\big(P_{\text{attn}} + P_{\text{MoE}}\big) + P_{\text{embed}},$$

with $P_{\text{attn}} \approx 4 d^2$ describing attention projections.
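The top-$k$ routing step can be sketched as follows; the names `topk_route` and `W_g` are illustrative, and the gating parametrization is a standard linear-softmax convention rather than a confirmed detail of the paper's implementation:

```python
import numpy as np

def topk_route(x, W_g, k=8):
    """Route one token: softmax over expert logits, keep the top-k
    probabilities, zero the rest, and renormalize the survivors.

    x   : (d,)   token representation
    W_g : (E, d) gating weights (hypothetical parametrization)
    """
    logits = W_g @ x                           # one logit per expert
    z = logits - logits.max()                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()        # softmax routing probabilities
    top = np.argsort(probs)[-k:]               # indices of the k largest
    gates = np.zeros_like(probs)
    gates[top] = probs[top] / probs[top].sum() # sparse, renormalized gates
    return top, gates

rng = np.random.default_rng(0)
E, d = 256, 64                                 # hundreds of experts, demo dim
x = rng.standard_normal(d)
W_g = rng.standard_normal((E, d))
experts, gates = topk_route(x, W_g, k=8)
```

Only the selected experts' FFNs run for this token; their outputs are combined with the `gates` weights, which is what makes per-token compute independent of the total expert count.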
2. Model Scale and Memory Footprint
TeleChat3-MoE variants scale from $105$B to over $1$T parameters by adjusting:
- Hidden size $d$: $5{,}000$ to $20{,}000$
- Layers $L$: $45$ to $60+$
- Experts per layer $E$: hundreds
- Expert intermediate size $d_e$: approximately $3$–$4\times$ the hidden size
Memory usage per training run is decomposed as

$$M \approx \underbrace{b\,P}_{\text{weights}} + \underbrace{b\,P}_{\text{gradients}} + \underbrace{m\,b\,P}_{\text{optimizer states}} + M_{\text{act}}(B, S),$$

where $b$ is bytes per parameter (2 for FP16, 4 for FP32), $m$ is the optimizer state multiplier (e.g. $2$ for Adam), $B$ is batch size, and $S$ is sequence length (the activation term $M_{\text{act}}$ grows with both).
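The parameter and memory accounting can be sketched numerically. The conventions here (two matrices per expert, $4d^2$ attention projections, embeddings omitted) and the example shapes are assumptions for illustration, not the paper's published configuration:

```python
def estimate_params(L, d, E, d_e):
    """Total parameters: L layers of attention (~4*d^2 projections)
    plus E experts of two d-by-d_e matrices each. Embeddings omitted."""
    p_attn = 4 * d * d
    p_moe = E * 2 * d * d_e
    return L * (p_attn + p_moe)

def estimate_memory_bytes(P, bytes_per_param=2, opt_mult=2):
    """Static training state: FP16 weights + FP16 gradients + FP32
    optimizer moments (opt_mult of them). Activations excluded."""
    return bytes_per_param * P + bytes_per_param * P + opt_mult * 4 * P

# Illustrative shapes only -- not the paper's exact configuration
P = estimate_params(L=60, d=8192, E=256, d_e=2048)
mem_gb = estimate_memory_bytes(P) / 1e9
```

Even this rough accounting makes the scaling pressure visible: static optimizer state alone runs to multiple terabytes at these scales, which is why optimizer-state and expert parallelism (Section 4) are required.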
3. Training Infrastructure and Accuracy Protocols
TeleChat3-MoE is trained on Ascend NPU clusters, leveraging both Atlas 900 and Ascend 910B hardware orchestrated via the MindSpore framework. Operator-level numerical verification is enforced by comparing each operator's output to a golden CPU baseline computed in FP32/FP64. The error tolerance is set based on the operator's accumulation count $N$, with looser thresholds permitted as $N$ grows.
Inputs are clipped to avoid division instability. End-to-end alignment involves systematic comparison of forward losses and backward gradients under cross-hardware and cross-parallelism settings, starting from single-device, tiny-model runs and scaling up; discrepancies are isolated by tensor dumping at specified checkpoints. Verification processes are outlined in Figure 1 and Table 3 of the source (Liu et al., 30 Dec 2025).
4. Parallelism and Work Scheduling Strategies
TeleChat3-MoE employs multi-dimensional parallelism configurations:
- Data Parallelism (DP)
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- Sequence Parallelism (SP)
- Expert Parallelism (EP)
- Optimizer-state Parallelism (OP)
Pipeline scheduling is optimized via interleaved layer allocation and a 1F1B execution schedule (one-forward/one-backward), which overlaps communication and computation. This arrangement yields a +10% throughput improvement over naive pipeline scheduling.
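The 1F1B ordering for a single stage can be sketched as below; this is a schematic of the schedule only (warmup forwards, steady-state alternation, backward drain), with interleaving and communication overlap not modeled:

```python
def one_f_one_b(n_stages, n_microbatches, stage):
    """Operation order for one pipeline stage under 1F1B.

    Warmup: (n_stages - stage - 1) forwards, then strictly alternate
    forward/backward, then drain the remaining backwards. Keeping at
    most ~n_stages microbatches in flight bounds activation memory.
    """
    warmup = min(n_stages - stage - 1, n_microbatches)
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while f < n_microbatches:       # steady state: one forward, one backward
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    while b < n_microbatches:       # cooldown: drain backwards
        ops.append(("B", b)); b += 1
    return ops

sched = one_f_one_b(n_stages=4, n_microbatches=6, stage=0)
```

Compared with running all forwards before any backward, the bounded number of in-flight microbatches is what lets activation memory stay flat as the microbatch count grows.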
For very long sequences (up to 128,000 tokens), an attention-aware data scheduling approach sorts and shards samples by document length cost, balancing the sparse-attention computational load across devices.
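The length-balancing idea can be sketched with a greedy longest-processing-time assignment; the quadratic `len**2` cost model and the function name are assumptions standing in for the paper's exact attention-cost function and sharding granularity:

```python
import heapq

def balance_by_attention_cost(doc_lengths, n_devices):
    """Greedy LPT assignment of documents to devices.

    Cost model: attention FLOPs scale roughly quadratically with
    document length, so sum(len^2) is balanced rather than raw token
    counts. Illustrative sketch of attention-aware scheduling only.
    """
    docs = sorted(enumerate(doc_lengths), key=lambda t: -t[1])
    heap = [(0, dev, []) for dev in range(n_devices)]   # (load, id, docs)
    heapq.heapify(heap)
    for idx, length in docs:
        load, dev, assigned = heapq.heappop(heap)       # least-loaded device
        assigned.append(idx)
        heapq.heappush(heap, (load + length * length, dev, assigned))
    return sorted(heap)

shards = balance_by_attention_cost(
    [128000, 4000, 4000, 64000, 32000, 96000], n_devices=2)
loads = [load for load, _, _ in shards]
```

Balancing token counts alone would pair the 128k document with far too little work elsewhere; balancing the quadratic cost keeps the per-device attention load within a few percent of even.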
Expert parallelism utilizes a hierarchical communication structure: inter-node AllGather, followed by local filtering and then intra-node All-to-All. This yields approximately 15% throughput gain at EP degree 16 compared to a single global All-to-All. Communication overlap is achieved by partitioning batch and sequence dimensions to permit concurrent EP communication (RoCE AllGather + HCCS All-to-All) and FFN computation, reducing EP comm time from 30% to 5% of total communication time.
Pipeline layout optimization is posed as an Integer Linear Programming (ILP) problem. Given a parallelism strategy (DP, TP, PP, EP, MB), ILP variables $x$ (layer-to-stage placement) and recompute flags $r$ are selected to minimize the modeled step time:

$$\min_{x,\, r}\; T_{\text{step}}(x, r)$$

subject to per-device memory constraints $M_{\text{peak}}(x, r) \le M_{\text{device}}$. This reduces manual configuration from approximately 7 days to 0.5 days and is found to match or outperform expert-tuned baselines (39.97 ms vs. 40.08 ms step time on 4096 devices).
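The shape of this optimization can be illustrated with a tiny brute-force stand-in for the ILP; all cost-model constants (recompute memory saving, recompute time overhead) and the function name are assumptions, and a real solver would handle far larger search spaces:

```python
from itertools import product

def search_layout(n_layers, n_stages, mem_per_layer, mem_cap,
                  t_layer, recompute_overhead=0.3, recompute_saving=0.6):
    """Brute-force sketch of the ILP: choose how many layers each
    pipeline stage holds and a per-stage recompute flag, minimizing
    the bottleneck stage time under a per-stage memory cap."""
    def splits(remaining, stages):
        # all ways to split `remaining` layers across `stages` stages
        if stages == 1:
            yield (remaining,)
            return
        for k in range(1, remaining - stages + 2):
            for rest in splits(remaining - k, stages - 1):
                yield (k,) + rest

    best = None
    for layout in splits(n_layers, n_stages):
        for flags in product([0, 1], repeat=n_stages):
            mem = [k * mem_per_layer * (1 - recompute_saving * f)
                   for k, f in zip(layout, flags)]
            if max(mem) > mem_cap:
                continue                      # violates memory constraint
            time = max(k * t_layer * (1 + recompute_overhead * f)
                       for k, f in zip(layout, flags))
            if best is None or time < best[0]:
                best = (time, layout, flags)
    return best

best = search_layout(n_layers=8, n_stages=4, mem_per_layer=10,
                     mem_cap=25, t_layer=1.0)
```

Even this toy version shows the trade the solver navigates: recomputation buys memory headroom at the cost of step time, so it is only selected where the memory constraint would otherwise bind.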
5. Performance Engineering and Optimizations
Hierarchical and overlapped expert parallelism communication provides up to 15% throughput improvement and cuts EP communication cost to 5% of total comm time.
DVM-based operator fusion consolidates "Cube" operators (GroupedMatMul) and "Vector" operators (such as Reshape, Cast) into fused composite operators. For instance, fusing a MatMul–Reshape–Cast sequence results in an 85% speedup.
Cluster-level optimizations include:
- Host-bound regime: CPU affinity and kernel isolation decrease per-node variance by 38% and increase mean throughput up to 10–15% for large configurations.
- Device-bound regime: Raising the firmware threshold for NPU idle mode increases throughput by 25–30% on 4096 NPUs.
- IOMMU passthrough: Enables an additional 3–5% performance improvement through enhanced monitoring paths.
A summary of throughput improvements is given below:
| Optimization | Improvement |
|---|---|
| Pipeline+1F1B | +10% throughput |
| Hierarchical EP | +15% throughput |
| EP overlap | EP comm 30%→5% |
| Operator fusion | +85% (single-op) |
| Cluster firmware tweaks | +25–30% throughput |
| Host isolation | +10–15% throughput |
| IOMMU passthrough | +3–5% throughput |
6. Empirical Results and Scaling
TeleChat3-MoE demonstrates near-linear scaling from hundreds up to approximately 8192 NPUs, including training runs of over one trillion parameters. On a 4096-device run of a 438B-parameter model, example step times are:
- Expert-1: 40.08 ms
- Expert-2: 40.15 ms
- Tool-generated (ILP-optimized): 39.97 ms
End-to-end improvements against unoptimized baselines match the per-optimization gains summarized in the table in Section 5.
These outcomes demonstrate the effectiveness of cohesive architectural, infrastructural, and parallelization design for MoE LLMs at trillion-parameter scale on specialized NPU clusters (Liu et al., 30 Dec 2025).