Training Report of TeleChat3-MoE

Published 30 Dec 2025 in cs.CL | (2512.24157v1)

Abstract: TeleChat3-MoE is the latest series of TeleChat LLMs, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on Ascend NPU clusters. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale LLM development on hardware ecosystems.

Summary

  • The paper presents a robust full-stack training infrastructure for trillion-scale MoE models, emphasizing aggressive sparsity and hardware-aware optimizations.
  • The methodology incorporates interleaved pipeline scheduling, attention-aware data management, and hierarchical expert parallelism for superior throughput.
  • Systematic numerical accuracy validation and automatic ILP-based parallelism ensure reproducible, efficient scaling across thousands of devices.

TeleChat3-MoE: Training Infrastructure and Optimization at Trillion-Scale

Model Design and Architectural Refinements

The TeleChat3-MoE model series encompasses parameter counts from 105B to over 1T, leveraging a Mixture-of-Experts (MoE) architecture specifically tuned for hardware affinity and distributed scaling. Notable architectural features include Multi-Latent Attention (MLA), which compresses KV vectors into lower-dimensional latent representations, substantially reducing KV cache footprint and elevating compute-to-memory access ratios for long-context inference. The adoption of a shallow-and-wide topology—with fewer layers but expanded hidden dimensions—further boosts per-layer arithmetic intensity and mitigates pipeline bubble overhead during distributed training. Aggressive MoE sparsity (top-4 to top-8 expert routing with one shared expert) enables low activation ratios, reducing memory footprint and maximizing device concurrency, especially in inference scenarios. These choices support efficient scaling from modest clusters for the 105B model up to 8192 devices for trillion-parameter variants.
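
As a back-of-the-envelope illustration of why MLA's latent compression matters, the sketch below compares KV-cache footprints for full multi-head K/V storage versus a single compressed latent per token. All shape numbers (layer count, heads, latent dimension) are illustrative assumptions, not TeleChat3-MoE's published configuration:

```python
def kv_cache_gb(layers, heads, head_dim, seq_len, batch=1, bytes_per=2,
                latent_dim=None):
    # With MLA, each token stores one compressed latent of size `latent_dim`
    # instead of full per-head K and V vectors (2 * heads * head_dim values).
    per_token = latent_dim if latent_dim else 2 * heads * head_dim
    return layers * seq_len * batch * per_token * bytes_per / 1e9

# Illustrative 128K-context configuration (not TeleChat3-MoE's real shapes).
full = kv_cache_gb(layers=60, heads=64, head_dim=128, seq_len=131072)
mla = kv_cache_gb(layers=60, heads=64, head_dim=128, seq_len=131072,
                  latent_dim=512)
print(f"full KV: {full:.1f} GB, MLA latent: {mla:.1f} GB")
# → full KV: 257.7 GB, MLA latent: 8.1 GB
```

Even under these toy numbers, the compressed latent cuts per-request KV-cache memory by more than an order of magnitude, which is what raises the compute-to-memory access ratio during long-context inference.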

Systematic Numerical Accuracy Verification

A comprehensive framework for numerical accuracy verification is implemented, starting with operator-wise checks. Systematic baselining uses high-precision CPU reference implementations, with strict tolerance thresholds set according to the accumulation count (e.g., 0.0001 for float32 under 2,000 accumulations, 0.004 for float16). Inputs are clipped to avoid extreme values that would trigger error amplification in division-heavy operators, ensuring that operator-level discrepancies remain tightly localized and controlled.
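
A minimal sketch of such an accumulation-aware check, assuming a simple relative-error criterion and only the two thresholds quoted above (the paper's full tolerance table and scaling rule beyond 2,000 accumulations are not reproduced here):

```python
import numpy as np

# Tolerance thresholds keyed by dtype, loosened with the accumulation
# count, following the report's examples (0.0001 for float32 under
# 2,000 accumulations, 0.004 for float16). The scaling rule past those
# two data points is an assumption for illustration.
def tolerance(dtype, accum_count):
    base = {"float32": 1e-4, "float16": 4e-3}[dtype]
    return base * max(1.0, accum_count / 2000)

def check_operator(device_out, golden, dtype, accum_count):
    # Relative error against a high-precision CPU "golden" reference;
    # inputs are assumed pre-clipped so |golden| stays away from zero.
    rel = np.abs(device_out - golden) / np.maximum(np.abs(golden), 1e-12)
    return float(rel.max()) <= tolerance(dtype, accum_count)

golden = np.ones((4, 4))
print(check_operator(golden + 5e-5, golden, "float32", 1000),  # within tolerance
      check_operator(golden + 1e-3, golden, "float32", 1000))  # outside it
# → True False
```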

End-to-end model accuracy alignment is performed during cross-hardware migration and parallelism strategy transitions, employing progressive scaling verification and layer-wise tensor dumps. This workflow efficiently isolates divergence in loss or gradient trajectories attributable to back-end-specific handling (e.g., optimizer implementation inconsistencies or framework-specific randomness). Precise numerical equivalence is validated across parallelism strategies (DP, TP, PP, SP, EP), with gating clusters used to systematically compare activations and optimizer states, ensuring reproducibility and training reliability at scale.
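
The layer-wise dump-and-diff step can be sketched as follows; the dump format and layer naming are assumptions for illustration, not the report's actual tooling:

```python
import numpy as np

def first_divergent_layer(dump_a, dump_b, atol=1e-5):
    # Walk the layer-wise tensor dumps of two runs (e.g., two hardware
    # back-ends, or two parallelism strategies) in layer order and report
    # the first layer whose activations differ beyond `atol`.
    for name in sorted(dump_a):
        diff = float(np.abs(dump_a[name] - dump_b[name]).max())
        if diff > atol:
            return name, diff
    return None

# Two mock runs that agree until layer_02, where a back-end-specific
# discrepancy (say, a different accumulation order) is injected.
rng = np.random.default_rng(42)
run_a = {f"layer_{i:02d}": rng.standard_normal((4, 8)) for i in range(4)}
run_b = {k: v.copy() for k, v in run_a.items()}
run_b["layer_02"] += 1e-3

print(first_divergent_layer(run_a, run_b))
```

Localizing the first divergent layer this way is what lets loss or gradient drift be traced back to a specific back-end behavior rather than debugged end to end.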

Training Framework Performance Optimization

Extensive optimizations target the MindSpore training framework for maximal throughput and utilization:

  • Interleaved Pipeline Scheduling & 1F1B Overlap: Moving beyond contiguous layer assignment, pipeline interleaving distributes layers in a non-contiguous fashion, minimizing pipeline bubble overhead and enabling computation-communication overlap. The 1F1B strategy, overlapping forward and backward micro-batches, improves E2E training throughput by ~10% over conventional pipeline approaches.
  • Attention-Aware Data Scheduling: For long sequences (128K tokens), an attention-aware micro-batch scheduler balances computation by redistributing samples based on sub-sequence length, mitigating device idle time and improving sparse attention efficiency.
  • Hierarchical Expert Parallelism Communication: Hierarchical AllGather followed by local filtering and intra-node All-to-All significantly reduces redundant communication in MoE training, achieving ~15% higher throughput under typical EP degrees.
  • Communication Overlapping for Expert Parallelism: Multi-dimensional data partitioning enables overlapping between EP communication and computation, reducing EP comm time from 30% to 5% of total comm overhead.
  • DVM-Based Operator Fusion: Automated fusion of memory-bound and compute-bound operators (Vector and Cube class) enhances compute unit utilization and reduces kernel launches, yielding up to 85% speedup on large fused operator sequences.
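
To make the attention-aware scheduling idea concrete, here is a minimal sketch that balances micro-batches across devices by estimated attention cost. The paper does not publish its exact algorithm; a greedy longest-processing-time heuristic stands in for it, with per-sample cost taken as the sum of squared sub-sequence lengths (documents separated by EoD masks attend only within themselves):

```python
import heapq

def attention_cost(subseq_lens):
    # Self-attention cost grows quadratically within each sub-sequence.
    return sum(l * l for l in subseq_lens)

def balance_micro_batches(samples, num_devices):
    # Assign each sample (heaviest first) to the currently least-loaded
    # device, tracked with a min-heap of (load, device_id, batch) tuples.
    heap = [(0, d, []) for d in range(num_devices)]
    heapq.heapify(heap)
    for s in sorted(samples, key=attention_cost, reverse=True):
        load, d, batch = heapq.heappop(heap)
        batch.append(s)
        heapq.heappush(heap, (load + attention_cost(s), d, batch))
    return sorted(heap, key=lambda t: t[1])

# Four samples with the same 16-token budget but very different document
# mixes, so their attention costs differ by up to 4x.
samples = [[16], [8, 8], [4, 4, 4, 4], [12, 4]]
for load, dev, batch in balance_micro_batches(samples, 2):
    print(f"device {dev}: load={load} batch={batch}")
```

Without this balancing, the device that draws the single 16-token document would do four times the attention work of the device with four 4-token documents, and everyone else would idle until it finished.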

Systematic Parallelization Framework

The manual exploration of multi-dimensional parallelism (DP, TP, PP, VPP, SP, EP, OP) is replaced by an analytic toolchain incorporating integer linear programming (ILP). This framework parses model configs, generates and ranks candidate strategies with symbolic memory and performance estimation, then applies ILP to optimize pipeline stage assignment, interleaving, and recomputation under memory constraints. Tuning duration is reduced from seven days to 0.5 days, with throughput on 4096 NPUs matching or exceeding manual expert-designed baselines. The tool enables broader and deeper search of the parallelism configuration space, resulting in more balanced and memory-efficient schedules.
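
A toy version of this strategy search can be sketched with an analytic cost model and exhaustive enumeration standing in for the ILP solver; every constant below is an illustrative assumption, not a figure from the paper:

```python
from itertools import product

DEVICES = 64
PARAMS_B = 105            # model size in billions of parameters
MEM_PER_DEVICE_GB = 64

def memory_gb(tp, pp, dp):
    # Weights, gradients, and optimizer states (~16 bytes/param in mixed
    # precision) shard across TP x PP, plus a small TP-sharded activation term.
    return 16 * PARAMS_B / (tp * pp) + 8 / tp

def step_time(tp, pp, dp):
    compute = 1000 / (tp * pp * dp)        # ideal compute scaling
    comm = 5 * (tp - 1) + 2 * (pp - 1)     # TP comm penalized more than PP
    bubble = 30 * (pp - 1) / (pp * 4)      # pipeline-bubble share
    return compute + comm + bubble

# Enumerate (TP, PP, DP) grids, keep only configs that use every device
# and fit in memory, then rank the survivors by estimated step time.
candidates = [
    (tp, pp, dp)
    for tp, pp, dp in product([1, 2, 4, 8], repeat=3)
    if tp * pp * dp == DEVICES and memory_gb(tp, pp, dp) <= MEM_PER_DEVICE_GB
]
best = min(candidates, key=lambda c: step_time(*c))
print("best (TP, PP, DP):", best)
# → best (TP, PP, DP): (4, 8, 2)
```

The real framework replaces this brute force with symbolic estimation and an ILP over pipeline stage assignment, interleaving, and recomputation, which is what makes the search tractable across the full DP/TP/PP/VPP/SP/EP/OP space.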

Cluster-Level Engineering and Firmware Optimization

Cluster-level optimizations address both host- and device-bound bottlenecks:

  • Host-Bound Issues: Through resource isolation (CPU affinity, partitioning, kernel-level control), fluctuations induced by monitoring and process contention are suppressed—variance reduced by 38%, throughput gains of up to 15% on large clusters.
  • Device-Bound Issues: Firmware-level tuning circumvents the Ascend NPU’s mis-triggered idle frequency scaling for short-duration operators, restoring chip clock under active workloads and delivering 25–30% throughput improvements.
  • Monitoring-Induced Overheads: Transition to passthrough IOMMU configs eliminates query-induced host-device latency spikes, yielding an additional 3–5% throughput gain.
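
The host-side resource-isolation step can be sketched with standard OS affinity controls; the core counts and partitioning policy below are illustrative, since the report does not publish its exact settings:

```python
import os

def training_cpu_set(total_cpus, reserved_for_system):
    # Reserve the first cores for OS daemons and monitoring agents;
    # the remaining cores are dedicated to the training process.
    assert 0 < reserved_for_system < total_cpus
    return set(range(reserved_for_system, total_cpus))

def isolate_training_process(pid=0, total_cpus=None, reserved=4):
    # Pin a process (pid=0 means the caller) to the non-reserved cores so
    # background tasks cannot preempt it mid-step.
    total = total_cpus or os.cpu_count()
    cpus = training_cpu_set(total, reserved)
    # sched_setaffinity is Linux-only; guard to keep the sketch portable.
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(pid, cpus)
    return cpus

print(sorted(training_cpu_set(16, 4)))  # cores 4..15 go to training
```

Production setups would pair this with cgroup partitioning and kernel-level controls; affinity pinning alone only addresses the process-contention part of the host-bound jitter.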

This methodological approach ensures reliable scaling and consistent performance across diverse hardware ecosystems.

Empirical Results and Contradictory Claims

The infrastructure delivers near-linear scaling across thousands of devices for both moderate (105B) and extreme (1T+) parameter models. Strong numerical results include:

  • 85% operator-level speedup from DVM-based fusion on GroupedMatMul-Reshape-Cast sequences
  • EP communication time reduction from 30% to 5% of total comm under overlapping
  • Sustainable training throughput improvements of 10–30% after cluster-level optimizations
  • Automated parallelization achieving step times indistinguishable from or better than manual baseline (39,969 ms vs. 40,076/40,147 ms on 4096 devices)

The TeleChat3-MoE design demonstrates that aggressive MoE sparsity, combined with infrastructural sophistication, can outperform lower-sparsity MoE models on end-to-end throughput and utilization—even at extreme parameter scales. This directly contradicts traditional assumptions that only moderate-sparsity regimes are feasible at such scales.

Implications and Future Directions

The work establishes a robust training solution tailored to large NPU clusters, advancing infrastructure reproducibility, engineering efficiency, and utilization. Open sourcing both models and infrastructure positions the community for rapid scaling and benchmarking beyond the current trillion-parameter frontier. The adoption of automatic parallelization tooling and systematic accuracy verification is likely to become standard practice in the development of even larger MoE-based architectures.

Practical implications include reduction of tuning effort, improvement of cluster stability, and enablement of stable training runs at record-breaking scales. For theory, the findings suggest new opportunities in sparse expert model optimization, distributed communication design, and hardware-aware transformer architectures.

Future research may further exploit the synergies of even higher sparsity MoE, longer context attention mechanisms, adaptive operator fusion schemes, and real-time firmware controls. These directions point toward robust, generalizable, and energy-efficient LLMs deployable on heterogeneous clusters worldwide.

Conclusion

TeleChat3-MoE’s training infrastructure demonstrates a mature, full-stack approach to scaling LLMs with MoE at unprecedented parameter counts. Through rigorous accuracy validation, advanced training framework optimizations, systematic parallelism strategy generation, and methodical cluster engineering, the work delivers a reproducible and efficient foundation for future large-scale LLM research. The open-sourced release invites further exploration and accelerated progress in both practical deployment and theoretical understanding of distributed MoE architectures.

Reference: "Training Report of TeleChat3-MoE" (2512.24157)

Explain it Like I'm 14

What this paper is about (big picture)

This paper explains how a team built and trained very large LLMs called TeleChat3-MoE. “MoE” stands for Mixture of Experts, which is like having a huge team of specialist mini-models where only a few speak up for each question. The models have 100+ billion to over a trillion “parameters” (think: adjustable dials) and were trained on powerful AI chips called NPUs. The paper focuses less on the model’s skills and more on the behind-the-scenes engineering that makes training models this big fast, stable, and reliable.

What questions the paper tries to answer

  • How can we make sure the math inside such giant models is accurate and consistent when we train across thousands of chips and different machines?
  • How can we speed up training so the chips don’t waste time waiting around?
  • How can we choose the best way to split the model and data across many devices without spending a week guessing and checking?
  • How can we tune the whole computer cluster (not just the model code) so everything runs smoothly together?

How they approached it (in everyday terms)

Think of training as running a giant factory:

  • MoE (Mixture of Experts): Instead of one huge worker doing everything, many specialists (experts) are available. For each sentence, only a few experts are chosen to work. This makes the factory efficient: lots of total workers exist, but only the right few are active at a time.
  • Checking the math carefully:
    • Operator-level checks: An “operator” is a tiny step like add, multiply, or softmax. They compare each operator’s output to a precise “golden” version to make sure small rounding differences don’t snowball into big errors.
    • End-to-end checks: They run the full model step-by-step on different hardware set-ups and make sure the loss (a measure of how wrong the model is) and gradients (the directions used to improve the model) match. If something drifts, they track it down layer-by-layer.
  • Speeding up the assembly line (pipeline parallelism):
    • Picture the model as many stations on a conveyor belt. If one station waits too long, you get “bubbles” (wasted time).
    • They use interleaving (mixing which layers go to which station) and a “1F1B” schedule (one forward pass and one backward pass overlapped) so work and communication happen at the same time, reducing waiting.
  • Balancing long-text work:
    • Some inputs (very long documents) are heavier than others. If one worker gets all the long ones, everyone else waits.
    • They created an “attention-aware” scheduler that spreads the heavy and light cases evenly across devices so nobody becomes a bottleneck.
  • Faster “expert” communication:
    • Experts live on many devices. Moving data to the right expert can be like sending packages across cities.
    • They cut long-distance traffic by doing it in two steps: gather data per machine first, then shuffle it locally. They also overlap communication and computation so sending packages happens while other work continues.
  • Fusing tiny steps:
    • Many tiny operations waste time by going back and forth to memory.
    • They “fuse” several small steps into one combined step, like doing multiple tasks in a single trip, which saves time and memory traffic.
  • Automatically choosing the best parallel strategy:
    • There are many ways to split work: by data, by model layers, by sequences, by experts, and more. Finding the best combo by hand can take a week.
    • They built a tool that uses quick math estimates plus a planning method (integer linear programming) to pick good strategies fast, like using a smart packing algorithm to load a truck efficiently.
  • Tuning the whole cluster:
    • They isolate resources on the host CPUs so training jobs don’t fight with background tasks.
    • They fixed a power-saving quirk on the NPUs that accidentally slowed training by lowering chip speeds mid-job.
    • They adjusted system settings to reduce random slowdowns from monitoring and memory checks.

What they found and why it matters

  • Accuracy stays consistent: With their step-by-step verification process, they matched results across different hardware and training setups. This prevents nasty surprises when scaling up.
  • Faster training:
    • Interleaved pipeline with 1F1B overlap gave about 10% speedup.
    • Smarter scheduling for long texts improved throughput by keeping devices equally busy.
    • Hierarchical and overlapped expert communication cut big communication costs (for example, reducing expert-communication time from about 30% to about 5% of total comms).
    • Operator fusion sped up common patterns by large margins in single operations (around 85% in a highlighted case).
  • Smarter configuration, less guesswork:
    • Their strategy tool reduced tuning time from about 7 days to about half a day, while matching or slightly beating expert-tuned performance.
  • Better cluster performance:
    • Resource isolation and firmware tweaks reduced slowdowns, improved consistency, and increased throughput, especially at very large scales (thousands of devices).
  • Near-linear scaling:
    • As they added more devices, performance increased almost proportionally, which is a big deal for training trillion-parameter models.

What this could mean going forward

  • Training huge models can be reliable and efficient if you treat it like a full-stack problem: math accuracy, smart scheduling, communication tricks, and cluster tuning all matter.
  • The tools they share (models, code, and scheduling systems) can help others build bigger and better LLMs faster, without reinventing everything.
  • The approach makes it more practical to train future models with even longer memory and more experts, opening the door to stronger, more helpful AI systems that can be trained on available hardware without wasting time or energy.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following aspects missing, uncertain, or unexplored; each point is framed to suggest concrete directions for future work:

  • Model evaluation: No end-to-end quality results (e.g., perplexity, task benchmarks, long-context performance) for TeleChat3-MoE at 105B–1T+ scales; provide standardized evaluations across reasoning, coding, multilingual, and 32K–128K context tasks.
  • Data transparency: Absent details on pretraining data sources, token counts, filtering, domain mix, deduplication, and long-context corpus construction; publish dataset composition and quality controls to enable reproducibility and scaling-law analysis.
  • Architecture specifics: Table 1 is incomplete—missing precise hyperparameters (number of routed experts, capacity factor, experts-per-token, expert intermediate sizes, shared expert design); release full layer-wise configurations for 105B/438B/1T+ variants.
  • Expert routing and capacity: Unspecified gating strategy (top-k, capacity factor, load-balancing loss, token dropping/backpressure) and its effect on specialization/accuracy; analyze routing stability, expert saturation, and token-drop rates.
  • MLA design and trade-offs: Insufficient description of Multi-Latent Attention (latent dimensionality, compression ratio, training hyperparameters) and quantitative comparison versus GQA/FlashAttention on memory, latency, and accuracy, especially for 128K context.
  • Positional embeddings: “Optimized positional embeddings” for 32K–128K are not defined (e.g., RoPE variants, ALiBi, extrapolation strategies); specify implementation and validate long-context generalization on synthetic and real tasks.
  • Convergence and stability: No analysis of loss dynamics, optimizer choices (betas, weight decay, EMA), gradient scaling/clipping, and failure modes at trillion-scale; report stability metrics and mitigation strategies.
  • Accuracy tolerance methodology: The accumulation-count-based tolerance table is ad hoc and unvalidated across operator families; formalize tolerance derivation, automate accumulation estimation, and show empirical links to training divergence rates.
  • Extreme-value handling: Input clipping during operator tests may mask real precision pathologies; evaluate production-safe treatments (e.g., stabilized formulations, compensated summation) and their runtime impact.
  • End-to-end precision alignment thresholds: Lack quantitative acceptance criteria for cross-hardware/parallelism equivalence (e.g., loss delta, grad-norm differences) and automated tooling; define metrics, tolerances, and CI pipelines.
  • Interleaved pipeline scheduling: Only a ~10% improvement is reported without ablations; detail sensitivity to micro-batch count, PP/VPP stages, recompute scopes, memory footprint, and bubble ratio across model depths.
  • Attention-aware data scheduling: Missing algorithmic details (cost model, balancing heuristic, complexity) and impact on convergence (gradient variance, data shuffling fairness); evaluate generality beyond EoD masks and interactions with DP/PP.
  • Hierarchical EP communication: Limited results (EP=16, +15% throughput); characterize performance across EP degrees, non-uniform topologies, variable network fabrics, and dynamic routing; quantify correctness overhead (local filtering) and failure handling.
  • EP communication overlap: No specifics on stream scheduling, buffer management, deadlock avoidance, and numeric determinism; assess contention risks (RoCE vs HCCS), memory pressure, and portability to other interconnects.
  • Operator fusion (DVM): Single-operator gains (~85%) lack end-to-end impact quantification; enumerate fused patterns, coverage rates, dynamic-shape support, debugging tooling, numerical equivalence guarantees, and portability to GPU/other NPUs.
  • Parallelization framework internals: Analytical model and ILP formulation (objective, constraints, compute/comm estimates) are unspecified; publish equations, solver scalability, error bounds, and sensitivity to estimation errors.
  • Tool generality and availability: Unknown applicability to dense transformers, alternative attention schemes, diverse hardware stacks (GPU/TPU), and integration with common frameworks; clarify open-source status, APIs, and reproducibility scripts.
  • Evaluation breadth: Performance comparisons only at 4,096 devices; provide scaling curves (512–8,192), MFU vs device count, throughput vs global batch size, step-time breakdowns (compute/comm/bubble), and variance across heterogeneous nodes.
  • Optimizer parallelism (OP/ZeRO): No concrete OP configuration (stage, partitioning, communication) and its interaction with DP/TP/EP; quantify memory savings, comm overhead, and numerical effects.
  • Fault tolerance and resilience: Absent discussion of node failures, checkpointing cadence, recovery policies, and training continuity under network perturbations; measure overheads and robustness at scale.
  • Energy and cost: No reporting of power draw, energy per token, MFU curves, thermal constraints, or carbon footprint; instrument and disclose energy-efficiency metrics and trade-offs (e.g., idle-mode changes).
  • Firmware modifications replicability: Device idle-mode threshold changes are hardware/firmware-specific; document reproducible parameters, safety/thermal implications, and generalizable policies for other accelerators.
  • IOMMU passthrough trade-offs: Security and isolation implications of passthrough mode are not addressed; evaluate risk profile, compliance constraints, and performance-security balance.
  • Resource isolation prescriptions: Provide concrete CPU affinity and kernel-level isolation settings, automation, and side-effects on system observability/maintenance; quantify overhead and stability under diverse workloads.
  • Inference performance: No latency/throughput measurements for MoE+MLA serving (batching, KV cache reuse, routing overhead), nor memory footprint under high concurrency; benchmark and optimize serving pathways.
  • Safety and alignment: Missing details on SFT/RLHF pipelines, safety filters, toxicity/fairness metrics, and hallucination management; include alignment training methods and standardized safety evaluations.
  • Expert specialization analysis: No measurement of expert entropy, specialization subjects, churn across training; analyze expert roles, stability, and contribution to downstream performance.
  • Reproducibility assets: MindSpore versioning, operator kernels, training scripts, config files, dataset preprocessing, and seed management are not provided; release complete artifacts and deterministic procedures.
  • Cross-platform portability: Unclear whether the verification workflows and performance optimizations translate to GPU/TPU stacks; port and compare against Megatron-DeepSpeed/K2/Mixtral baselines.
  • Long-context training data: Construction of 128K sequences (document concatenation policy, mask strategies) and its effect on distribution shift are unspecified; study quality impacts and alternative chunking/masking schemes.
  • Comparative baselines: No head-to-head throughput/MFU comparisons versus state-of-the-art MoE systems (Mixtral, DeepSeek-V3) under matched hardware; conduct standardized benchmarking.

Glossary

  • 1F1B: A pipeline scheduling strategy that overlaps one forward and one backward micro-batch to hide communication and reduce idle time. "we employ a carefully designed 1F1B (one forward, one backward) overlapping strategy"
  • Accumulation Precision: The numeric precision used when summing intermediate results in operators or gradients, which affects overall numerical accuracy. "Accumulation Precision: The choice of using float32 for accumulation versus float16/bfloat16 in mixed-precision kernels significantly impacts output precision."
  • All-to-All: A collective communication primitive where each device sends data to and receives data from every other device. "The intuitive implementation relied on a global All-to-All collective across all EP devices"
  • AllGather: A collective communication operation that gathers tensors from all devices and makes the full result available on each device. "an inter-node AllGather to collect complete EP data on each machine"
  • AllReduce: A collective operation that reduces (e.g., sums) tensors across devices and distributes the result back to all devices. "communication overhead (e.g., DP gradient AllReduce, EP All-to-All/AllGather, and TP collectives)"
  • Ascend NPU: Huawei’s neural processing unit used as the accelerator hardware for training. "trained end-to-end on Ascend NPU clusters using the MindSpore framework."
  • Attention-aware data scheduling: A data batching mechanism that balances computation by accounting for variable attention costs due to document structure and sequence composition. "attention-aware data scheduling for long-sequence load balancing"
  • DAPPLE: A pipeline parallelism system/scheduler that coordinates model partitioning and micro-batching across devices. "existing pipeline schedulers, such as GPipe Huang et al. (2019) and DAPPLE Yang et al. (2020)"
  • Device Virtual Machine (DVM): A framework-level abstraction that enables advanced kernel fusion and execution optimization across operator classes. "We propose an automatic operator fusion technique built on the DVM (Device Virtual Machine) framework."
  • End-of-Document (EoD) attention mask: A masking scheme that prevents tokens from attending across document boundaries when sequences are concatenated. "an End-of-Document (EoD) attention mask is applied"
  • Expert Parallelism (EP): A distributed training dimension that shards and routes Mixture-of-Experts across devices to handle expert computation. "In Mixture-of-Experts (MoE) training with Expert Parallelism (EP) Lepikhin et al. (2020); Rajbhandari et al. (2022), inter-device communication dominates training time."
  • Gating cluster: A dedicated small-scale cluster used to validate numerical precision and equivalence before deploying at full scale. "It leverages a dedicated gating cluster for systematic precision validation"
  • GPipe: A pipeline parallelism approach that trains large models by splitting layers into stages and using micro-batches to keep devices busy. "existing pipeline schedulers, such as GPipe Huang et al. (2019) and DAPPLE Yang et al. (2020)"
  • Grouped Query Attention (GQA): An attention variant that shares key/value projections across multiple query heads to reduce memory and computation. "Grouped Query Attention (GQA, Ainslie et al. (2023))"
  • GroupedMatMul: An operator that performs multiple matrix multiplications in grouped/batched fashion, often a fusion target for performance. "GroupedMatMul-Reshape-Cast sequence"
  • HCCS network: The intra-node interconnect used by Ascend NPUs for on-node high-bandwidth collectives. "intra-node All-to-All communication over the HCCS network"
  • ILP (Integer Linear Programming): An optimization technique used to find globally optimal parallelization and pipeline schedules under constraints. "employs an integer linear programming (ILP) solver"
  • Interleaved pipeline scheduling: Assigning non-contiguous layers to pipeline stages to reduce bubbles and enable finer-grained overlap. "we adopt interleaved pipeline scheduling"
  • IOMMU (Input/Output Memory Management Unit): Hardware that manages device memory accesses; passthrough mode can reduce overhead for high-performance training. "passthrough I/O memory management unit (IOMMU) configurations"
  • KV cache: Cached key/value tensors stored to accelerate autoregressive attention during long-context inference. "slashing KV cache Tay et al. (2020) usage"
  • Mixture-of-Experts (MoE): A sparse neural architecture where each input activates only a subset of expert sub-networks, enabling extreme parameter counts with limited compute. "Mixture-of-Experts (MoE) architectures"
  • Model Flops Utilization (MFU): A metric indicating how effectively a model uses the available floating-point operations of the hardware. "These advancements yield high Model Flops Utilization (MFU)"
  • Multi-Latent Attention (MLA): An attention mechanism that compresses KV representations into a latent space to reduce memory and improve compute intensity. "Multi-Latent Attention (MLA, DeepSeek-AI et al. (2024))"
  • Numerical equivalence: The property that different hardware or parallelization configurations produce effectively identical numerical results. "numerical equivalence issues may arise across varying parallelism strategies"
  • Operator fusion: Combining multiple operators into a single kernel to reduce memory traffic and kernel launch overhead. "We propose an automatic operator fusion technique"
  • Operator-wise accuracy verification: A methodology to validate the numerical correctness of individual operators to prevent error propagation. "3.1 OPERATOR-WISE ACCURACY VERIFICATION"
  • Optimizer Parallelism (OP): Distributing optimizer states and computations across devices to reduce memory overhead and scale training. "optimizer parallelism (OP, similar to Zero)"
  • Pipeline bubbles: Idle periods in pipeline parallelism created by data dependencies between stages, reducing throughput. "pipeline bubble overhead"
  • Pipeline parallelism (PP): Training large models by splitting layers into sequential stages across devices and processing micro-batches through the pipeline. "Pipeline parallelism has become a widely adopted technique"
  • ReduceScatter: A collective operation that reduces data across devices and scatters the partitioned result to each device. "AllReduce or ReduceScatter."
  • RoCE network: RDMA over Converged Ethernet; a high-performance network used for inter-node communication. "over the RoCE network"
  • RoPE: Rotary positional embeddings, a method for encoding position that can be sensitive to sharding offsets. "Offset errors in positional encodings (e.g., RoPE)"
  • Sequence Parallelism (SP): A parallelization strategy that shards the sequence dimension across devices to reduce memory and scale attention. "Sequence Parallelism (SP)"
  • Shallow-and-Wide topology: A model design with fewer layers but wider hidden sizes to increase per-layer compute density and reduce pipeline bubbles. "This "shallow-and-wide" topology enhances computational density per layer"
  • Top-k routing: Selecting the top-k experts for each token in a Mixture-of-Experts layer to activate sparsely. "top-4 to top-8 routing"
  • ULP: Unit in the Last Place; the smallest step between representable floating-point numbers, used to quantify rounding differences. "ULP-level differences"
  • Virtual Pipeline Interleaving (VPP): Running multiple interleaved virtual pipelines per device to reduce bubbles and balance load. "virtual pipeline interleaving (VPP)"
  • Recomputation: A memory-saving technique that discards intermediate activations during forward pass and recomputes them during backward pass. "selective recomputation"
  • Mixed-precision kernels: Kernels that use lower-precision formats (e.g., fp16, bf16) for speed while accumulating in higher precision to preserve accuracy. "mixed-precision kernels"

Practical Applications

Immediate Applications

The following applications can be deployed now using the paper’s methods, tooling, and architecture choices. Each item indicates sectors, a brief description, potential tools/products/workflows, and key assumptions/dependencies.

  • Cross-hardware training precision alignment workflow (software/AI infrastructure; academia)
    • Description: Systematically verify numerical equivalence when migrating training between hardware (e.g., GPU to Ascend NPU) or changing parallelism setups, reducing failed large-scale runs.
    • Tools/Products/Workflows: “Golden baseline” CPU reference harness; deterministic seeds; layer-wise tensor dumping; stepwise loss/grad-norm checks; tolerance thresholds by accumulation count.
    • Assumptions/Dependencies: Access to identical tokenization, initialization, data sharding, and MP/DP/PP settings; logging/tensor-dump support; cooperation across frameworks (e.g., MindSpore, PyTorch).
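The alignment workflow above boils down to comparing per-step training signals against a golden trace under a tolerance that loosens as rounding error accumulates. A minimal sketch of that check, assuming illustrative function names and tolerance constants (the paper's actual thresholds are not specified here):

```python
# Sketch: compare per-step loss traces from two platforms against a
# step-dependent tolerance. Names and constants are illustrative.

def step_tolerance(step, base_tol=1e-6, growth=1e-7):
    """Allowed absolute loss gap at a given step; rounding error compounds."""
    return base_tol + growth * step

def losses_aligned(ref_losses, test_losses, base_tol=1e-6, growth=1e-7):
    """True if every step's loss gap stays within the step-wise tolerance."""
    if len(ref_losses) != len(test_losses):
        return False
    return all(
        abs(r - t) <= step_tolerance(i, base_tol, growth)
        for i, (r, t) in enumerate(zip(ref_losses, test_losses))
    )

ref  = [2.3000, 2.2901, 2.2805]
test = [2.3000, 2.2901, 2.2806]   # drifts by 1e-4 at step 2
print(losses_aligned(ref, test))  # False: gap exceeds the step-2 tolerance
```

The same comparison extends naturally to gradient norms and layer-wise tensor dumps; the key design choice is making the tolerance a function of how much accumulation has occurred rather than a single global epsilon.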
  • Operator-wise accuracy CI for kernels (semiconductors; ML frameworks)
    • Description: Integrate operator-level precision tests (with accumulation-aware tolerances and extreme-value input clipping) into continuous integration for backends.
    • Tools/Products/Workflows: CPU FP32/FP64 golden references; tolerance tables by accumulation count; automated operator regression suite.
    • Assumptions/Dependencies: Backend exposes numerics knobs (accumulation dtype, casting rules); test data generators; CI infrastructure.
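A tolerance table keyed by accumulation count can be approximated analytically: for stochastic rounding-error cancellation, the error of an n-term reduction grows roughly with √n. A hedged sketch, assuming fp32 machine epsilon and an arbitrary safety factor (not the paper's actual table):

```python
import math

def accum_tolerance(n_accum, eps=2**-24, scale=8.0):
    """Absolute tolerance for a reduction of n_accum terms. Rounding error
    grows roughly with sqrt(n) under stochastic cancellation; eps defaults
    to fp32 machine epsilon, scale is an illustrative safety factor."""
    return scale * eps * math.sqrt(max(n_accum, 1))

def op_matches_golden(result, golden, n_accum):
    """CI-style check of an operator output against a CPU golden reference."""
    return abs(result - golden) <= accum_tolerance(n_accum)

# A dot product of 4096 elements tolerates a larger gap than one of 16:
print(accum_tolerance(16) < accum_tolerance(4096))  # True
```

In a real CI suite the tolerance would be calibrated per operator and dtype, and extreme-value inputs would be clipped before generation, as the item above notes.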
  • Interleaved pipeline scheduling with 1F1B overlap (software/AI infrastructure; cloud)
    • Description: Reduce pipeline bubbles and overlap comm/compute for ~10% end-to-end training throughput gain.
    • Tools/Products/Workflows: Scheduler plugin/flag in MindSpore or Megatron-style stacks; microbatching policies; pipeline interleaving configs.
    • Assumptions/Dependencies: Sufficient memory for interleaving; stable 1F1B implementation; profiling to tune micro-batch counts.
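The payoff of interleaving can be seen with a standard back-of-envelope bubble estimate (Megatron-style; this is a textbook approximation, not the paper's measured numbers): with p pipeline stages, m micro-batches, and v interleaved virtual stages per device, the idle fraction is roughly (p−1)/(v·m + p−1).

```python
def bubble_fraction(p, m, v=1):
    """Approximate idle fraction of a 1F1B pipeline with p stages,
    m micro-batches, and v interleaved virtual stages per device."""
    return (p - 1) / (v * m + p - 1)

# Interleaving shrinks the bubble: 8 stages, 32 micro-batches
print(bubble_fraction(8, 32, v=1))  # ~0.18
print(bubble_fraction(8, 32, v=2))  # ~0.10
```

Doubling v roughly halves the bubble, at the cost of extra activation memory and more communication events, which is why the item above lists memory headroom as a dependency.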
  • Attention-aware data scheduling for long-sequence training (software/AI infrastructure; media/legal/finance analytics)
    • Description: Balance sparse-attention workloads across devices by redistributing samples using subsequence/document-length metadata, improving long-context throughput (e.g., 32K–128K).
    • Tools/Products/Workflows: Custom dataloader/sampler; per-sample document-length features; scheduler for micro-batch construction.
    • Assumptions/Dependencies: Access to sequence segmentation/EoD masks; long-context attention kernels; datasets with variable document lengths.
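One simple way to realize the redistribution described above is greedy longest-processing-time assignment over a per-document attention-cost proxy. A sketch under the assumption that causal-attention cost is quadratic in document length (the paper's scheduler also handles micro-batch construction and EoD masking, which this omits):

```python
import heapq

def attention_cost(doc_len):
    """Proxy for causal-attention FLOPs of one document: quadratic in length."""
    return doc_len * doc_len

def balance_docs(doc_lens, n_devices):
    """Greedy LPT assignment: place each document (heaviest first) on the
    currently lightest device. Illustrative only."""
    heap = [(0, d) for d in range(n_devices)]  # (load, device)
    heapq.heapify(heap)
    assignment = {d: [] for d in range(n_devices)}
    for length in sorted(doc_lens, reverse=True):
        load, dev = heapq.heappop(heap)
        assignment[dev].append(length)
        heapq.heappush(heap, (load + attention_cost(length), dev))
    return assignment

docs = [8192, 4096, 4096, 2048, 2048, 2048, 1024, 512]
sched = balance_docs(docs, 2)
loads = [sum(attention_cost(x) for x in v) for v in sched.values()]
print(loads)  # one very long document dominates one device's load
```

The example also illustrates why length metadata matters: naive token-count balancing would treat one 8K document and eight 1K documents as equal work, while their attention costs differ by roughly 8×.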
  • Hierarchical, topology-aware Expert Parallelism (EP) communication (cloud/HPC networking; software/AI infrastructure)
    • Description: Replace global All-to-All with inter-node AllGather + local filtering + intra-node All-to-All to reduce redundant cross-machine traffic (~15% throughput boost under EP=16).
    • Tools/Products/Workflows: EP comm library that is topology-aware; network planning exploiting intra-node vs inter-node bandwidth.
    • Assumptions/Dependencies: Known cluster topology (e.g., RoCE between nodes, HCCS in-node); MoE with routed experts; framework hooks for custom collectives.
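The traffic saving comes from a simple accounting difference: global All-to-All crosses the network once per routed expert copy, while inter-node AllGather crosses once per peer node, with co-located duplicates filtered locally. A toy cost model (illustrative dimensions, not the paper's exact accounting):

```python
def a2a_bytes(tokens, top_k, bytes_per_tok):
    """Global All-to-All: each token crosses the network once per routed
    expert (duplicates to the same node are not merged)."""
    return tokens * top_k * bytes_per_tok

def hierarchical_bytes(tokens, n_nodes, bytes_per_tok):
    """Inter-node AllGather: each token crosses to every other node exactly
    once; copies for co-located experts are filtered on the receiving node."""
    return tokens * (n_nodes - 1) * bytes_per_tok

# 4096 tokens, top-8 routing, 4 nodes, bf16 hidden size 8192 -> 16 KiB/token
t, k, n, b = 4096, 8, 4, 8192 * 2
ratio = a2a_bytes(t, k, b) / hierarchical_bytes(t, n, b)
print(round(ratio, 2))  # 2.67
```

The advantage grows as top-k exceeds the node count minus one, i.e., exactly the high-sparsity top-4 to top-8 regime the paper targets.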
  • EP communication overlap via multi-stream scheduling (software/AI infrastructure)
    • Description: Overlap EP AllGather/All-to-All with FFN compute and with other comm streams by slicing batch/sequence dimensions; cut EP comm share from ~30% to ~5% of total.
    • Tools/Products/Workflows: Multi-stream runtime; fine-grained comm/compute scheduler; per-dimension partitioning policies.
    • Assumptions/Dependencies: Hardware/network supports parallel comm streams without mutual contention; careful stream synchronization.
  • DVM-based cross-class operator fusion (compilers; semiconductors; software/AI infrastructure)
    • Description: Fuse Cube-class (matmul) and Vector-class ops (reshape/cast/eltwise) to reduce memory traffic and kernel launches; e.g., ~85% single-kernel speedup for GroupedMatMul-Reshape-Cast.
    • Tools/Products/Workflows: Compiler pass in MindSpore/Ascend (DVM); fusion rule libraries; operator pipelines profiling.
    • Assumptions/Dependencies: Hardware-compiler stack supports on-chip L2 reuse and cross-class fusion; correctness-preserving fusion legality checks.
  • Systematic parallelization framework (auto-parallel tuner) (MLOps; cloud; academia)
    • Description: Analytical + ILP-based search for DP/TP/PP/VPP/SP/EP/OP and recomputation, reducing tuning time from ~7 days to ~0.5 day with comparable/better step time.
    • Tools/Products/Workflows: YAML model parser; analytical cost models; ILP solver; short dry-run validation; promotion to production strategy.
    • Assumptions/Dependencies: Calibrated analytical models; ILP solver availability; accurate memory ceilings; limited dry-run window.
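The planner's core loop can be sketched as a feasibility-filtered search over parallelism factorizations, scored by an analytical cost model. The toy cost and memory models below are assumptions for illustration; the actual system uses calibrated models and an ILP solver rather than enumeration:

```python
from itertools import product

def search_strategy(world_size, layers, mem_limit_gb, step_time, memory_gb):
    """Enumerate (dp, tp, pp) factorizations of the device count, keep
    feasible ones (memory fits, layers divide evenly across stages), and
    return the fastest under the supplied cost model. Illustrative only."""
    best = None
    for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3):
        if dp * tp * pp != world_size or layers % pp:
            continue
        if memory_gb(dp, tp, pp) > mem_limit_gb:
            continue
        t = step_time(dp, tp, pp)
        if best is None or t < best[0]:
            best = (t, (dp, tp, pp))
    return best

# Toy cost models (assumptions, not from the paper): TP adds comm overhead,
# PP adds bubble time, per-device memory shrinks with tp * pp.
step = lambda dp, tp, pp: 100 / dp * (1 + 0.05 * (tp - 1)) * (1 + 0.02 * (pp - 1))
mem  = lambda dp, tp, pp: 640 / (tp * pp)
print(search_strategy(64, 32, 64, step, mem))
```

Even this brute-force version captures the trade-off the paper's ILP formulation optimizes: the fastest strategy is rarely the one with maximal data parallelism once the memory ceiling is enforced.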
  • Cluster performance hardening for large-scale training (data center operations; energy/IT policy; cloud)
    • Description: CPU affinity and isolation domains to reduce host contention (variance down by up to 38%); firmware tweaks to prevent spurious idle down-clocking (throughput +25–30% at 4096 devices); IOMMU passthrough to cut monitoring overhead (step time −3–5%).
    • Tools/Products/Workflows: NUMA/CPU pinning; cgroups/isolation; vendor firmware profiles; IOMMU configuration; performance observability SOPs.
    • Assumptions/Dependencies: Admin/root access; vendor cooperation for firmware policy; monitoring stack configurability; acceptance of energy/perf trade-offs.
  • Higher-concurrency MoE inference using high-sparsity routing and MLA (software; telecom; consumer apps)
    • Description: Serve more concurrent users per device due to low activation footprint (MoE) and reduced KV cache (MLA), especially for long-context chat/search.
    • Tools/Products/Workflows: MLA-aware KV-cache management in inference servers (e.g., vLLM/TensorRT-LLM equivalents); MoE-aware batch schedulers.
    • Assumptions/Dependencies: Inference stack supports MLA and MoE routing; routing stability controls; compatibility with 32K–128K contexts.
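The concurrency gain from MLA comes almost entirely from KV-cache compression: standard multi-head attention caches full K and V per head per layer, while MLA caches one compressed latent per token. A back-of-envelope comparison with illustrative dimensions (DeepSeek-style latent sizes, not TeleChat3's actual configuration):

```python
def kv_bytes_per_token(n_layers, dims, bytes_per_el=2):
    """Per-token KV-cache footprint (bf16) given cached dims per layer."""
    return n_layers * dims * bytes_per_el

# Illustrative dims: standard MHA caches K and V for every head;
# MLA caches one compressed latent plus a decoupled RoPE key per token.
n_layers, n_heads, head_dim = 60, 64, 128
mha_dims = 2 * n_heads * head_dim        # 16384 cached values per layer
mla_dims = 512 + 64                      # latent + decoupled RoPE key
ctx = 128 * 1024                         # 128K-token context

mha_gb = kv_bytes_per_token(n_layers, mha_dims) * ctx / 2**30
mla_gb = kv_bytes_per_token(n_layers, mla_dims) * ctx / 2**30
print(f"MHA: {mha_gb:.1f} GiB  MLA: {mla_gb:.1f} GiB")
```

Under these assumed dimensions the KV cache shrinks by well over an order of magnitude, which is what makes 32K–128K contexts servable at higher per-device concurrency.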
  • Reproducibility and audit workflows for regulated training (policy/compliance; finance/healthcare)
    • Description: Use the staged alignment process (loss/gradient checks under LR=0 vs >0; optimizer equivalence) to produce audit trails for regulated model training.
    • Tools/Products/Workflows: Versioned configs; deterministic seeds; checkpoint converters; logged equivalence reports.
    • Assumptions/Dependencies: Willingness to incur small overhead for logging/dumping; organizational policy alignment.
  • Open-source model and infra adoption in research and teaching (academia)
    • Description: Use TeleChat3-MoE repos and training stack as reproducible teaching examples for frontier-scale training labs/courses.
    • Tools/Products/Workflows: Course labs on accuracy verification, pipeline scheduling, parallel strategy design; small-cluster replicas (e.g., 105B model on modest NPUs/GPUs).
    • Assumptions/Dependencies: Access to compatible hardware or emulation; appropriate curriculum integration.

Long-Term Applications

These applications are enabled or accelerated by the paper’s innovations but require further research, scaling, or ecosystem development.

  • Cross-vendor numerical equivalence certification programs (policy; standardization; semiconductors)
    • Description: Standardize “golden baseline + tolerance-by-accumulation” tests and end-to-end alignment workflows as certification for chips/frameworks running LLM training.
    • Tools/Products/Workflows: Open conformance test suite; third-party audits; reproducibility badges.
    • Assumptions/Dependencies: Multi-vendor coordination; public benchmarks and artifacts; governance.
  • Generalized auto-parallelization as a managed cloud service (cloud; MLOps)
    • Description: Offer “Auto-Parallel Tuner” that emits optimal DP/TP/PP/VPP/SP/EP/OP strategies and recomputation plans for diverse models/hardware; one-click deployment.
    • Tools/Products/Workflows: Cloud API/console; ILP-backed planner; continuous calibration to hardware/network telemetry.
    • Assumptions/Dependencies: Accurate analytical models across vendors; integration with PyTorch/DeepSpeed/Megatron and MindSpore; SLA for performance predictability.
  • Dynamic, topology-aware EP runtimes for heterogeneous clusters (HPC; cloud)
    • Description: At runtime, adapt hierarchical EP routes and overlaps to changing traffic and mixed interconnects (RoCE/InfiniBand/Ethernet), minimizing tail latency at scale.
    • Tools/Products/Workflows: Runtime topology discovery; congestion-aware schedulers; multi-path collectives.
    • Assumptions/Dependencies: Visibility into fabric metrics; advanced collective libraries; robust fault tolerance.
  • Compiler-level cross-class fusion in general-purpose ML compilers (compilers; semiconductors)
    • Description: Extend DVM-like cross-class fusion (Cube/Vector) to TVM, XLA, TorchInductor, enabling co-designed kernels for diverse backends.
    • Tools/Products/Workflows: Fusion legality/profitability models; L2 reuse annotations; end-to-end autotuning.
    • Assumptions/Dependencies: Backend support for fine-grained scheduling; alignment on IR semantics; verification for numerics.
  • Energy- and carbon-aware training schedulers (energy; data center operations; policy)
    • Description: Use firmware-policy insights (idle thresholds) with carbon-intensity signals to schedule training for minimal energy/carbon at fixed throughput targets.
    • Tools/Products/Workflows: Power/telemetry ingestion; energy-aware job schedulers; per-operator DVFS policy tuning.
    • Assumptions/Dependencies: Access to energy telemetry; chip DVFS controls; policy incentives or SLAs.
  • Domain-specialized trillion-parameter MoE pretraining (healthcare; finance; legal)
    • Description: Leverage high-sparsity MoE + infra to economically pretrain expert-heavy domain models with 32K–128K context for EHR timelines, financial filings, legal corpora.
    • Tools/Products/Workflows: Secure data pipelines; domain routers/expert design; compliance-evaluated training with reproducibility checks.
    • Assumptions/Dependencies: High-quality domain data; privacy and regulatory clearance; robust evaluation/safety.
  • Long-context on-device or edge inference using MLA (mobile; IoT; telecom)
    • Description: Deploy MLA’s compressed KV and higher compute-to-memory ratio to run longer contexts on constrained NPUs (phones, edge gateways).
    • Tools/Products/Workflows: Mobile inference runtimes with MLA kernels; memory-aware batching; partial offload to network.
    • Assumptions/Dependencies: Mobile NPU/compiler support; quantization and thermal constraints; productization effort.
  • Training-as-a-Service for MoE with cost guarantees (cloud; SaaS)
    • Description: Bundle hierarchical EP comm, overlap, and auto-parallel tuner into a managed service with price/performance SLAs for customers training 100B–1T models.
    • Tools/Products/Workflows: Cost estimators using analytical models; curated strategy catalogs; telemetry-driven continuous optimization.
    • Assumptions/Dependencies: Transparent pricing and performance modeling; multi-tenant isolation; legal terms for resource variability.
  • Adaptive attention-aware schedulers for inference (software/AI infrastructure)
    • Description: Extend training-time attention-aware scheduling to inference—co-schedule requests based on segment-length heterogeneity to stabilize p95 latencies for long-context APIs.
    • Tools/Products/Workflows: Inference queueing with document-length features; batch builders aware of sparse attention cost.
    • Assumptions/Dependencies: Accurate per-request cost predictors; admission control integration; SLA management.
  • Procurement and sovereign compute strategy guidance (policy; public sector)
    • Description: Use the paper’s cross-hardware alignment and cluster-hardening methods to inform national AI compute stacks (reproducibility, energy/perf trade-offs, firmware controls).
    • Tools/Products/Workflows: Best-practice playbooks; readiness assessments; capability audits for public tenders.
    • Assumptions/Dependencies: Government-industry collaboration; standards alignment; security reviews.
  • SME-accessible large-model fine-tuning (industry/SMBs; education)
    • Description: With improved throughput and auto-parallel tooling, enable fine-tuning of ~100B MoE models on smaller, cost-limited clusters for verticals (customer support, code assistants).
    • Tools/Products/Workflows: Prebuilt recipes on modest NPU/GPU pods; domain adapters; evaluation harnesses.
    • Assumptions/Dependencies: Availability of base MoE checkpoints; licensing; simplified ops for non-expert teams.

Open Problems

We found no open problems mentioned in this paper.
