- The paper demonstrates that Tessera reduces networking costs by up to 2.3x while achieving near-ideal training speeds for MoE models.
- The paper introduces a dual-mode architecture that leverages optical circuit switching for high-bandwidth local traffic and electrical packet switching for global connectivity.
- The paper's evaluations show that Tessera scales to data-center-scale deployments, dynamically adapting to complex MoE traffic patterns.
An Efficient and Scalable Fabric for Mixture-of-Experts Training
The paper presents a novel system, named Tessera, designed to address the communication challenges posed by the dynamic computation patterns of Mixture-of-Experts (MoE) models. These models have gained traction in the machine learning community for improving the performance-per-cost of LLMs, but they require an adaptable and efficient communication infrastructure. Tessera proposes a regionally reconfigurable network architecture that optimizes distributed MoE training by adapting to the models' distinctive traffic patterns.
MoE Models and Their Communication Challenges
MoE models improve computational efficiency by selectively activating different subnetworks, known as experts, for each input token. This selective activation produces non-uniform, non-deterministic communication patterns that challenge traditional GPU interconnects, which are typically static. In MoE architectures, communication traffic arises primarily from expert parallelism (EP), where tokens must be exchanged in large-scale all-to-all collectives so that each token reaches the GPUs hosting its selected experts. Meeting these demands is crucial for keeping training time and resource usage in check in large model deployments.
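To make the traffic pattern concrete, the following is a minimal, self-contained sketch (not from the paper; the gate, sizes, and expert placement are illustrative assumptions) of top-k expert routing. Each token picks its top-k experts, and the resulting per-expert token counts, which vary from batch to batch, determine how much data each GPU must ship in the expert-parallel all-to-all.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's configuration).
num_tokens, hidden_dim = 4096, 1024
num_experts, top_k = 16, 2
gpus = 8                                  # experts sharded across GPUs (expert parallelism)
experts_per_gpu = num_experts // gpus

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(num_tokens, num_experts))  # stand-in for a learned gate

# Top-k routing: each token is dispatched to its k highest-scoring experts.
chosen = np.argsort(router_logits, axis=1)[:, -top_k:]

# Tokens destined for each expert, and therefore for each expert-hosting GPU
# (here experts are assumed to be placed contiguously on GPUs).
tokens_per_expert = np.bincount(chosen.ravel(), minlength=num_experts)
tokens_per_gpu = tokens_per_expert.reshape(gpus, experts_per_gpu).sum(axis=1)

# The all-to-all payload toward each destination is proportional to these counts;
# because they depend on the input data, the traffic matrix changes every step.
bytes_per_token = hidden_dim * 2          # fp16 activations
print("tokens per destination GPU:", tokens_per_gpu)
print("per-destination payload (MB):", tokens_per_gpu * bytes_per_token / 2**20)
```

In a real training framework this dispatch is carried out with an all-to-all collective (e.g., torch.distributed.all_to_all); the sketch only illustrates that the per-destination volumes are data-dependent rather than fixed.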
Tessera's Architectural Design
To address these challenges, Tessera implements an innovative network fabric incorporating regionally reconfigurable optical circuit switching (OCS). This approach leverages the locality inherent in MoE communication patterns, allowing dynamic adaptation with minimal latency:
- Regionally Reconfigurable High-Bandwidth Domains: Tessera introduces regionally scoped OCS domains that absorb the heavy traffic of MoE training by dynamically reallocating optical bandwidth according to demand.
- Traffic Adaptation: Tessera's architecture is informed by detailed measurements of MoE traffic patterns, which reveal strong locality and temporally stable regions of interest. This enables an OCS fabric that can be reconfigured during training, something global OCS designs typically cannot do because of their reconfiguration latency.
- Integrative System Design: Tessera retains electrical packet switching (EPS) for global connectivity while using OCS to relieve bandwidth bottlenecks, shifting high-demand traffic off the EPS as needed (see the illustrative sketch immediately after this list).
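The division of labor between OCS and EPS can be pictured with a small planning sketch. This is not Tessera's algorithm (the paper's reconfiguration policy is its own contribution); it is a hypothetical greedy heuristic, under assumed per-node port counts, showing the general idea: within a region, dedicate optical circuits to the heaviest GPU pairs and let the remaining, lighter traffic fall back to the electrical packet-switched fabric.

```python
import numpy as np

def plan_region(traffic, circuits_per_node):
    """Toy planner for one reconfigurable region: give optical circuits to the
    heaviest node pairs, leave the rest of the demand on the EPS.
    A greedy heuristic used purely for illustration; Tessera's actual
    reconfiguration policy is described in the paper and not reproduced here."""
    n = traffic.shape[0]
    remaining = traffic.copy()
    ports_left = np.full(n, circuits_per_node)
    ocs_links = []

    # Repeatedly pick the heaviest pair that still has free optical ports on both ends.
    while True:
        candidates = [(remaining[i, j], i, j)
                      for i in range(n) for j in range(i + 1, n)
                      if ports_left[i] > 0 and ports_left[j] > 0 and remaining[i, j] > 0]
        if not candidates:
            break
        load, i, j = max(candidates)
        ocs_links.append((i, j, float(load)))
        ports_left[i] -= 1
        ports_left[j] -= 1
        remaining[i, j] = remaining[j, i] = 0   # this demand now rides an optical circuit

    eps_load = remaining.sum() / 2              # whatever is left falls back to the EPS
    return ocs_links, eps_load

# Example: an 8-node region with a skewed, expert-parallel-like demand matrix.
rng = np.random.default_rng(1)
demand = rng.exponential(scale=1.0, size=(8, 8))
demand = (demand + demand.T) / 2
np.fill_diagonal(demand, 0)

links, eps_load = plan_region(demand, circuits_per_node=2)
print("optical circuits:", links)
print("traffic left on EPS:", round(float(eps_load), 2))
```

The sketch captures the intended split: heavy, localized expert-parallel flows ride dedicated optical circuits inside a region, while the long tail of lighter, global traffic stays on the always-connected packet-switched network.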
Experimental Evaluation
Through experimental evaluation, Tessera demonstrates significant improvements in MoE training efficiency:
- Cost Efficiency: Compared to traditional fat-tree topologies, Tessera reduces networking costs, provisioning bandwidth up to 2.3x more cost-efficiently.
- Performance: Tessera matches the training speed of ideal but cost-intensive networks such as non-blocking fat trees, while significantly outperforming less adaptive solutions such as TopoOpt, particularly under heavy expert-parallelism traffic.
- Scalability: In simulations involving thousands of GPUs, Tessera remains effective at data-center scale.
Implications and Future Prospects
Tessera's results point toward hybrid optical-electrical interconnects tailored to the demands of modern machine learning workloads, particularly those with dynamic computation patterns such as MoE models. By aligning the interconnect architecture with these models' communication patterns, substantial gains in both cost and performance efficiency are possible. Such systems may later incorporate more advanced materials and switching technologies, extending these benefits as ML models continue to grow in size and complexity. Furthermore, Tessera's measurements of MoE communication demands could serve as a blueprint for future distributed training systems, emphasizing adaptability and regional optimization in network design.