- The paper demonstrates that Tessera reduces networking costs by up to 2.3x while achieving near-ideal training speeds for MoE models.
- The paper introduces a dual-mode architecture that leverages optical circuit switching for high-bandwidth local traffic and electrical packet switching for global connectivity.
- The paper's evaluations show that Tessera scales to data-center-scale deployments, dynamically adapting to complex MoE traffic patterns.
An Efficient and Scalable Fabric for Mixture-of-Experts Training
The paper presents a novel system, named Tessera, designed to address the communication challenges posed by the dynamic computation patterns of Mixture-of-Experts (MoE) models. These models have gained traction in the machine learning community for improving the performance-per-cost of LLMs, but they require an adaptable and efficient communication infrastructure. Tessera proposes a regionally reconfigurable network architecture that optimizes distributed MoE training by adapting to the models' distinctive traffic patterns.
MoE Models and Their Communication Challenges
MoE models improve computational efficiency by selectively activating different subnetworks, known as experts, for each input token. This selective activation produces non-uniform, non-deterministic communication patterns that challenge traditional GPU interconnects, which are typically static. In MoE architectures, communication traffic arises primarily from expert parallelism (EP), where tokens must be exchanged in large-scale all-to-all collectives so that each token reaches the GPUs hosting its selected experts. Meeting these demands is crucial for keeping training time and resource usage in check in large model deployments.
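To make the traffic pattern concrete, the following is a minimal, self-contained sketch (not from the paper; the gate, sizes, and expert placement are illustrative assumptions) of top-k expert routing. Each token picks its top-k experts, and the resulting per-expert token counts, which vary from batch to batch, determine how much data each GPU must ship in the expert-parallel all-to-all.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's configuration).
num_tokens, hidden_dim = 4096, 1024
num_experts, top_k = 16, 2
gpus = 8                                  # experts sharded across GPUs (expert parallelism)
experts_per_gpu = num_experts // gpus

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(num_tokens, num_experts))  # stand-in for a learned gate

# Top-k routing: each token is dispatched to its k highest-scoring experts.
chosen = np.argsort(router_logits, axis=1)[:, -top_k:]

# Tokens destined for each expert, and therefore for each expert-hosting GPU
# (here experts are assumed to be placed contiguously on GPUs).
tokens_per_expert = np.bincount(chosen.ravel(), minlength=num_experts)
tokens_per_gpu = tokens_per_expert.reshape(gpus, experts_per_gpu).sum(axis=1)

# The all-to-all payload toward each destination is proportional to these counts;
# because they depend on the input data, the traffic matrix changes every step.
bytes_per_token = hidden_dim * 2          # fp16 activations
print("tokens per destination GPU:", tokens_per_gpu)
print("per-destination payload (MB):", tokens_per_gpu * bytes_per_token / 2**20)
```

In a real training framework this dispatch is carried out with an all-to-all collective (e.g., torch.distributed.all_to_all); the sketch only illustrates that the per-destination volumes are data-dependent rather than fixed.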
Tessera's Architectural Design
To address these challenges, Tessera implements an innovative network fabric incorporating regionally reconfigurable optical circuit switching (OCS). This approach leverages the locality inherent in MoE communication patterns, allowing dynamic adaptation with minimal latency:
- Regionally Reconfigurable High-Bandwidth Domains: Tessera introduces regionally scoped OCS domains that absorb the heavy traffic of MoE training by dynamically reallocating optical bandwidth according to demand.
- Traffic Adaptation: Tessera's architecture is informed by detailed measurements of MoE traffic patterns, which reveal strong locality and temporally stable regions of interest. This enables an OCS fabric that can be reconfigured during training, something global OCS designs typically cannot do because of their reconfiguration latency.
- Integrative System Design: Tessera retains electrical packet switching (EPS) for global connectivity while using OCS to relieve bandwidth bottlenecks, shifting high-demand traffic off the EPS as needed (see the illustrative sketch immediately after this list).
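The division of labor between OCS and EPS can be pictured with a small planning sketch. This is not Tessera's algorithm (the paper's reconfiguration policy is its own contribution); it is a hypothetical greedy heuristic, under assumed per-node port counts, showing the general idea: within a region, dedicate optical circuits to the heaviest GPU pairs and let the remaining, lighter traffic fall back to the electrical packet-switched fabric.

```python
import numpy as np

def plan_region(traffic, circuits_per_node):
    """Toy planner for one reconfigurable region: give optical circuits to the
    heaviest node pairs, leave the rest of the demand on the EPS.
    A greedy heuristic used purely for illustration; Tessera's actual
    reconfiguration policy is described in the paper and not reproduced here."""
    n = traffic.shape[0]
    remaining = traffic.copy()
    ports_left = np.full(n, circuits_per_node)
    ocs_links = []

    # Repeatedly pick the heaviest pair that still has free optical ports on both ends.
    while True:
        candidates = [(remaining[i, j], i, j)
                      for i in range(n) for j in range(i + 1, n)
                      if ports_left[i] > 0 and ports_left[j] > 0 and remaining[i, j] > 0]
        if not candidates:
            break
        load, i, j = max(candidates)
        ocs_links.append((i, j, float(load)))
        ports_left[i] -= 1
        ports_left[j] -= 1
        remaining[i, j] = remaining[j, i] = 0   # this demand now rides an optical circuit

    eps_load = remaining.sum() / 2              # whatever is left falls back to the EPS
    return ocs_links, eps_load

# Example: an 8-node region with a skewed, expert-parallel-like demand matrix.
rng = np.random.default_rng(1)
demand = rng.exponential(scale=1.0, size=(8, 8))
demand = (demand + demand.T) / 2
np.fill_diagonal(demand, 0)

links, eps_load = plan_region(demand, circuits_per_node=2)
print("optical circuits:", links)
print("traffic left on EPS:", round(float(eps_load), 2))
```

The sketch captures the intended split: heavy, localized expert-parallel flows ride dedicated optical circuits inside a region, while the long tail of lighter, global traffic stays on the always-connected packet-switched network.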
Experimental Evaluation
Through experimental evaluation, Tessera demonstrates significant improvements in MoE training efficiency:
- Cost Efficiency: Compared to traditional fat-tree topologies, Tessera reduces networking costs, provisioning bandwidth up to 2.3x more cost-efficiently.
- Performance: Tessera matches the training speed of ideal but cost-intensive networks such as non-blocking fat trees, while significantly outperforming less adaptive solutions such as TopoOpt, particularly under heavy expert-parallelism traffic.
- Scalability: In simulations involving thousands of GPUs, Tessera remains effective at data-center scale.
Implications and Future Prospects
Tessera's results point toward hybrid optical-electrical interconnects tailored to the demands of modern machine learning workloads, particularly those with dynamic computation patterns such as MoE models. By aligning the interconnect architecture with these models' communication patterns, substantial gains in both cost and performance efficiency are possible. Such systems may later incorporate more advanced materials and switching technologies, extending these benefits as ML models continue to grow in size and complexity. Furthermore, Tessera's measurements of MoE communication demands could serve as a blueprint for future distributed training systems, emphasizing adaptability and regional optimization in network design.