- The paper introduces Comet, a system using fine-grained computation-communication overlap to optimize distributed Mixture-of-Experts (MoE) execution.
- Comet achieves this via fine-grained data dependency analysis and task rescheduling, maximizing overlap between computation and communication within MoE layers.
- Evaluations report up to a 1.96x speedup for a single MoE layer and an average 1.71x speedup end-to-end; the system has been deployed on large production GPU clusters.
The paper "Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts" (2502.19811) presents a system designed to optimize the execution of large-scale Mixture-of-Experts (MoE) models in distributed environments. It addresses the significant communication overhead inherent in MoE layers, particularly the All-to-All communication required for expert routing, which can reportedly consume up to 47% of the total execution time. While prior approaches attempted to mitigate this bottleneck through coarse-grained pipelining of communication and computation, these methods often suffer from suboptimal latency hiding and can negatively impact computational efficiency. Comet proposes a fine-grained overlapping strategy based on detailed data dependency analysis and task rescheduling to improve performance.
The core challenge in scaling MoE models lies in the communication pattern introduced by the expert parallelism strategy. Typically, input tokens are routed to different experts distributed across multiple devices (e.g., GPUs). This routing necessitates an All-to-All communication primitive where each device sends token representations destined for specific experts to the devices hosting those experts, and correspondingly receives token representations for the experts it hosts. This collective communication operation becomes a major performance bottleneck as model scale and the number of devices increase.
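For concreteness, this exchange can be expressed with PyTorch's `torch.distributed` primitives. The sketch below is illustrative rather than taken from the paper: it assumes an initialized process group (with the NCCL backend, tensors must live on the GPU) and derives the variable split sizes from the gating decision.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(local_tokens, dest_rank, world_size):
    # local_tokens: [num_tokens, hidden]; dest_rank: [num_tokens] holds the
    # rank hosting each token's chosen expert (from the gating network).
    # Sort tokens by destination so each peer's slice is contiguous.
    order = torch.argsort(dest_rank)
    send_buf = local_tokens[order].contiguous()
    in_splits = torch.bincount(dest_rank, minlength=world_size)

    # Peers must first learn how many tokens they will receive from us:
    # a small All-to-All over the split counts themselves.
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits)

    # The variable-sized All-to-All that moves the token representations.
    recv_buf = send_buf.new_empty((int(out_splits.sum()), send_buf.shape[1]))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=in_splits.tolist())
    return recv_buf  # tokens destined for the experts hosted on this rank
```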
Existing solutions often employ coarse-grained pipelining. For instance, they might overlap the All-to-All communication of one MoE layer with the computation of preceding or subsequent dense layers (e.g., MLP or attention blocks), or pipeline computation and communication within the MoE layer itself at a macro level. However, the paper argues that such coarse-grained approaches are limited: they may introduce pipeline bubbles or dependencies that prevent full overlap, and their fixed scheduling might not adapt well to varying workloads or network conditions, leading to inefficient resource utilization and suboptimal latency hiding. The synchronization points required by these coarse-grained methods can also impair overall computational throughput.
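To make the limitation concrete, here is a minimal sketch (ours, not the paper's) of a coarse-grained scheme: the whole All-to-All is launched once and overlapped with a single independent dense block, so whichever of the two finishes first leaves its resource idle until the bulk synchronization point.

```python
import torch.distributed as dist

def coarse_grained_moe_step(send_buf, recv_buf, dense_block, x):
    # Launch the entire All-to-All at once, asynchronously.
    work = dist.all_to_all_single(recv_buf, send_buf, async_op=True)

    # Overlap with whatever independent computation is available,
    # e.g., a dense branch or the next micro-batch's attention.
    y = dense_block(x)

    # Single bulk synchronization point: if communication outlasts the
    # dense block, the remainder is exposed latency; if compute outlasts
    # it, the network sits idle in the meantime.
    work.wait()
    return y, recv_buf
```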
Comet Methodology: Fine-grained Overlapping
Comet introduces a fine-grained approach to overlap communication and computation within MoE layers. The key components are:
- Data Dependency Analysis: Comet performs a detailed analysis of the data dependencies between the computational tasks (e.g., gating network computation, expert computation) and communication tasks (e.g., All-to-All exchanges) within the MoE layer. This allows identifying specific, smaller units of computation and communication that can be executed concurrently without violating data dependencies. For example, the computation for a subset of tokens or a specific micro-batch might be overlapped with the communication related to a different subset.
- Task Rescheduling: Based on the dependency analysis, Comet implements a task rescheduling mechanism. Instead of executing computation and communication phases in bulk, it breaks them down into smaller, fine-grained tasks. These tasks are then dynamically scheduled to maximize the overlap between communication sends/receives and independent computational operations. The goal is to keep both the computation units (e.g., CUDA cores) and the network interfaces busy concurrently as much as possible. This fine-grained scheduling allows computation to proceed on locally available data while communication for other data chunks occurs in the background; a minimal sketch of this chunked dispatch appears after this list.
- Adaptive Workload Assignment: Comet incorporates an adaptive workload assignment strategy. This mechanism aims to balance the load across devices dynamically and mitigate potential fine-grained communication bottlenecks that might arise due to imbalances in token routing or network fluctuations. By monitoring communication progress and computational load, Comet can potentially adjust task scheduling or resource allocation to maintain high overlap efficiency across different hardware configurations, network conditions, and variations in the number of tokens routed to each expert. This adaptability is crucial for robust performance in large, heterogeneous clusters.
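The sketch below (ours; equal-sized chunks assumed for simplicity) shows the mechanism the first two bullets describe: one bulk All-to-All is split into independent per-chunk exchanges, and each chunk's expert computation is dispatched the moment its data arrives. A fuller pseudocode loop appears in the implementation section below.

```python
import torch.distributed as dist

def chunked_all_to_all(send_chunks, recv_chunks, local_experts):
    # Split one bulk All-to-All into independent per-chunk exchanges so
    # each chunk's expert compute can start as soon as that chunk lands,
    # instead of waiting for the full tensor.
    pending = [dist.all_to_all_single(r, s, async_op=True)
               for s, r in zip(send_chunks, recv_chunks)]
    outputs = [None] * len(pending)
    done = set()
    while len(done) < len(pending):  # busy-wait loop, fine for a sketch
        for i, work in enumerate(pending):
            if i not in done and work.is_completed():
                done.add(i)
                # The dependency analysis guarantees this chunk's expert
                # computation needs only this chunk's received tokens.
                outputs[i] = local_experts(recv_chunks[i])
    return outputs
```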
Implementation and System Aspects
While the paper abstract doesn't provide explicit code or low-level implementation recipes, it implies integration within distributed training frameworks commonly used for large model training. Implementing Comet would likely involve:
- Custom Communication Primitives: Potentially requiring modifications or extensions to standard collective communication libraries (like NCCL) or the development of custom communication kernels that allow for finer-grained, asynchronous operations interleaved with computation.
- Scheduler Integration: Modifying the execution scheduler of the deep learning framework (e.g., PyTorch's execution graph, custom CUDA stream management) to accommodate the dynamic, fine-grained task dependencies and rescheduling logic proposed by Comet. This involves managing dependencies between numerous small computation kernels and communication calls.
- Profiling and Analysis Tools: Development or use of profiling tools capable of visualizing and analyzing fine-grained compute-communication overlap to tune the system and verify its effectiveness (a minimal sketch follows this list).
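As one concrete option for the last point, PyTorch's built-in profiler can export a timeline trace in which communication kernels (e.g., NCCL) and compute kernels appear on their respective streams, making the achieved overlap directly visible. The wrapper below is a minimal sketch of ours, not tooling from the paper.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_moe_step(moe_step, *inputs):
    # Capture CPU and CUDA activity for one MoE step; the exported Chrome
    # trace shows communication and compute kernels on separate streams,
    # so the degree of overlap can be inspected visually.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        out = moe_step(*inputs)
        torch.cuda.synchronize()
    prof.export_chrome_trace("moe_overlap_trace.json")
    return out
```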
The system architecture conceptually involves intercepting the MoE layer execution, decomposing its operations based on the dependency analysis, and utilizing an optimized scheduler to dispatch these fine-grained tasks onto computational resources and network interfaces, ensuring maximal concurrency.
The pseudocode below illustrates this conceptually. All helper functions (gate, prepare_send_buffers, non_blocking_all_to_all, check_completed_comms, and so on) are illustrative placeholders rather than an actual API:

```python
def moe_forward_overlapped(local_tokens):
    # Gating: decide which expert (and hence which device) receives each token.
    routing_indices, routing_weights = gate(local_tokens)

    # Partition outgoing tokens into per-chunk send buffers so each chunk's
    # All-to-All can complete, and be consumed, independently.
    send_buffers = prepare_send_buffers(local_tokens, routing_indices)
    recv_buffers = allocate_recv_buffers()

    compute_streams = create_compute_streams()
    comm_stream = create_comm_stream()

    # Launch one non-blocking All-to-All per chunk on the communication stream.
    comm_handles = []
    with stream(comm_stream):
        for send_buf, recv_buf in zip(send_buffers, recv_buffers):
            comm_handles.append(non_blocking_all_to_all(send_buf, recv_buf))

    # Overlap: as soon as a chunk's communication finishes, dispatch its
    # expert computation while the remaining chunks are still in flight.
    expert_outputs = {}
    processed = set()
    while len(processed) < len(comm_handles):
        for idx in check_completed_comms(comm_handles):
            if idx in processed:
                continue  # this chunk's compute was already dispatched
            processed.add(idx)
            tokens_for_local_experts = get_data_from_recv_buffer(recv_buffers, idx)
            with stream(compute_streams[idx % len(compute_streams)]):
                expert_outputs[idx] = experts(tokens_for_local_experts)
        # Optionally schedule other independent work here while waiting, e.g.
        # schedule_more_independent_compute(compute_streams)

    synchronize_streams(comm_stream)
    synchronize_streams(compute_streams)

    # The reverse All-to-All returning expert outputs to their source ranks
    # is omitted here for brevity.
    final_output = combine_expert_outputs(expert_outputs, routing_weights)
    return final_output
```
Evaluation and Results
Comet's performance was evaluated against baseline MoE implementations, likely those found in popular frameworks employing coarse-grained overlapping. The key results reported are significant speedups:
- Single MoE Layer: Comet achieves up to 1.96x speedup in execution time compared to baselines.
- End-to-End Model Execution: For entire models incorporating MoE layers, Comet delivers an average speedup of 1.71x.
These evaluations were conducted on large-scale GPU clusters. Notably, the paper claims that Comet has been successfully deployed in production environments involving ten-thousand-scale GPU clusters, leading to substantial computational savings quantified as millions of GPU hours. This suggests practical viability and effectiveness at scale, beyond typical academic benchmark scenarios. The adaptability feature likely plays a crucial role in achieving robust performance across such large and potentially heterogeneous systems.
Conclusion
Comet (2502.19811) offers a refined approach to optimizing distributed MoE execution by replacing coarse-grained pipelining with fine-grained computation-communication overlapping. Through detailed dependency analysis, task rescheduling, and adaptive workload assignment, it aims to maximize hardware utilization and significantly reduce the latency impact of the inherent All-to-All communication bottleneck. The reported speedups (up to 1.96x for MoE layers, 1.71x end-to-end) and successful large-scale production deployment underscore its potential as an effective optimization technique for training and deploying increasingly large MoE models.