
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models (2506.18945v1)

Published 23 Jun 2025 in cs.LG and cs.CL

Abstract: We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.

Summary

  • The paper introduces a novel Chain-of-Experts (CoE) approach that implements sequential expert communication, dramatically increasing the diversity of expert combinations.
  • The paper demonstrates that CoE reduces validation loss from 1.20 to 1.12 on math reasoning tasks and matches the performance of wider MoE models with up to 42% memory savings.
  • The paper shows that iteration-specific routers and inner residuals are critical for stable training and improved expert specialization in sparse activation regimes.

Chain-of-Experts: Sequential Expert Communication for Enhanced Mixture-of-Experts Models

The "Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models" (2506.18945) paper introduces a novel architectural paradigm for Mixture-of-Experts (MoE) models, termed Chain-of-Experts (CoE). This approach fundamentally rethinks the intra-layer structure of MoE by enabling sequential, communicative expert processing, as opposed to the conventional parallel and independent expert activation. The work is motivated by the observation that, while MoE architectures have enabled efficient scaling of LLMs through sparse expert activation, the lack of explicit expert interaction may limit their representational and reasoning capacity, particularly for tasks requiring multi-step or compositional reasoning.

Core Contributions

The principal innovation of CoE is the introduction of iterative expert communication within each MoE layer. Instead of routing each token through a fixed set of experts in parallel, CoE processes tokens through a chain of expert steps, where each step can dynamically select a new set of experts based on the intermediate representation produced by the previous step. This is achieved via:

  • Iteration-specific routers: Each communication step within a layer has its own router, allowing tokens to adaptively select different experts at each iteration.
  • Residual connections at each step: Inner residuals stabilize the iterative refinement of token representations.
  • Flexible scaling axis: The number of communication steps (depth through iteration) serves as a new scaling dimension, complementing traditional width (number of experts) and depth (number of layers).

Formally, for C communication steps, the token representation is updated as follows:

x = input_token
for t in range(C):
    # Iteration-specific router computes gating weights for this step
    g = routers[t](x)
    # Select the top K // C experts for this step (keeps total compute matched with standard MoE)
    selected_experts = top_k_experts(g, K // C)
    # Aggregate the gate-weighted outputs of the selected experts
    expert_out = sum(g[i] * experts[i](x) for i in selected_experts)
    # Apply the inner residual connection
    x = expert_out + x
output = x

This design increases the diversity of expert combinations from C(n, 2k) in standard MoE to C(n, k)^2 in CoE for two steps, where n is the number of experts and k is the number selected per step.
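To make the counting argument concrete, the following short Python check compares the two combination counts for illustrative values of n and k (these values are hypothetical and are not the configuration behind the paper's reported 823× figure):

from math import comb

n, k = 64, 4  # illustrative: 64 experts, 4 selected per step

moe_combinations = comb(n, 2 * k)   # one router picks 2k experts at once
coe_combinations = comb(n, k) ** 2  # two routers each pick k experts

print(moe_combinations)                     # 4426165368
print(coe_combinations)                     # 403702661376
print(coe_combinations / moe_combinations)  # ~91x more distinct combinations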

Empirical Results

The paper presents a comprehensive empirical evaluation on both general-domain (SlimPajama) and reasoning-focused (MetaMathQA) datasets. Key findings include:

  • Improved validation loss: On math reasoning tasks, CoE reduces validation loss from 1.20 to 1.12 compared to standard MoE under matched compute.
  • Superior scaling efficiency: With 2× communication steps, CoE matches the performance of 3× expert width scaling, while reducing memory usage by 17.6–42%.
  • Enhanced expert specialization: CoE enables up to 823× more effective expert combinations, as evidenced by co-activation analyses.
  • Robustness in sparse regimes: The benefits of CoE are most pronounced when expert routing is sparse; dense expert activation diminishes the advantage of sequential communication.

Ablation studies confirm that both iteration-specific gating and inner residuals are critical for the observed gains. Removing either component leads to significant degradation in convergence and final performance.

Theoretical and Practical Implications

Theoretically, CoE increases the combinatorial flexibility of expert selection and introduces implicit depth via iterative expert processing. This enables richer compositional reasoning and more expressive representations without increasing parameter count or memory footprint. The architecture aligns with recent findings that deeper computation pathways correlate with improved reasoning, especially in domains such as mathematics and logic.

Practically, CoE offers a new axis for scaling sparse models. Instead of increasing the number of experts or layers—which can be prohibitive in terms of memory and compute—CoE leverages expert reuse and sequential communication to achieve comparable or superior performance. This is particularly advantageous in compute-constrained environments or when deploying models on hardware with limited parallelism.

However, the sequential nature of CoE introduces moderate time overhead due to reduced matrix multiplication parallelism per iteration. This can slow training and inference unless low-level optimizations are employed. Additionally, CoE is not directly compatible with pretrained MoE checkpoints, necessitating training from scratch.
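As a rough illustration of where this overhead comes from, the following sketch (with hypothetical layer sizes, not the paper's configuration) counts the per-token expert matmul FLOPs for a standard MoE layer and a compute-matched CoE layer:

# Hypothetical layer sizes for illustration only.
d_model, d_ff = 1024, 4096   # expert input and hidden dimensions
K, C = 8, 2                  # total expert selections per token; CoE communication steps

flops_per_expert = 2 * d_model * d_ff * 2     # up- and down-projection matmuls (2*d_model*d_ff each)

moe_flops = K * flops_per_expert              # K experts evaluated in one parallel batch
coe_flops = C * (K // C) * flops_per_expert   # K//C experts per step, across C dependent steps

assert moe_flops == coe_flops                 # compute parity holds when C divides K
# The difference is scheduling: CoE's steps must run one after another,
# so each step launches smaller matmuls and exposes less parallelism.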

Implementation Considerations

Implementing CoE in practice involves several key modifications to standard MoE frameworks:

  • Router design: Each iteration within a layer requires an independent router, which must be efficiently implemented to avoid bottlenecks.
  • Expert selection: The top-K/C selection per iteration must be balanced to maintain overall compute parity with standard MoE.
  • Residual connections: Inner residuals should be applied at each step to stabilize training.
  • Parallelization: While expert computation within each step can be parallelized, the steps themselves are sequential, which may require custom scheduling for optimal hardware utilization.

A high-level architecture diagram for a CoE layer is as follows:

Input Token
    │
[Iteration 1]
    │
Router_1 → Select Experts_1 → Aggregate Outputs
    │
+ Residual
    │
[Iteration 2]
    │
Router_2 → Select Experts_2 → Aggregate Outputs
    │
+ Residual
    │
...
    │
Output Token
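
For readers who want something more concrete than the diagram, the following is a minimal PyTorch sketch of a CoE layer under the assumptions above: iteration-specific routers, top-K/C expert selection per step, and an inner residual after every step. The class and parameter names are our own illustrations, not the authors' released implementation (see the linked repository for that):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Minimal sketch of a CoE layer: C sequential routing steps with inner residuals."""

    def __init__(self, d_model, d_ff, num_experts, top_k, num_steps):
        super().__init__()
        assert top_k % num_steps == 0, "top_k must be divisible by num_steps for compute parity"
        self.num_steps = num_steps
        self.k_per_step = top_k // num_steps
        # One independent router per communication step (iteration-specific gating).
        self.routers = nn.ModuleList(
            [nn.Linear(d_model, num_experts, bias=False) for _ in range(num_steps)]
        )
        # A shared pool of experts, reusable across steps.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):
        # x: (num_tokens, d_model); expert batching is kept naive for readability.
        for step in range(self.num_steps):
            gate_probs = F.softmax(self.routers[step](x), dim=-1)        # (num_tokens, num_experts)
            weights, indices = gate_probs.topk(self.k_per_step, dim=-1)  # top-K/C per step
            weights = weights / weights.sum(dim=-1, keepdim=True)        # renormalize selected gates

            expert_out = torch.zeros_like(x)
            for slot in range(self.k_per_step):
                for e, expert in enumerate(self.experts):
                    mask = indices[:, slot] == e                         # tokens routed to expert e
                    if mask.any():
                        expert_out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])

            x = expert_out + x                                           # inner residual per step
        return x

layer = ChainOfExpertsLayer(d_model=256, d_ff=1024, num_experts=16, top_k=4, num_steps=2)
out = layer(torch.randn(8, 256))  # (8, 256)

The nested per-expert loop is written for clarity rather than speed; practical implementations batch tokens by expert and fuse the gating, which is exactly where the custom scheduling mentioned above becomes important.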

Limitations and Future Directions

The current implementation of CoE has been validated primarily on single-device setups and with a limited number of communication steps (typically two). Scaling to larger models, more communication steps, and distributed training remains an open challenge. The impact of CoE on broader domains, downstream tasks, and compatibility with advanced training formats (e.g., FP8) warrants further investigation.

Future research directions include:

  • Scaling laws: Systematic study of CoE under large-scale training to assess whether its advantages persist.
  • Deeper iterative depth: Exploring the effect of increasing the number of communication steps per layer.
  • Hybrid architectures: Combining CoE with inter-layer expert sharing (e.g., MoEUT) for enhanced parameter efficiency.
  • Hardware optimization: Developing low-level kernels to mitigate the sequential overhead of CoE.

Conclusion

Chain-of-Experts represents a significant step toward more communicative and compositional sparse architectures. By enabling sequential expert interaction within each layer, CoE achieves improved performance, efficiency, and specialization under fixed compute budgets. The work highlights the importance of information flow and expert communication in modular neural architectures, suggesting new directions for the design of scalable, efficient, and reasoning-capable foundation models.