
Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection (2406.00023v3)

Published 24 May 2024 in cs.CL

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for LLMs, offering unprecedented computational efficiency. However, these architectures grapple with challenges of token distribution imbalance and expert homogenization, impeding optimal semantic generalization. We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; (3) a module that determines the lower bound of expert capacity based on dynamic token distribution analysis, specifically designed to address drop-and-pad strategies. The framework is also integrated with an orthogonal feature extraction module and an optimized loss function for expert localization. It effectively reduces expert homogeneity while enhancing the performance of the expert selection module. Additionally, we introduce a local expert strategy that simultaneously improves load balancing and reduces network communication overhead. It achieves a 40% reduction in the tokens processed by each expert without compromising model convergence or efficacy. When coupled with communication optimizations, training efficiency improvements of 5.4% to 46.6% can be observed. After supervised fine-tuning, it exhibits performance gains of 9.7% to 14.1% across the GDAD, GPQA, and TeleQnA benchmarks.
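To make the "bidirectional selection" idea concrete, here is a minimal sketch of mutual token–expert routing in NumPy. This is an illustration of the general concept, not the paper's actual algorithm: the dimensions, the linear `gate`, and the names `top_k` and `capacity` are all hypothetical, and a pair is routed only when the token nominates the expert *and* the expert accepts the token, loosely mirroring the resonance-style agreement the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, not taken from the paper.
num_tokens, d_model, num_experts = 8, 16, 4
top_k = 2     # experts each token nominates (token-side selection)
capacity = 4  # tokens each expert may accept (expert-side selection)

tokens = rng.standard_normal((num_tokens, d_model))
gate = rng.standard_normal((d_model, num_experts))  # lightweight linear router

# Affinity scores between every token and every expert.
scores = tokens @ gate  # shape (num_tokens, num_experts)

# Token side: each token nominates its top_k highest-affinity experts.
token_choice = np.argsort(-scores, axis=1)[:, :top_k]

# Expert side: each expert accepts at most `capacity` of its
# highest-affinity tokens.
expert_choice = np.argsort(-scores, axis=0)[:capacity, :]

# Bidirectional agreement: route a (token, expert) pair only when
# both sides selected each other.
routed = []
for t in range(num_tokens):
    for e in token_choice[t]:
        if t in expert_choice[:, e]:
            routed.append((t, int(e)))

print(routed)
```

Because acceptance is capped on the expert side, no expert can receive more than `capacity` tokens, which is one simple way the mutual-selection view bounds per-expert load without a separate drop-and-pad pass.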
