Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design (2410.19123v1)
Abstract: The proliferation of LLMs has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in cost. In this paper, we propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose the experts, we examine the widely adopted layer-wise router design, show its redundancy, and introduce a pre-gating router decoupled from the MoE backbone that enables system-friendly pre-computing and lookahead scheduling, enhancing expert-aware batching and caching. Our co-design therefore addresses critical gaps on both the algorithmic and system fronts, establishing a scalable and efficient alternative for LLM inference in resource-constrained settings. Read-ME outperforms other popular open-source dense models of similar scale, achieving improvements of up to 10.1% on MMLU and reducing mean end-to-end latency by up to 6.1%. Code is available at: https://github.com/VITA-Group/READ-ME.
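The abstract's central architectural idea, a pre-gating router decoupled from the MoE backbone whose per-layer routing decisions can be precomputed and used for lookahead scheduling, can be illustrated with a short sketch. This is a minimal illustration under our own assumptions (the `PreGatingRouter` and `ExpertAwareScheduler` names, the top-1 routing choice, and all dimensions are hypothetical), not the authors' implementation; the repository linked above contains the actual code.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a pre-gating
# router decoupled from the MoE backbone: expert choices for every MoE layer are
# predicted once, before the backbone runs, so a serving system can prefetch the
# experts that will be needed and batch together tokens that share an expert.
import torch
import torch.nn as nn


class PreGatingRouter(nn.Module):
    """Predicts, per token, which expert each MoE layer will use (top-1 here)."""

    def __init__(self, hidden_dim: int, num_layers: int, num_experts: int):
        super().__init__()
        # One lightweight linear head per MoE layer, independent of the backbone.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, num_experts) for _ in range(num_layers)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) -> expert ids: (batch, num_layers)
        logits = torch.stack([head(hidden) for head in self.heads], dim=1)
        return logits.argmax(dim=-1)


class ExpertAwareScheduler:
    """Groups tokens by their predicted expert for a given layer, so each
    expert's weights are loaded once and reused for all tokens routed to it."""

    def schedule(self, expert_ids: torch.Tensor, layer: int):
        ids = expert_ids[:, layer]
        # Returns {expert_id: tensor of token indices assigned to that expert}.
        return {int(e): (ids == e).nonzero(as_tuple=True)[0] for e in ids.unique()}


if __name__ == "__main__":
    router = PreGatingRouter(hidden_dim=64, num_layers=4, num_experts=8)
    hidden = torch.randn(16, 64)              # 16 tokens' hidden states
    plan = router(hidden)                     # routing plan for all MoE layers
    batches = ExpertAwareScheduler().schedule(plan, layer=0)
    print({e: idx.tolist() for e, idx in batches.items()})
```

Because the routing plan for every layer is available before the backbone executes, the scheduler can cache or prefetch only the experts that will actually be touched and co-batch tokens that share them, which is the system-side benefit the abstract describes.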