
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models (2407.01906v2)

Published 2 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Parameter-efficient fine-tuning (PEFT) is crucial for customizing LLMs with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study the PEFT method for LLMs with the Mixture-of-Experts (MoE) architecture and the contents of this work are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and found that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the tuning efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks, thereby enhancing both the training efficiency and effectiveness. Our code is available at https://github.com/deepseek-ai/ESFT.

Expert-Specialized Fine-Tuning for Sparse Architectural LLMs

The paper, "Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural LLMs," presents a comprehensive study of parameter-efficient fine-tuning (PEFT) methods tailored to sparse-architecture LLMs that employ the Mixture-of-Experts (MoE) architecture. The research addresses a gap in existing work, which has focused primarily on dense-architecture LLMs, by proposing and evaluating a fine-tuning method designed specifically for the MoE paradigm.

Main Contributions

  1. Investigation of Expert Dispersion: The paper investigates the dispersion degree of activated experts across various customized tasks. The findings show that the routing distribution for a specific task tends to be highly concentrated, whereas the distribution of activated experts varies significantly between different tasks. This observation suggests that different tasks activate specialized combinations of experts within the MoE architecture.
  2. Expert-Specialized Fine-Tuning (ESFT): The core contribution is the introduction of Expert-Specialized Fine-Tuning (ESFT). This method focuses on tuning only the experts most relevant to the downstream task while keeping other experts and modules frozen. ESFT aims to maintain expert specialization, thereby preserving task-specific knowledge and improving tuning efficiency.
  3. Impact Analysis of MoE Architecture: The paper provides an in-depth analysis of the impact of MoE architecture on ESFT performance. It demonstrates that models using finer-grained experts allow for more effective selection of task-relevant experts, enhancing both training efficiency and effectiveness.

Methodology

Mixture-of-Experts Architecture

The MoE architecture is central to this work: each layer contains many experts that come to handle different kinds of input, and the model routes each token to a small subset of the most relevant experts, keeping computation efficient. The paper builds upon the DeepSeekMoE framework, which introduces fine-grained segmentation of experts to enhance specialization and efficiency.
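To make the routing mechanism concrete, the following is a minimal, illustrative sketch of a top-k MoE layer in PyTorch. It is not the paper's implementation; the class name, dimensions, and routing details (softmax gating, per-token top-k selection) are simplified assumptions meant only to show how tokens are dispatched to a small subset of experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k Mixture-of-Experts feed-forward layer (not the paper's code)."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # The router scores every token against every expert.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Fine-grained experts: many small feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        gate_scores = F.softmax(self.router(x), dim=-1)               # token-to-expert affinities
        topk_scores, topk_idx = gate_scores.topk(self.top_k, dim=-1)  # each token keeps its top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                                   # expert chosen in this slot, per token
            for e in idx.unique().tolist():
                mask = idx == e
                out[mask] += topk_scores[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

Each token's output is a gate-weighted sum of the outputs of its selected experts; in DeepSeekMoE-style models the experts are deliberately small ("fine-grained") so that many of them can specialize narrowly.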

Expert Relevance Scoring

Two methods for calculating expert relevance are proposed:

  • Average Gate Score (ESFT-Gate): This score averages an expert's gate (affinity) values over tokens sampled from the task, measuring how strongly the task engages that expert.
  • Token Selection Ratio (ESFT-Token): This method calculates the ratio of tokens for which an expert is selected (i.e., appears in a token's top-k routing), offering a frequency-based view of expert relevance; a sketch of both computations follows this list.
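Both scores can be derived from router outputs collected on a small sample of task data. The sketch below is a hedged illustration of that computation; the function name, tensor shapes, and normalization are assumptions chosen for clarity rather than the paper's exact formulas.

```python
import torch

def expert_relevance_scores(gate_scores: torch.Tensor, top_k: int):
    """Compute both relevance scores from router affinities on sampled task data.

    gate_scores: (num_tokens, num_experts) softmax-normalized router outputs
                 collected while running the model on the task sample.
    Returns (avg_gate_score, token_selection_ratio), each of shape (num_experts,).
    """
    num_tokens, num_experts = gate_scores.shape

    # ESFT-Gate: average affinity of each expert over all sampled tokens.
    avg_gate_score = gate_scores.mean(dim=0)

    # ESFT-Token: how often each expert appears in a token's top-k,
    # normalized over all routing selections so the ratios sum to 1.
    topk_idx = gate_scores.topk(top_k, dim=-1).indices             # (num_tokens, top_k)
    selected = torch.zeros(num_tokens, num_experts).scatter_(1, topk_idx, 1.0)
    token_selection_ratio = selected.sum(dim=0) / (num_tokens * top_k)

    return avg_gate_score, token_selection_ratio
```

Per layer, the highest-scoring experts (for example, the smallest set whose cumulative relevance passes a chosen threshold) would then be the ones selected for fine-tuning.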

Selection and Fine-Tuning

Only the most relevant experts, as determined by the relevance scores, are fine-tuned. This selective tuning aims to preserve the specialization of these experts, leading to computational efficiency with minimal loss in model performance.
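Operationally, this amounts to disabling gradients for everything except the chosen experts. The snippet below is a minimal sketch under the assumption that expert parameters can be identified from their names; the "layers.<i>.mlp.experts.<j>" pattern is illustrative, not the model's actual naming.

```python
import torch.nn as nn

def freeze_all_but_selected_experts(model: nn.Module, selected: dict[int, set[int]]) -> None:
    """Freeze every parameter except those belonging to the selected experts.

    selected maps layer index -> set of expert indices chosen by the relevance
    scores; the parameter-name pattern below is an illustrative assumption.
    """
    for name, param in model.named_parameters():
        param.requires_grad = False
        parts = name.split(".")
        # Example name this expects: "layers.12.mlp.experts.7.up_proj.weight"
        if "experts" in parts:
            layer_idx = int(parts[1])
            expert_idx = int(parts[parts.index("experts") + 1])
            if expert_idx in selected.get(layer_idx, set()):
                param.requires_grad = True
```

Only the unfrozen expert parameters would then be handed to the optimizer, which is where the reported savings in training time and checkpoint storage come from.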

Experimental Results

The evaluation encompasses two primary scenarios:

  1. Enhancement of Specific Domains: Tasks in the Math and Code domains, where fine-tuning further improves abilities the base model already possesses.
  2. Adaptation to Specialized Tasks: Evaluations on tasks such as Intent Recognition, Text Summarization, Legal Judgment Prediction, and Low-resource Translation, where fine-tuning aids in adapting to less familiar tasks.

Performance Metrics

The paper employs benchmarks like GSM8K, HumanEval, and MMLU, among others, to assess both task-specific performance and the maintenance of general abilities. The results show that ESFT not only matches but sometimes surpasses full-parameter fine-tuning (FFT) while requiring significantly fewer computational resources. Notably, ESFT demonstrated:

  • Efficiency: ESFT methods significantly reduce training time and storage space, with only slight performance trade-offs.
  • Task Specialization: ESFT maintains high task-specific performance by optimizing only the most relevant experts, mitigating the risks of overfitting and catastrophic forgetting seen in FFT.

Theoretical and Practical Implications

The findings of this paper have significant implications:

  • Practical Efficiency: ESFT offers a practical approach to fine-tuning large-scale, sparse-architecture LLMs, making it feasible to customize models for specific tasks without extensive computational resources.
  • Theoretical Insights: This work highlights the importance of expert specialization within MoE architectures, suggesting a direction for future models to leverage fine-grained expert segmentation effectively.
  • Future Developments in AI: Future AI systems can build on the framework of ESFT to dynamically and efficiently adapt to varying tasks, potentially integrating real-time learning capabilities in large-scale models.

Conclusion

The paper provides a robust framework for extending PEFT methods to sparse-architecture LLMs, notably through the ESFT approach. The insights on expert specialization within MoE models and the demonstrated efficiency of ESFT highlight its potential for advancing the customization of LLMs in a computationally efficient manner. The proposed methods set the stage for further exploration into fine-grained expert architectures and their applications in diverse AI tasks.

Authors (6)
  1. Zihan Wang (181 papers)
  2. Deli Chen (20 papers)
  3. Damai Dai (38 papers)
  4. Runxin Xu (30 papers)
  5. Zhuoshu Li (7 papers)
  6. Y. Wu (639 papers)