Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models (2305.14705v2)

Published 24 May 2023 in cs.CL

Abstract: Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to LLMs without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenarios), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance LLMs in the framework of task-agnostic learning.

Introduction

In AI and NLP, LLMs have significantly advanced the field, enabling a better understanding of human language. The prevalent approach to enhancing model performance across tasks has been to make these models larger and more sophisticated. However, the size and complexity of such models also result in a substantial increase in computational cost. Mixture-of-Experts (MoE), which incorporates sparsity within neural networks, and instruction tuning, which involves refining model behavior to follow instructions, are two emerging strategies that aim to maximize LLM efficiency and effectiveness. This paper examines the convergence of these two techniques, demonstrating their synergistic potential for scaling the benefits of LLMs while keeping computational overhead in check.

Method

The authors introduce an approach that merges sparse MoE architectures with instruction tuning. MoE models incorporate several sub-models, or "experts," each attuned to specific parts of the data; a learned router activates only a few experts per token, allowing targeted and efficient computation. Dense models, by contrast, apply all of their parameters to every input, so added capacity always comes with added compute. MoE models, however, tend to falter when fine-tuning data is limited. Instruction tuning addresses this shortcoming by equipping these models to better accommodate instruction-based tasks.
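
To make the routing idea concrete, the sketch below shows a token-level top-k sparse MoE feed-forward layer in PyTorch. It is a minimal illustration of the general mechanism described above, not FLAN-MoE's actual implementation; the class name, expert count, and layer sizes are placeholders.

```python
# Minimal sketch of a token-level top-k sparse MoE feed-forward layer.
# Illustrative only: not the paper's FLAN-MoE implementation; all names
# and dimensions below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten into a stream of tokens.
        tokens = x.reshape(-1, x.shape[-1])
        gate_probs = F.softmax(self.router(tokens), dim=-1)      # (T, E)
        top_p, top_idx = gate_probs.topk(self.top_k, dim=-1)     # (T, k)
        out = torch.zeros_like(tokens)
        # Each token is processed only by its k selected experts,
        # weighted by the corresponding router probabilities.
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                                   # (T, k)
            token_ids = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            weight = (top_p * mask)[token_ids].sum(dim=-1, keepdim=True)
            out[token_ids] += weight * expert(tokens[token_ids])
        return out.reshape(x.shape)
```

Because each token activates only top_k of the num_experts sub-networks, the parameter count grows with the number of experts while per-token compute stays roughly constant, which is the property the paper exploits to add capacity without raising inference cost.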

Experiment

The paper presents an empirical investigation into the beneficial interaction between sparse MoE methods and instruction tuning using the developed model FLAN-MoE. The model was subjected to a series of evaluations, including direct fine-tuning on individual tasks and instruction tuning, with benchmarks covering natural language understanding, reasoning, question answering, and other NLP tasks. The results are used to assess the gains brought about by combining MoE and instruction tuning. Notably, FLAN-MoE significantly outperformed its dense counterparts in instruction-tuning scenarios and demonstrated comparable or superior task performance while utilizing fewer computational resources.

Discussion

In this paper, the integration of two distinct but complementary approaches, MoE models and instruction tuning, yields remarkable improvements in LLM performance on a range of language tasks. FLAN-MoE advances the field by increasing model efficiency, improving generalization to unseen tasks, and scaling capacity without a corresponding rise in computation. The paper provides valuable insights into the optimal configuration of gating mechanisms, the role of auxiliary loss during finetuning, and the model's resilience to overfitting. While FLAN-MoE sets new benchmarks in task performance, it also highlights challenges such as multilingual task handling, indicating future research directions. This work prompts a reevaluation of the design principles for scalable, high-performance LLMs and sets a precedent for combining sparse neural network topologies with adaptive, instruction-following capabilities.
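
The auxiliary loss referred to above is, in the Switch Transformer and ST-MoE line of work that FLAN-MoE builds on, a load-balancing term that discourages the router from collapsing onto a few experts. A rough sketch of that commonly used form follows; the exact coefficient and formulation used for FLAN-MoE may differ.

```python
# Rough sketch of the load-balancing auxiliary loss popularized by the
# Switch Transformer / ST-MoE line of work; treat the coefficient and the
# exact formulation as assumptions rather than the paper's precise recipe.
import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs: torch.Tensor,
                        expert_index: torch.Tensor,
                        num_experts: int,
                        alpha: float = 0.01) -> torch.Tensor:
    """router_probs: (tokens, num_experts) softmax outputs of the router.
    expert_index:  (tokens,) index of the top-1 expert chosen per token."""
    # f_i: fraction of tokens actually dispatched to expert i.
    dispatch = F.one_hot(expert_index, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to expert i.
    mean_probs = router_probs.mean(dim=0)
    # Scaled dot product: small when routing is balanced, large when a few
    # experts receive most of the tokens and probability mass.
    return alpha * num_experts * torch.sum(tokens_per_expert * mean_probs)
```

Added to the task loss during training or finetuning, this term stays near alpha when tokens are spread evenly across experts and grows as routing concentrates, which is one of the gating-related design choices the paper examines.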

Authors

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou