Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models (2305.14705v2)

Published 24 May 2023 in cs.CL

Abstract: Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to LLMs without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenarios), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance LLMs in the framework of task-agnostic learning.

Introduction

In AI and NLP, LLMs have significantly advanced the field, enabling a better understanding of human language. The prevalent approach to enhancing model performance across tasks has been to make these models larger and more sophisticated. However, the size and complexity of such models also result in a substantial increase in computational cost. Mixture-of-Experts (MoE), which incorporates sparsity within neural networks, and instruction tuning, which involves refining model behavior to follow instructions, are two emerging strategies that aim to maximize LLM efficiency and effectiveness. This paper examines the convergence of these two techniques, demonstrating their synergistic potential for scaling the benefits of LLMs while keeping computational overhead in check.

Method

The authors introduce an approach that merges sparse MoE architectures with instruction tuning. MoE models incorporate several sub-models, or "experts," each attuned to specific parts of the data; a learned router activates only a few experts per token, allowing targeted and efficient computation. Dense models, by contrast, apply all of their parameters to every input, so added capacity always comes with added compute. MoE models, however, tend to falter when fine-tuning data is limited. Instruction tuning addresses this shortcoming by equipping these models to better accommodate instruction-based tasks.
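
To make the routing idea concrete, the sketch below shows a token-level top-k sparse MoE feed-forward layer in PyTorch. It is a minimal illustration of the general mechanism described above, not FLAN-MoE's actual implementation; the class name, expert count, and layer sizes are placeholders.

```python
# Minimal sketch of a token-level top-k sparse MoE feed-forward layer.
# Illustrative only: not the paper's FLAN-MoE implementation; all names
# and dimensions below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten into a stream of tokens.
        tokens = x.reshape(-1, x.shape[-1])
        gate_probs = F.softmax(self.router(tokens), dim=-1)      # (T, E)
        top_p, top_idx = gate_probs.topk(self.top_k, dim=-1)     # (T, k)
        out = torch.zeros_like(tokens)
        # Each token is processed only by its k selected experts,
        # weighted by the corresponding router probabilities.
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                                   # (T, k)
            token_ids = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            weight = (top_p * mask)[token_ids].sum(dim=-1, keepdim=True)
            out[token_ids] += weight * expert(tokens[token_ids])
        return out.reshape(x.shape)
```

Because each token activates only top_k of the num_experts sub-networks, the parameter count grows with the number of experts while per-token compute stays roughly constant, which is the property the paper exploits to add capacity without raising inference cost.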

Experiment

The paper presents an empirical investigation into the beneficial interaction between sparse MoE methods and instruction tuning using the developed model FLAN-MoE. The model was subjected to a series of evaluations, including direct fine-tuning on individual tasks and instruction tuning, with benchmarks covering natural language understanding, reasoning, question answering, and other NLP tasks. The results are used to assess the gains brought about by combining MoE and instruction tuning. Notably, FLAN-MoE significantly outperformed its dense counterparts in instruction-tuning scenarios and demonstrated comparable or superior task performance while utilizing fewer computational resources.

Discussion

In this paper, the integration of two distinct but complementary approaches, MoE models and instruction tuning, yields remarkable improvements in LLM performance on a range of language tasks. FLAN-MoE advances the field by increasing model efficiency, improving generalization to unseen tasks, and scaling capacity without a corresponding rise in computation. The paper provides valuable insights into the optimal configuration of gating mechanisms, the role of auxiliary loss during finetuning, and the model's resilience to overfitting. While FLAN-MoE sets new benchmarks in task performance, it also highlights challenges such as multilingual task handling, indicating future research directions. This work prompts a reevaluation of the design principles for scalable, high-performance LLMs and sets a precedent for combining sparse neural network topologies with adaptive, instruction-following capabilities.
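
The auxiliary loss referred to above is, in the Switch Transformer and ST-MoE line of work that FLAN-MoE builds on, a load-balancing term that discourages the router from collapsing onto a few experts. A rough sketch of that commonly used form follows; the exact coefficient and formulation used for FLAN-MoE may differ.

```python
# Rough sketch of the load-balancing auxiliary loss popularized by the
# Switch Transformer / ST-MoE line of work; treat the coefficient and the
# exact formulation as assumptions rather than the paper's precise recipe.
import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs: torch.Tensor,
                        expert_index: torch.Tensor,
                        num_experts: int,
                        alpha: float = 0.01) -> torch.Tensor:
    """router_probs: (tokens, num_experts) softmax outputs of the router.
    expert_index:  (tokens,) index of the top-1 expert chosen per token."""
    # f_i: fraction of tokens actually dispatched to expert i.
    dispatch = F.one_hot(expert_index, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to expert i.
    mean_probs = router_probs.mean(dim=0)
    # Scaled dot product: small when routing is balanced, large when a few
    # experts receive most of the tokens and probability mass.
    return alpha * num_experts * torch.sum(tokens_per_expert * mean_probs)
```

Added to the task loss during training or finetuning, this term stays near alpha when tokens are spread evenly across experts and grows as routing concentrates, which is one of the gating-related design choices the paper examines.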

Authors

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou