Exploring the Benefits of Training Expert Language Models over Instruction Tuning (2302.03202v2)

Published 7 Feb 2023 in cs.CL

Abstract: Recently, Language Models (LMs) instruction-tuned on multiple tasks, also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks. Previous work has shown that scaling the number of training tasks is the key component in making stronger MT LMs. In this work, we report an unexpected finding that an expert LM fine-tuned on just a single task can outperform an MT LM trained with 300+ different tasks on 11 different unseen datasets and on 13 datasets of the BIG-bench benchmark by a mean accuracy of 3.20% and 1.29%, respectively. This finding casts doubt on the previously held belief that simply scaling the number of tasks makes stronger MT LMs. Leveraging this finding, we further show that this distributed approach of training a separate expert LM per training task instead of a single MT LM for zero-shot inference possesses many benefits including (1) avoiding negative task transfer that often occurs during instruction tuning, (2) being able to continually learn new tasks without having to re-train on previous tasks to avoid catastrophic forgetting, and (3) showing compositional capabilities when merging individual experts together. The code is available at https://github.com/joeljang/ELM.

Citations (70)

Summary

  • The paper demonstrates that single-task expert LMs can outperform multitask LMs, improving mean accuracy by 3.20% on 11 unseen datasets and by 1.29% on 13 BIG-bench datasets.
  • The paper highlights a distributed expert LM approach that avoids negative task transfer and catastrophic forgetting while enhancing continual learning.
  • The research challenges conventional wisdom by showing that task-specific specialization offers a more efficient and adaptable alternative to extensive instruction tuning.

Evaluating Single-Task Expert LMs and Their Generalization Capabilities

The recent surge in the performance of LLMs can be attributed in part to instruction tuning, a technique that fine-tunes language models (LMs) on many tasks with prompted instructions. Such multitask-prompted fine-tuning has produced models with robust generalization to unseen tasks, and the prevalent consensus has been that the efficacy of multitask LMs (MT LMs) increases with the number of training tasks. The paper by Jang et al., "Exploring the Benefits of Training Expert Language Models over Instruction Tuning," challenges this principle by presenting empirical evidence that expert LMs, models fine-tuned on a single task, can perform comparably to, or better than, MT LMs across various unseen tasks.
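To make the two training regimes concrete, the sketch below contrasts the data an MT LM sees (prompted examples pooled across many tasks) with the data a single-task expert sees. The task names, prompt templates, and the commented-out `finetune` helper are illustrative assumptions, not the paper's actual data pipeline.

```python
# Illustrative sketch (not the paper's pipeline): both regimes fine-tune the same
# base seq2seq LM; they differ only in which prompted examples they are trained on.

def to_prompted_example(template: str, fields: dict, target: str) -> dict:
    """Render a natural-language instruction prompt and pair it with its target."""
    return {"input": template.format(**fields), "target": target}

# Hypothetical prompted datasets: one list of (fields, target) pairs per task.
task_datasets = {
    "nli": [({"premise": "A man is cooking.", "hypothesis": "A person prepares food."}, "entailment")],
    "sentiment": [({"text": "A delightful, clever film."}, "positive")],
    "qa": [({"question": "Who wrote Hamlet?", "context": "Hamlet is a play by Shakespeare."}, "Shakespeare")],
}
templates = {
    "nli": "Premise: {premise} Hypothesis: {hypothesis} Does the premise entail the hypothesis?",
    "sentiment": "Review: {text} Is this review positive or negative?",
    "qa": "Context: {context} Question: {question}",
}

# Multitask-prompted fine-tuning (MT): one model, examples pooled across all tasks.
mt_training_set = [
    to_prompted_example(templates[task], fields, target)
    for task, examples in task_datasets.items()
    for fields, target in examples
]

# Expert LMs: one model per task, each trained only on that task's prompted examples.
expert_training_sets = {
    task: [to_prompted_example(templates[task], fields, target) for fields, target in examples]
    for task, examples in task_datasets.items()
}

# mt_model = finetune(base_lm, mt_training_set)                              # hypothetical helper
# experts = {t: finetune(base_lm, d) for t, d in expert_training_sets.items()}
```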

Significant Findings

The authors illustrate a pivotal finding: an expert LM, fine-tuned on a single task, can outperform a multitask LM fine-tuned on over 300 tasks in terms of mean accuracy on 11 unseen datasets and on 13 datasets from the BIG-bench benchmark. Specifically, a single-task expert surpasses T0-3B (an established MT LM) by 3.20% mean accuracy on the 11 unseen datasets and by 1.29% on the 13 BIG-bench datasets. This outcome raises questions about the strategy of increasing the number of tasks to enhance the generalization capabilities of multitask LMs.

Furthermore, the paper demonstrates the advantages of a distributed approach that trains a distinct expert LM for each task. This approach avoids negative task transfer, enables continual learning of new tasks without re-training on previous tasks and without catastrophic forgetting, and supports compositional task-solving by merging individual experts.
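As a rough illustration of the compositional benefit, one simple way to combine experts that share a base checkpoint is to average their fine-tuned parameters. The sketch below assumes each expert's weights were saved as a PyTorch state_dict at the hypothetical paths shown; it is a generic merging scheme and may differ from the paper's exact procedure.

```python
import torch

# Hypothetical checkpoint paths; each expert was fine-tuned from the same base LM
# and saved with torch.save(model.state_dict(), path).
expert_paths = ["expert_nli.pt", "expert_sentiment.pt", "expert_qa.pt"]
state_dicts = [torch.load(path, map_location="cpu") for path in expert_paths]

# Uniform parameter averaging: the experts share the same parameter names and shapes,
# so the merged model gets one tensor per name, averaged element-wise across experts.
merged = {
    name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    for name in state_dicts[0]
}

torch.save(merged, "merged_expert.pt")
# The merged state_dict is then loaded back into the shared base architecture, e.g.:
# model.load_state_dict(merged)
```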

Implications and Speculations

The implications of this research are notable. Theoretically, it challenges the perception that scaling the number of tasks is a necessary condition for a model to excel at diverse unseen tasks. Practically, the approach is more economical: adding a capability means training one new expert rather than re-running multitask fine-tuning over the full task mixture, which promotes adaptability and agility in evolving task environments.

Moving forward, one could envision further exploration into expert retrieval mechanisms at inference time, which could leverage learned representations or a dedicated retrieval model to route each input to a suitable expert. Moreover, the prospect of task-compositional or federated learning opens up intriguing possibilities for individual LMs to collaborate, integrate capabilities, and achieve combinations that a single monolithic model might struggle to match.
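As one purely illustrative possibility for such a retrieval mechanism, an incoming instruction could be embedded and matched against a stored embedding per expert, with the best-scoring expert handling generation. The `embed` function, expert names, and the idea of using mean instruction embeddings are assumptions made for this sketch, not the paper's method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_expert(query_embedding: np.ndarray,
                    expert_embeddings: dict[str, np.ndarray]) -> str:
    """Return the name of the expert whose stored embedding best matches the query."""
    return max(expert_embeddings,
               key=lambda name: cosine_similarity(query_embedding, expert_embeddings[name]))

# Hypothetical usage: `embed` could be any sentence encoder, and each expert's stored
# embedding could be the mean embedding of that expert's training instructions.
# expert_embeddings = {"nli": embed(nli_prompts), "qa": embed(qa_prompts)}
# chosen = retrieve_expert(embed("Does the premise entail the hypothesis? ..."),
#                          expert_embeddings)
# prediction = experts[chosen].generate(...)
```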

Future Directions

The research paves the way for several future directions. One is extending the expert-based approach to LMs beyond 11B parameters, where greater capacity might reduce the incidence of negative task transfer and let large-scale multitask learning pay off. Another is refining retrieval mechanisms so that unseen tasks are dynamically routed to the most appropriate experts.

Overall, this paper reframes how the computational linguistics community may think about fine-tuning strategies for LLMs. By valuing specialization and task-specific fine-tuning within a distributed framework, it charts a course for more flexible, efficient, and theoretically nuanced language modeling paradigms.