Learning to Route Among Specialized Experts for Zero-Shot Generalization (2402.05859v2)

Published 8 Feb 2024 in cs.LG

Abstract: Recently, there has been a widespread proliferation of "expert" LLMs that are specialized to a specific task or domain through parameter-efficient fine-tuning. How can we recycle large collections of expert LLMs to improve zero-shot generalization to unseen tasks? In this work, we propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), which learns to route among specialized modules that were produced through parameter-efficient fine-tuning. Unlike past methods that learn to route among specialized models, PHATGOOSE explores the possibility that zero-shot generalization will be improved if different experts can be adaptively chosen for each token and at each layer in the model. Crucially, our method is post-hoc - it does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained. In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access). To better understand the routing strategy learned by PHATGOOSE, we perform qualitative experiments to validate that PHATGOOSE's performance stems from its ability to make adaptive per-token and per-module expert choices. We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts.

Summary

The paper introduces PHATGOOSE, a novel post-hoc tokenwise gating mechanism to route among specialized expert models for enhanced zero-shot generalization.
It leverages parameter-efficient fine-tuning and dynamic gating to adapt per token and per module, outperforming traditional and retrieval-based routing methods.
Experiments show that PHATGOOSE scales effectively across diverse expert pools, nearly matching oracle routing on many held-out tasks.

Introduction

The paper presents Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), a method for recycling large collections of specialized expert LLMs to improve zero-shot generalization to unseen tasks. Unlike explicit multitask training, the method is post-hoc: it does not require simultaneous access to the datasets used to train the experts, and it needs only a modest amount of additional compute after each expert is trained. The authors evaluate PHATGOOSE on a range of benchmarks against several baselines and report improved zero-shot generalization.

Approach

PHATGOOSE routes among specialized modules produced through parameter-efficient fine-tuning (PEFT) methods. It introduces a gate-training step that is applied post-hoc, meaning after each expert model is trained. This step trains a sigmoid gate for each module, which determines whether or not a given activation should use the PEFT module. Unlike past methods that route among whole specialized models, PHATGOOSE makes routing decisions per token and per module, aiming to generalize better by drawing on different experts at different layers or for different parts of the input.
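The gate-training step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the PEFT module is a LoRA-style low-rank update, and the names `gated_lora_forward`, `W`, `A`, `B`, and `v` are hypothetical. The only trainable quantity in this step would be the gate vector `v`; the base weight and the expert's parameters stay frozen.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_lora_forward(u, W, A, B, v):
    """Forward pass for one linear layer with a sigmoid-gated PEFT module.

    Assumed shapes (illustrative, not taken from the paper):
      u: (tokens, d_in)  input activations
      W: (d_in, d_out)   frozen base weight
      A: (d_in, r)       frozen low-rank factor of the expert
      B: (r, d_out)      frozen low-rank factor of the expert
      v: (d_in,)         gate vector, the only parameter trained post-hoc
    """
    gate = sigmoid(u @ v)        # one scalar in (0, 1) per token
    base = u @ W                 # output of the frozen base layer
    expert = (u @ A) @ B         # output of the frozen expert module
    return base + gate[:, None] * expert, gate

rng = np.random.default_rng(0)
tokens, d_in, d_out, r = 4, 8, 8, 2
u = rng.normal(size=(tokens, d_in))
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))
v = np.zeros(d_in)               # zero gate vector: sigmoid(0) = 0.5 for every token

out, gate = gated_lora_forward(u, W, A, B, v)
```

Because the gate is computed from each token's activation, different tokens in the same sequence can use the expert to different degrees, which is what makes the later routing tokenwise.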

Performance

The experiments show that PHATGOOSE outperforms existing methods for post-hoc routing and, in some cases, even explicit multitask training, across different specialized model collections and zero-shot generalization benchmarks. In the T0 Held-In setting, PHATGOOSE nearly matches the performance of an oracle routing scheme, with significant improvements visible on the T0 Held-Out tasks. When the pool of experts is expanded in the FLAN setting, PHATGOOSE's relative performance improves further, which suggests that it scales to larger sets of expert models.

Analysis

A qualitative analysis of PHATGOOSE's performance reveals it can learn diverse routing strategies that differ from simple oracle routing yet still perform effectively. This flexibility points to the model's ability to combine abilities from multiple experts, tailoring its routing strategy to the specific demands of each task or input token. Such adaptability is crucial for improving zero-shot generalization performance, as shown through experiments where PHATGOOSE outperforms retrieval-based methods and static merging strategies.
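To make per-token, per-module routing concrete, here is a small sketch of how a single layer might pick and mix experts for each token. The summary does not spell out the routing rule, so everything specific here is an assumption: scoring by dot product with one gate vector per expert, keeping the top `k` experts, and weighting them with a softmax. The function names `route_tokens` and `mix_experts` are hypothetical.

```python
import numpy as np

def route_tokens(u, gate_vectors, k=2):
    """Pick the top-k experts for every token at one layer.

    u:            (tokens, d)   activations entering the layer
    gate_vectors: (experts, d)  one learned gate vector per expert (assumed)
    Returns expert indices and softmax weights, both of shape (tokens, k).
    """
    scores = u @ gate_vectors.T                            # (tokens, experts)
    top_idx = np.argsort(-scores, axis=1)[:, :k]           # highest scores first
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    top_scores = top_scores - top_scores.max(axis=1, keepdims=True)  # stability
    weights = np.exp(top_scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights

def mix_experts(u, W, experts, top_idx, weights):
    """Add the weighted outputs of each token's selected experts to the base layer."""
    out = u @ W
    for t in range(u.shape[0]):
        for j, e in enumerate(top_idx[t]):
            A, B = experts[e]                              # low-rank factors of expert e
            out[t] += weights[t, j] * ((u[t] @ A) @ B)
    return out

rng = np.random.default_rng(0)
d, r, n_experts = 4, 2, 3
gate_vectors = np.eye(n_experts, d)                        # toy gates: one axis per expert
u = np.array([[5.0, 0.0, 0.0, 0.0],                        # aligned with expert 0
              [0.0, 0.0, 5.0, 0.0]])                       # aligned with expert 2
W = rng.normal(size=(d, d))
experts = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(n_experts)]

top_idx, weights = route_tokens(u, gate_vectors, k=2)
out = mix_experts(u, W, experts, top_idx, weights)
```

In this toy setup the two tokens are sent primarily to different experts within the same layer, something that routing a whole input to a single expert model cannot express. Running the same procedure independently at every module would let the choice also vary by layer.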

Implications and Future Work

PHATGOOSE's performance offers promising implications for the future of model development, especially in the context of decentralized, collaborative efforts. By allowing individual contributors to improve zero-shot generalization capabilities of a model without needing to access centralized, massive compute resources or datasets, PHATGOOSE democratizes the process of creating generalist AI systems. The authors suggest that future work could explore applying PHATGOOSE to other model architectures and investigate its performance with heterogeneous module architectures, potentially yielding even further gains in efficiency and effectiveness.

Conclusion

In conclusion, PHATGOOSE shows that collections of specialized expert models can be recycled to improve zero-shot generalization. Its routing is adaptive, tokenwise, and post-hoc, and in the reported experiments it outperforms past post-hoc routing methods and, in some cases, explicit multitask training. As the AI field moves towards more decentralized and collaborative model development strategies, PHATGOOSE offers an effective and efficient pathway for enhancing the capabilities of generalist LLMs through the recycling of specialized expertise.
