Emergent Mind


Recently, there has been a widespread proliferation of "expert" language models that are specialized to a specific task or domain through parameter-efficient fine-tuning. How can we recycle large collections of expert language models to improve zero-shot generalization to unseen tasks? In this work, we propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), which learns to route among specialized modules that were produced through parameter-efficient fine-tuning. Unlike past methods that learn to route among specialized models, PHATGOOSE explores the possibility that zero-shot generalization will be improved if different experts can be adaptively chosen for each token and at each layer in the model. Crucially, our method is post-hoc - it does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained. In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access). To better understand the routing strategy learned by PHATGOOSE, we perform qualitative experiments to validate that PHATGOOSE's performance stems from its ability to make adaptive per-token and per-module expert choices. We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts.


  • PHATGOOSE introduces an innovative approach for improving zero-shot generalization by routing among specialized language models without needing their original training data.

  • The method employs post-hoc, tokenwise gating on specialized models that have been fine-tuned using parameter-efficient techniques, aiming for flexible use of experts' knowledge.

  • PHATGOOSE outperforms existing post-hoc routing methods and, in some cases, explicit multitask training on zero-shot generalization benchmarks.

  • The approach suggests a promising future for model development, emphasizing decentralized efforts and the potential for diverse, effective routing strategies.


The paper discusses a novel approach named Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), designed for recycling large collections of specialized expert language models to improve zero-shot generalization to unseen tasks. This method contrasts with traditional approaches by offering a more flexible, efficient, and post-hoc strategy for leveraging a wealth of pre-existing specialized models, without requiring simultaneous access to the datasets used for their training. The authors rigorously evaluate PHATGOOSE across a range of benchmarks and against several baselines, demonstrating its efficacy in enhancing zero-shot generalization capabilities.


PHATGOOSE routes among specialized modules produced through parameter-efficient fine-tuning (PEFT) methods. It introduces a gate-training step that is applied post-hoc, meaning after each expert model is trained. This step trains a sigmoid gate for each module, which determines whether or not a given activation should use the PEFT module. Unlike past methods that learn to route among whole specialized models, PHATGOOSE chooses experts per token and per module, aiming to generalize better by drawing on different experts at different layers and for different pieces of the input.
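
The gating step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the PEFT module's effect can be written as an additive weight update (as with LoRA), and the names `base_w`, `expert_delta`, and `gate_v` are hypothetical. Only `gate_v` would be trained in the gate-training step, while the base weights and the expert module stay frozen.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_module_output(base_w, expert_delta, gate_v, activation):
    """Base layer output plus the expert's contribution, scaled by a sigmoid gate.

    base_w       -- frozen pretrained weight matrix (hypothetical name)
    expert_delta -- frozen update learned by the PEFT module (assumed additive)
    gate_v       -- gate vector; the only trainable parameters in this sketch
    activation   -- input activation for a single token
    """
    # The gate is a scalar in (0, 1) computed from this token's activation.
    gate = sigmoid(sum(v * x for v, x in zip(gate_v, activation)))
    base_out = matvec(base_w, activation)
    expert_out = matvec(expert_delta, activation)
    output = [b + gate * e for b, e in zip(base_out, expert_out)]
    return output, gate


# Toy example: a zero gate vector gives a gate value of exactly 0.5.
base_w = [[1.0, 0.0], [0.0, 1.0]]
expert_delta = [[0.5, 0.0], [0.0, 0.5]]
output, gate = gated_module_output(base_w, expert_delta, [0.0, 0.0], [2.0, 4.0])
print(gate, output)  # 0.5 [2.5, 5.0]
```

Because the gate depends on the token's activation, the same module can be used heavily for some tokens and barely at all for others.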


The experiments demonstrate that PHATGOOSE outperforms existing methods for post-hoc routing and, in some cases, even explicit multitask training, across different specialized model collections and zero-shot generalization benchmarks. In the T0 Held-In setting, PHATGOOSE nearly matches the performance of an oracle routing scheme, and it shows significant improvements on the T0 Held-Out tasks. When the pool of experts is expanded in the FLAN setting, PHATGOOSE's relative performance improves further, suggesting that the method scales to larger sets of expert models.


A qualitative analysis of PHATGOOSE's performance reveals it can learn diverse routing strategies that differ from simple oracle routing yet still perform effectively. This flexibility points to the model's ability to combine abilities from multiple experts, tailoring its routing strategy to the specific demands of each task or input token. Such adaptability is crucial for improving zero-shot generalization performance, as shown through experiments where PHATGOOSE outperforms retrieval-based methods and static merging strategies.
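
To make the idea of per-token routing concrete, the sketch below picks experts independently for each token. It is an illustration under stated assumptions, not the procedure from the paper: it assumes each expert is represented by one gate vector, scores experts by cosine similarity with the token's activation, keeps the top `k`, and weights them with a softmax. The function names and the scoring rule are hypothetical choices made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def route_token(activation, gate_vectors, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts for one token.

    activation   -- activation vector for a single token
    gate_vectors -- one vector per expert (assumed to come from gate training)
    k            -- number of experts to keep for this token
    """
    act = normalize(activation)
    # Cosine similarity between the token and each expert's gate vector.
    scores = [sum(a * g for a, g in zip(act, normalize(gv))) for gv in gate_vectors]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Three toy experts; two tokens from the same input pick different experts.
gate_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
for token in ([3.0, 0.1], [0.1, 3.0]):
    chosen = route_token(token, gate_vectors, k=2)
    print([index for index, _ in chosen])
# prints [0, 1] then [1, 0]
```

Unlike static merging, which applies one fixed combination of experts to every input, or retrieval, which selects one expert per example, this kind of rule can change its choice from token to token and from module to module.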

Implications and Future Work

PHATGOOSE's performance offers promising implications for the future of model development, especially in the context of decentralized, collaborative efforts. By allowing individual contributors to improve zero-shot generalization capabilities of a model without needing to access centralized, massive compute resources or datasets, PHATGOOSE democratizes the process of creating generalist AI systems. The authors suggest that future work could explore applying PHATGOOSE to other model architectures and investigate its performance with heterogeneous module architectures, potentially yielding even further gains in efficiency and effectiveness.


In conclusion, PHATGOOSE is a notable step forward in using collections of specialized expert models to improve zero-shot generalization. Its adaptive, tokenwise, post-hoc approach to routing proves flexible and performs well across various settings, in some cases even compared with explicit multitask training. As the AI field moves towards more decentralized and collaborative model development, PHATGOOSE offers an effective and efficient way to enhance generalist language models by recycling specialized expertise.
