PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning (2407.21571v1)

Published 31 Jul 2024 in cs.CL and cs.AI

Abstract: LLMs encounter significant challenges in continual learning due to catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce the PMoE, Progressive Mixture of Experts with Asymmetric Transformer, which aims to minimize forgetting by utilizing an asymmetric design with shallow layers dedicated to general knowledge and deep layers for new knowledge. PMoE incorporates progressively added experts in deep layers and a router that allocates new knowledge to the appropriate experts efficiently. The router, positioned adjacent to the deep layers, utilizes deep features aggregating consolidated information. This enables the router to perform efficiently, allocating new knowledge to the appropriate experts, which progressively increase in the deep layers. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.

References (39)
  1. PIQA: Reasoning about physical commonsense in natural language. Preprint, arXiv:1911.11641.
  2. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  3. BoolQ: Exploring the surprising difficulty of natural yes/no questions. Preprint, arXiv:1905.10044.
  4. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
  5. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32.
  6. LoRAMoE: Alleviate world knowledge forgetting in large language models via MoE-style plugin. Preprint, arXiv:2312.09979.
  7. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
  8. Parameter-efficient transfer learning for NLP. Preprint, arXiv:1902.00751.
  9. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  10. MeetingBank: A benchmark dataset for meeting summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16409–16423, Toronto, Canada. Association for Computational Linguistics.
  11. Scaling laws for neural language models. Preprint, arXiv:2001.08361.
  12. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
  13. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
  14. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
  15. Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.
  16. David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30.
  17. Learn to explain: Multimodal reasoning via thought chains for science question answering. Preprint, arXiv:2209.09513.
  18. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. Preprint, arXiv:2102.04664.
  19. Investigating forgetting in pre-trained representations through continual learning. arXiv preprint arXiv:2305.05968.
  20. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  21. Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
  22. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3505–3523, Dublin, Ireland. Association for Computational Linguistics.
  23. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314.
  24. A new dataset and efficient baselines for document-level text simplification in German. In Proceedings of the Third Workshop on New Frontiers in Summarization, pages 152–161, Online and in Dominican Republic. Association for Computational Linguistics.
  25. Progressive neural networks. arXiv preprint arXiv:1606.04671.
  26. Trillion dollar words: A new financial dataset, task & market analysis. Preprint, arXiv:2305.07972.
  27. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  28. Challenging big-bench tasks and whether chain-of-thought can solve them. Preprint, arXiv:2210.09261.
  29. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  30. Label words are anchors: An information flow perspective for understanding in-context learning. Preprint, arXiv:2305.14160.
  31. A comprehensive survey of continual learning: Theory, method and application. Preprint, arXiv:2302.00487.
  32. Orthogonal subspace learning for language model continual learning. arXiv preprint arXiv:2310.14152.
  33. TRACE: A comprehensive benchmark for continual learning in large language models. Preprint, arXiv:2310.06762.
  34. Rehearsal-free continual language learning via efficient parameter isolation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10933–10946.
  35. Pretrained language model in continual learning: A comparative study. In International Conference on Learning Representations 2022. OpenReview.
  36. Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations. OpenReview.
  37. C-STANCE: A large dataset for Chinese zero-shot stance detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13369–13385, Toronto, Canada. Association for Computational Linguistics.
  38. LIMA: Less is more for alignment. Preprint, arXiv:2305.11206.
  39. SiRA: Sparse mixture of low rank adaptation. arXiv preprint arXiv:2311.09179.
Authors (2)
  1. Min Jae Jung (2 papers)
  2. Joohee Kim (6 papers)

Summary

Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning

The paper introduces Progressive Mixture of Experts (PMoE), an architecture built around an asymmetric transformer design that targets catastrophic forgetting in continual learning for LLMs. The asymmetry divides responsibility by depth: shallow layers preserve general knowledge, while deep layers are dedicated to learning new, task-specific information. The aim is to retain previously acquired knowledge while absorbing new tasks, improving both resource efficiency and the overall utility of the model.
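
As a rough illustration of this asymmetric split, the PyTorch sketch below freezes the shallow layers and leaves only the deep layers trainable. It is not the authors' implementation: the toy decoder layer, the model width, and the split point `n_shallow` are assumptions chosen only to make the idea concrete.

```python
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """Stand-in for a transformer decoder layer (attention details omitted)."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.relu(self.ff(x))


class AsymmetricLM(nn.Module):
    """Shallow layers frozen (general knowledge); deep layers left trainable."""

    def __init__(self, n_layers: int = 8, n_shallow: int = 6, d_model: int = 64):
        super().__init__()
        self.shallow = nn.ModuleList([ToyDecoderLayer(d_model) for _ in range(n_shallow)])
        self.deep = nn.ModuleList([ToyDecoderLayer(d_model) for _ in range(n_layers - n_shallow)])
        # Freeze the shallow stack so previously acquired knowledge is untouched.
        for p in self.shallow.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.shallow:
            x = layer(x)
        for layer in self.deep:  # only these layers adapt to new tasks
            x = layer(x)
        return x


if __name__ == "__main__":
    model = AsymmetricLM()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable} / {total} parameters")
```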

The core of PMoE is a mixture-of-experts design in which specialized experts are progressively added to the deep layers as new tasks arrive, so new-task parameters can be trained without overwriting those that encode earlier knowledge. A routing network positioned adjacent to the deep layers exploits their aggregated features to direct incoming information to the appropriate expert.
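
To make the routing and progressive-expert mechanism concrete, here is a minimal PyTorch sketch, again not the paper's implementation: the LoRA-style experts, the soft (dense) routing over all experts, and the `add_expert` hook called once per task are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    """A LoRA-style low-rank adapter playing the role of one expert."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # new experts start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class ProgressiveMoELayer(nn.Module):
    """Deep-layer block: experts are appended per task; a router mixes them."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.d_model = d_model
        self.experts = nn.ModuleList()
        self.router = None  # built the first time an expert is added

    def add_expert(self) -> None:
        """Call once per new task: append an expert and widen the router."""
        self.experts.append(LowRankExpert(self.d_model))
        new_router = nn.Linear(self.d_model, len(self.experts), bias=False)
        if self.router is not None:  # keep routing logits learned for old experts
            with torch.no_grad():
                new_router.weight[: self.router.out_features].copy_(self.router.weight)
        self.router = new_router

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: deep hidden states of shape (batch, seq, d_model)
        weights = torch.softmax(self.router(h), dim=-1)            # (B, S, E)
        outs = torch.stack([e(h) for e in self.experts], dim=-1)   # (B, S, D, E)
        return h + (outs * weights.unsqueeze(2)).sum(dim=-1)       # residual update


if __name__ == "__main__":
    layer = ProgressiveMoELayer()
    for _ in range(3):  # pretend three tasks have been seen so far
        layer.add_expert()
    print(layer(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```

In this sketch new experts are initialized as no-ops and the router is widened while preserving the logits learned for earlier experts, mirroring the non-destructive, progressive expansion the summary describes.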

Experiments on the TRACE benchmark and on general language understanding datasets show that PMoE outperforms prior methods, including the state-of-the-art LoRA and O-LoRA baselines, both on metrics of general capability and on task performance after continual fine-tuning.

The implications are twofold. Practically, PMoE improves performance and parameter efficiency, which matters for real-world deployments under resource constraints. Theoretically, it illustrates how asymmetric architectures can address the stability-plasticity dilemma at the heart of continual learning; the router's reliance on deep features also supports the hypothesis that information is progressively aggregated across LLM layers, pointing toward more nuanced architectures in future work.

Future work could validate these findings across a broader range of tasks and models, potentially extending PMoE beyond the generative LLMs it was designed for. Its asymmetric architecture also suggests task-agnostic applications in diverse and dynamic environments, offering an adaptable framework that could carry over to areas of artificial intelligence beyond LLMs.

In summary, the Progressive Mixture of Experts with Asymmetric Transformer is a promising approach to continual learning, advancing parameter efficiency and knowledge retention in LLMs, and its versatility leaves ample room for follow-up research.
