PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning (2407.21571v1)

Published 31 Jul 2024 in cs.CL and cs.AI

Abstract: LLMs encounter significant challenges in continual learning due to catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce the PMoE, Progressive Mixture of Experts with Asymmetric Transformer, which aims to minimize forgetting by utilizing an asymmetric design with shallow layers dedicated to general knowledge and deep layers for new knowledge. PMoE incorporates progressively added experts in deep layers and a router that allocates new knowledge to the appropriate experts efficiently. The router, positioned adjacent to the deep layers, utilizes deep features aggregating consolidated information. This enables the router to perform efficiently, allocating new knowledge to the appropriate experts, which progressively increase in the deep layers. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.

References (39)
  1. PIQA: Reasoning about physical commonsense in natural language. Preprint, arXiv:1911.11641.
  2. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  3. BoolQ: Exploring the surprising difficulty of natural yes/no questions. Preprint, arXiv:1905.10044.
  4. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
  5. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32.
  6. LoRAMoE: Alleviate world knowledge forgetting in large language models via MoE-style plugin. Preprint, arXiv:2312.09979.
  7. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
  8. Parameter-efficient transfer learning for NLP. Preprint, arXiv:1902.00751.
  9. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  10. MeetingBank: A benchmark dataset for meeting summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16409–16423, Toronto, Canada. Association for Computational Linguistics.
  11. Scaling laws for neural language models. Preprint, arXiv:2001.08361.
  12. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
  13. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
  14. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
  15. Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.
  16. David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30.
  17. Learn to explain: Multimodal reasoning via thought chains for science question answering. Preprint, arXiv:2209.09513.
  18. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. Preprint, arXiv:2102.04664.
  19. Investigating forgetting in pre-trained representations through continual learning. arXiv preprint arXiv:2305.05968.
  20. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  21. Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
  22. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3505–3523, Dublin, Ireland. Association for Computational Linguistics.
  23. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314.
  24. A new dataset and efficient baselines for document-level text simplification in German. In Proceedings of the Third Workshop on New Frontiers in Summarization, pages 152–161, Online and in Dominican Republic. Association for Computational Linguistics.
  25. Progressive neural networks. arXiv preprint arXiv:1606.04671.
  26. Trillion dollar words: A new financial dataset, task & market analysis. Preprint, arXiv:2305.07972.
  27. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  28. Challenging big-bench tasks and whether chain-of-thought can solve them. Preprint, arXiv:2210.09261.
  29. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  30. Label words are anchors: An information flow perspective for understanding in-context learning. Preprint, arXiv:2305.14160.
  31. A comprehensive survey of continual learning: Theory, method and application. Preprint, arXiv:2302.00487.
  32. Orthogonal subspace learning for language model continual learning. arXiv preprint arXiv:2310.14152.
  33. TRACE: A comprehensive benchmark for continual learning in large language models. Preprint, arXiv:2310.06762.
  34. Rehearsal-free continual language learning via efficient parameter isolation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10933–10946.
  35. Pretrained language model in continual learning: A comparative study. In International Conference on Learning Representations 2022. OpenReview.
  36. Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations. OpenReview.
  37. C-STANCE: A large dataset for Chinese zero-shot stance detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13369–13385, Toronto, Canada. Association for Computational Linguistics.
  38. LIMA: Less is more for alignment. Preprint, arXiv:2305.11206.
  39. SiRA: Sparse mixture of low rank adaptation. arXiv preprint arXiv:2311.09179.
Authors (2)
  1. Min Jae Jung (2 papers)
  2. Joohee Kim (6 papers)

Summary

Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning

The paper introduces Progressive Mixture of Experts (PMoE), an architecture built around an asymmetric transformer design that targets catastrophic forgetting in continual learning for LLMs. The asymmetry divides responsibility by depth: shallow layers preserve general knowledge, while deep layers are dedicated to learning new, task-specific information. The aim is to retain previously acquired knowledge while absorbing new tasks, improving both resource efficiency and the overall utility of the model.
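
As a rough illustration of this asymmetric split, the PyTorch sketch below freezes the shallow layers and leaves only the deep layers trainable. It is not the authors' implementation: the toy decoder layer, the model width, and the split point `n_shallow` are assumptions chosen only to make the idea concrete.

```python
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """Stand-in for a transformer decoder layer (attention details omitted)."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.relu(self.ff(x))


class AsymmetricLM(nn.Module):
    """Shallow layers frozen (general knowledge); deep layers left trainable."""

    def __init__(self, n_layers: int = 8, n_shallow: int = 6, d_model: int = 64):
        super().__init__()
        self.shallow = nn.ModuleList([ToyDecoderLayer(d_model) for _ in range(n_shallow)])
        self.deep = nn.ModuleList([ToyDecoderLayer(d_model) for _ in range(n_layers - n_shallow)])
        # Freeze the shallow stack so previously acquired knowledge is untouched.
        for p in self.shallow.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.shallow:
            x = layer(x)
        for layer in self.deep:  # only these layers adapt to new tasks
            x = layer(x)
        return x


if __name__ == "__main__":
    model = AsymmetricLM()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable} / {total} parameters")
```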

The core of PMoE is a mixture-of-experts design in which specialized experts are progressively added to the deep layers as new tasks arrive, so new-task parameters can be trained without overwriting those that encode earlier knowledge. A routing network positioned adjacent to the deep layers exploits their aggregated features to direct incoming information to the appropriate expert.
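
To make the routing and progressive-expert mechanism concrete, here is a minimal PyTorch sketch, again not the paper's implementation: the LoRA-style experts, the soft (dense) routing over all experts, and the `add_expert` hook called once per task are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    """A LoRA-style low-rank adapter playing the role of one expert."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # new experts start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class ProgressiveMoELayer(nn.Module):
    """Deep-layer block: experts are appended per task; a router mixes them."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.d_model = d_model
        self.experts = nn.ModuleList()
        self.router = None  # built the first time an expert is added

    def add_expert(self) -> None:
        """Call once per new task: append an expert and widen the router."""
        self.experts.append(LowRankExpert(self.d_model))
        new_router = nn.Linear(self.d_model, len(self.experts), bias=False)
        if self.router is not None:  # keep routing logits learned for old experts
            with torch.no_grad():
                new_router.weight[: self.router.out_features].copy_(self.router.weight)
        self.router = new_router

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: deep hidden states of shape (batch, seq, d_model)
        weights = torch.softmax(self.router(h), dim=-1)            # (B, S, E)
        outs = torch.stack([e(h) for e in self.experts], dim=-1)   # (B, S, D, E)
        return h + (outs * weights.unsqueeze(2)).sum(dim=-1)       # residual update


if __name__ == "__main__":
    layer = ProgressiveMoELayer()
    for _ in range(3):  # pretend three tasks have been seen so far
        layer.add_expert()
    print(layer(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```

In this sketch new experts are initialized as no-ops and the router is widened while preserving the logits learned for earlier experts, mirroring the non-destructive, progressive expansion the summary describes.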

Experiments on the TRACE benchmark and on general language understanding datasets show that PMoE outperforms prior methods, including the state-of-the-art LoRA and O-LoRA baselines, both on metrics of general capability and on task performance after continual fine-tuning.

The implications are twofold. Practically, PMoE improves performance and parameter efficiency, which matters for real-world deployments under resource constraints. Theoretically, it illustrates how asymmetric architectures can address the stability-plasticity dilemma at the heart of continual learning; the router's reliance on deep features also supports the hypothesis that information is progressively aggregated across LLM layers, pointing toward more nuanced architectures in future work.

Future work could validate these findings across a broader range of tasks and models, potentially extending PMoE beyond the generative LLMs it was designed for. Its asymmetric architecture also suggests task-agnostic applications in diverse and dynamic environments, offering an adaptable framework that could carry over to areas of artificial intelligence beyond LLMs.

In summary, the Progressive Mixture of Experts with Asymmetric Transformer is a promising approach to continual learning, advancing parameter efficiency and knowledge retention in LLMs, and its versatility leaves ample room for follow-up research.
