Improving Language Plasticity via Pretraining with Active Forgetting (2307.01163v3)

Published 3 Jul 2023 in cs.CL, cs.LG, and cs.NE

Abstract: Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data and compute inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability of learning new embeddings within a limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation but also outperform standard ones in a low-data regime, particularly for languages that are distant from English.


Summary

  • The paper introduces active forgetting during pretraining to significantly enhance PLM adaptation to low-resource and diverse languages.
  • It periodically resets the token embedding layer during pretraining, a meta-learning-like signal that yields relative gains of up to +60.9% on cross-lingual benchmarks.
  • The results underscore improved efficiency in adapting models with limited data and pave the way for future research on dynamic training strategies.

Improving Language Plasticity via Pretraining with Active Forgetting

Recent advances in pretrained language models (PLMs) have significantly impacted NLP, achieving strong results across standard benchmarks. Despite these successes, adapting PLMs to new languages efficiently remains challenging, particularly for languages distant from the original training language. This paper introduces an approach called "active forgetting" during pretraining to enhance the plasticity of language models, allowing them to adapt to new languages with limited data.

Core Concepts and Methodology

Pretrained models like RoBERTa store linguistic knowledge in their parameters during the pretraining phase. Transferring this knowledge to a new language typically involves learning a new token embedding layer for that language's vocabulary while keeping the transformer body frozen, but this conventional recipe demands substantial data and compute (a sketch of it follows below). The proposed solution periodically resets the token embedding layer during pretraining, termed active forgetting, thereby encouraging the model to become better at relearning embeddings from scratch.
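
The following is a minimal sketch of that standard adaptation recipe, not the authors' code: the transformer body stays frozen while a re-initialized embedding layer (and the vocabulary-specific output head) is trained on text in the new language. The toy model, vocabulary size, objective, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyMaskedLM(nn.Module):
    """Toy stand-in for a RoBERTa-style encoder: embeddings, body, LM head."""
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.body(self.embed(token_ids)))

vocab_size = 8000  # assumed size of the new language's vocabulary
model = TinyMaskedLM(vocab_size)

# Freeze the transformer body; only the embedding layer (and the
# vocabulary-specific output head) is updated during adaptation.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("embed", "lm_head"))

# Re-initialize the embeddings for the new language's vocabulary.
nn.init.normal_(model.embed.weight, std=0.02)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One illustrative update on random token ids standing in for unlabeled
# text in the new language (a real setup would use a masked-LM objective).
tokens = torch.randint(0, vocab_size, (4, 32))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), tokens.view(-1))
loss.backward()
optimizer.step()
```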

The active forgetting mechanism resets the token embeddings every K updates during pretraining, so the transformer body repeatedly has to cooperate with freshly initialized embeddings. This acts like a meta-learning signal: the body becomes robust to new embedding spaces, enabling faster adaptation during subsequent language-specific finetuning.
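
Below is a minimal sketch of this pretraining loop under simplified assumptions (toy model, random token ids standing in for real batches, placeholder K and optimizer settings); it is not the paper's fairseq implementation, only an illustration of the reset-every-K-updates idea. Whether to also reset the embedding's optimizer state is a separate design choice the sketch does not address.

```python
import torch
import torch.nn as nn

K = 500                  # reset interval in updates (the paper's K; this value is a placeholder)
VOCAB, D = 8000, 128     # toy vocabulary and hidden size

model = nn.ModuleDict({
    "embed": nn.Embedding(VOCAB, D),
    "body": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2
    ),
    "lm_head": nn.Linear(D, VOCAB),
})
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(1, 2001):                      # pretraining updates
    tokens = torch.randint(0, VOCAB, (4, 32))    # stand-in for a real pretraining batch
    hidden = model["body"](model["embed"](tokens))
    logits = model["lm_head"](hidden)
    loss = nn.functional.cross_entropy(logits.view(-1, VOCAB), tokens.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Active forgetting: every K updates, re-initialize the token embeddings so
    # the body repeatedly practices working with freshly learned embeddings.
    if step % K == 0:
        nn.init.normal_(model["embed"].weight, std=0.02)
```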

Numerical Results and Performance

Empirical evaluations were conducted on cross-lingual benchmarks such as XNLI, MLQA, and XQuAD. The paper demonstrated substantial improvements in model performance when adapting with limited data:

  • For XNLI, the model achieved an average relative gain of +21.2% compared to standard PLMs.
  • On MLQA, a relative gain of +33.8% was recorded.
  • An even more significant improvement of +60.9% was observed on XQuAD.

The results indicate that active forgetting substantially enhances the model's ability to generalize to new languages, especially those linguistically distant from English, such as Arabic, Hindi, and Turkish.

Implications and Future Directions

The active forgetting approach underscores the potential of dynamic training strategies for developing more versatile and adaptive language models. By fostering linguistic plasticity, it helps PLMs cope with novel linguistic inputs while reducing the data and compute demands of traditional adaptation.

Future research may explore extending this approach to other model architectures and training paradigms, potentially incorporating advanced forgetting techniques like noise injection. Furthermore, understanding the theoretical underpinnings of how such mechanisms affect learning, possibly through the lens of flatness in the loss landscape, can provide deeper insights into optimizing PLM training for adaptability.

In conclusion, active forgetting presents a promising avenue for advancing PLM adaptability, offering significant efficiency improvements for multilingual support and signaling a step towards more flexible, domain-agnostic artificial intelligence systems.