Improving Language Plasticity via Pretraining with Active Forgetting (2307.01163v3)

Published 3 Jul 2023 in cs.CL, cs.LG, and cs.NE

Abstract: Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data and compute inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability of learning new embeddings within a limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation but also outperform standard ones in a low-data regime, particularly for languages that are distant from English.


Summary

  • The paper introduces active forgetting during pretraining to significantly enhance PLM adaptation to low-resource and diverse languages.
  • It periodically resets the token embedding layer during pretraining, a meta-learning-like signal that yields relative gains of up to +60.9% on cross-lingual benchmarks.
  • The results underscore improved efficiency in adapting models with limited data and pave the way for future research on dynamic training strategies.

Improving Language Plasticity via Pretraining with Active Forgetting

Recent advances in pretrained language models (PLMs) have significantly impacted NLP, achieving strong results across standard benchmarks. Despite these successes, adapting PLMs to new languages efficiently remains challenging, particularly for languages distant from the original training language. This paper introduces an approach called "active forgetting" during pretraining to enhance the plasticity of language models, allowing them to adapt to new languages with limited data.

Core Concepts and Methodology

Pretrained models like RoBERTa store linguistic knowledge in their parameters during the pretraining phase. Transferring this knowledge to a new language typically involves learning a new token embedding layer for that language's vocabulary while keeping the transformer body frozen, but this conventional recipe demands substantial data and compute (a sketch of it follows below). The proposed solution periodically resets the token embedding layer during pretraining, termed active forgetting, thereby encouraging the model to become better at relearning embeddings from scratch.
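
The following is a minimal sketch of that standard adaptation recipe, not the authors' code: the transformer body stays frozen while a re-initialized embedding layer (and the vocabulary-specific output head) is trained on text in the new language. The toy model, vocabulary size, objective, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyMaskedLM(nn.Module):
    """Toy stand-in for a RoBERTa-style encoder: embeddings, body, LM head."""
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.body(self.embed(token_ids)))

vocab_size = 8000  # assumed size of the new language's vocabulary
model = TinyMaskedLM(vocab_size)

# Freeze the transformer body; only the embedding layer (and the
# vocabulary-specific output head) is updated during adaptation.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("embed", "lm_head"))

# Re-initialize the embeddings for the new language's vocabulary.
nn.init.normal_(model.embed.weight, std=0.02)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One illustrative update on random token ids standing in for unlabeled
# text in the new language (a real setup would use a masked-LM objective).
tokens = torch.randint(0, vocab_size, (4, 32))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), tokens.view(-1))
loss.backward()
optimizer.step()
```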

The active forgetting mechanism resets the token embeddings every K updates during pretraining, so the transformer body repeatedly has to cooperate with freshly initialized embeddings. This acts like a meta-learning signal: the body becomes robust to new embedding spaces, enabling faster adaptation during subsequent language-specific finetuning.
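
Below is a minimal sketch of this pretraining loop under simplified assumptions (toy model, random token ids standing in for real batches, placeholder K and optimizer settings); it is not the paper's fairseq implementation, only an illustration of the reset-every-K-updates idea. Whether to also reset the embedding's optimizer state is a separate design choice the sketch does not address.

```python
import torch
import torch.nn as nn

K = 500                  # reset interval in updates (the paper's K; this value is a placeholder)
VOCAB, D = 8000, 128     # toy vocabulary and hidden size

model = nn.ModuleDict({
    "embed": nn.Embedding(VOCAB, D),
    "body": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2
    ),
    "lm_head": nn.Linear(D, VOCAB),
})
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(1, 2001):                      # pretraining updates
    tokens = torch.randint(0, VOCAB, (4, 32))    # stand-in for a real pretraining batch
    hidden = model["body"](model["embed"](tokens))
    logits = model["lm_head"](hidden)
    loss = nn.functional.cross_entropy(logits.view(-1, VOCAB), tokens.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Active forgetting: every K updates, re-initialize the token embeddings so
    # the body repeatedly practices working with freshly learned embeddings.
    if step % K == 0:
        nn.init.normal_(model["embed"].weight, std=0.02)
```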

Numerical Results and Performance

Empirical evaluations were conducted on cross-lingual benchmarks such as XNLI, MLQA, and XQuAD. The paper demonstrated substantial improvements in model performance when adapting with limited data:

  • For XNLI, the model achieved an average relative gain of +21.2% compared to standard PLMs.
  • On MLQA, a relative gain of +33.8% was recorded.
  • An even more significant improvement of +60.9% was observed on XQuAD.

The results indicate that active forgetting substantially enhances the model's ability to generalize to new languages, especially those linguistically distant from English, such as Arabic, Hindi, and Turkish.

Implications and Future Directions

The active forgetting approach underscores the potential of dynamic training strategies for developing more versatile and adaptive language models. By fostering linguistic plasticity, it helps PLMs cope with novel linguistic inputs while reducing the data and compute demands of traditional adaptation.

Future research may explore extending this approach to other model architectures and training paradigms, potentially incorporating advanced forgetting techniques like noise injection. Furthermore, understanding the theoretical underpinnings of how such mechanisms affect learning, possibly through the lens of flatness in the loss landscape, can provide deeper insights into optimizing PLM training for adaptability.

In conclusion, active forgetting presents a promising avenue for advancing PLM adaptability, offering significant efficiency improvements for multilingual support and signaling a step towards more flexible, domain-agnostic artificial intelligence systems.