Efficient LLM Training through Cross-Lingual and Progressive Transfer Learning
The paper "Efficient LLM Training through Cross-Lingual and Progressive Transfer Learning" by Malte Ostendorff and Georg Rehm at DFKI GmbH presents an innovative methodology, termed CLP-Transfer, designed to enhance the training efficiency of LLMs, particularly for languages with limited resources. The primary contribution of CLP-Transfer is the combination of cross-lingual and progressive transfer learning to address efficiency challenges in training LLMs for these languages.
Summary
The authors focus on improving training efficiency for Transformer-based LLMs, whose pretraining predominantly relies on English text. The growth in model sizes has exacerbated the performance disparity between English and languages with fewer computational and data resources. To bridge this gap, the paper introduces CLP-Transfer, which leverages pretrained models available for a source language (commonly English) and progressively transfers their capabilities to a target language while requiring less compute than pretraining from scratch.
With CLP-Transfer, the transfer is not only cross-lingual (from a source to a target language) but also progressive with respect to model size: a smaller model, which requires far fewer resources to train, serves as a stepping stone toward a target-language model of the same size as the pretrained source model. Token embeddings are initialized from the overlap between the source and target vocabularies, and all remaining weights are reused from the source LLM. This approach outperforms conventional cross-lingual transfer techniques in training efficiency and can reduce the number of training steps by up to 80%.
Methodology and Assumptions
CLP-Transfer rests on two assumptions about shared vocabulary and token-embedding similarity across model sizes: that the tokenizers of the source and target languages share a substantial fraction of their vocabulary, and that the relative positions of tokens in the embedding space remain comparable across model sizes. These assumptions are what make it possible to reuse pretrained weights across languages and sizes.
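The first assumption can be sanity-checked by measuring the vocabulary overlap between the two tokenizers directly. Below is a minimal sketch using Hugging Face tokenizers; the target-language checkpoint name is a placeholder, not one taken from the paper.

```python
from transformers import AutoTokenizer

# Placeholder checkpoints: "gpt2" stands in for the source-language (English)
# tokenizer, "my-org/german-gpt2" is a hypothetical target-language tokenizer.
source_tok = AutoTokenizer.from_pretrained("gpt2")
target_tok = AutoTokenizer.from_pretrained("my-org/german-gpt2")

source_vocab = set(source_tok.get_vocab())
target_vocab = set(target_tok.get_vocab())

overlap = source_vocab & target_vocab
print(f"Shared tokens: {len(overlap)} "
      f"({len(overlap) / len(target_vocab):.1%} of the target vocabulary)")
```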
For tokens that occur in both the source and target vocabularies, the method initializes the target token embeddings directly from the source model. For target tokens without a source counterpart, it initializes the embedding as a weighted average over the embeddings of overlapping tokens. All Transformer-layer weights are inherited unchanged from the source LLM. This strategy preserves the structural properties of the large source model while adapting the embedding layer to the linguistic characteristics of the target language.
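The sketch below illustrates how such an initialization could look with Hugging Face models. The checkpoint names are placeholders, and the choice to weight overlapping tokens by softmax-normalized cosine similarities computed in a small target-language helper model is one plausible instantiation rather than the paper's exact weighting scheme.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: a large source-language LLM and a small
# target-language "helper" model, both hypothetical names.
source_model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
helper_model = AutoModelForCausalLM.from_pretrained("my-org/german-gpt2")
source_tok = AutoTokenizer.from_pretrained("gpt2-xl")
target_tok = AutoTokenizer.from_pretrained("my-org/german-gpt2")

source_emb = source_model.get_input_embeddings().weight.detach()  # (|V_src|, d_large)
helper_emb = helper_model.get_input_embeddings().weight.detach()  # (|V_tgt|, d_small)

source_vocab = source_tok.get_vocab()  # token -> id in the source vocabulary
target_vocab = target_tok.get_vocab()  # token -> id in the target vocabulary

overlap = [t for t in target_vocab if t in source_vocab]
missing = [t for t in target_vocab if t not in source_vocab]

new_emb = torch.empty(len(target_vocab), source_emb.size(1))

# 1) Overlapping tokens: copy the source model's embedding directly.
for tok in overlap:
    new_emb[target_vocab[tok]] = source_emb[source_vocab[tok]]

# 2) Non-overlapping tokens: weighted average of the source embeddings of
#    overlapping tokens, with weights derived from similarities in the helper
#    model's embedding space (softmax normalization is an illustrative choice).
overlap_ids_tgt = torch.tensor([target_vocab[t] for t in overlap])
overlap_src = torch.stack([source_emb[source_vocab[t]] for t in overlap])  # (|O|, d_large)
helper_overlap = helper_emb[overlap_ids_tgt]                               # (|O|, d_small)

for tok in missing:
    h = helper_emb[target_vocab[tok]]  # helper embedding of the new token
    sims = torch.cosine_similarity(h.unsqueeze(0), helper_overlap, dim=-1)
    weights = torch.softmax(sims, dim=0)
    new_emb[target_vocab[tok]] = weights @ overlap_src

# 3) All Transformer-layer weights stay as they are in the source model; only
#    the embedding matrix is replaced before continued pretraining on
#    target-language data.
source_model.resize_token_embeddings(len(target_vocab))
source_model.get_input_embeddings().weight.data.copy_(new_emb)
```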
Experimental Evaluation and Results
The efficiency and efficacy of CLP-Transfer were demonstrated in experiments on German-language models. Models based on the GPT2 and BLOOM architectures were scaled from smaller bases with CLP-Transfer and compared against baselines such as WECHSEL and from-scratch training. CLP-Transfer outperformed both baselines, reaching competitive perplexity with significantly less compute.
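For context, the perplexity used in such comparisons is the exponential of the mean token-level cross-entropy on held-out target-language text. A minimal sketch with a Hugging Face causal LM follows; the checkpoint name and evaluation text are placeholders, not the paper's actual setup.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical CLP-Transfer checkpoint; substitute a real model and a proper
# held-out evaluation corpus.
model = AutoModelForCausalLM.from_pretrained("my-org/german-gpt2-clp")
tok = AutoTokenizer.from_pretrained("my-org/german-gpt2-clp")
model.eval()

text = "Ein Beispieltext in der Zielsprache zur Evaluierung."  # placeholder sample
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy per token.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```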
Implications and Future Directions
This paper's implications are significant for the ongoing development of language technologies across diverse linguistic contexts. CLP-Transfer opens avenues for producing efficient and performant models in low-resource scenarios, which are critical for broadening AI's applicability globally. Future investigations may explore extending CLP-Transfer to more complex model architectures, broader language families, and further optimization of transfer mechanisms to enhance generalization and downstream task performance.
While the primary focus is on efficiency gains, the paper also points to the need for empirical assessment across varied linguistic datasets to validate its assumptions about vocabulary overlap and embedding alignment. The authors' contributions lay the groundwork for scalable and adaptive solutions that extend LLM development to underrepresented languages worldwide.