Efficient LLM Training through Cross-Lingual and Progressive Transfer Learning
The paper "Efficient LLM Training through Cross-Lingual and Progressive Transfer Learning" by Malte Ostendorff and Georg Rehm at DFKI GmbH presents an innovative methodology, termed CLP-Transfer, designed to enhance the training efficiency of LLMs, particularly for languages with limited resources. The primary contribution of CLP-Transfer is the combination of cross-lingual and progressive transfer learning to address efficiency challenges in training LLMs for these languages.
Summary
The authors focus on improving training efficiency for Transformer-based LLMs, whose pretraining predominantly relies on English text. The growth in model sizes has exacerbated the performance disparity between English and languages with fewer computational and data resources. To bridge this gap, the paper introduces CLP-Transfer, which leverages pretrained models available for a source language (commonly English) and progressively transfers their capabilities to a target language while requiring less compute than pretraining from scratch.
With CLP-Transfer, the transfer is not only cross-lingual (from a source to a target language) but also progressive with respect to model size: a smaller model, which requires far fewer resources to train, serves as a stepping stone toward a target-language model of the same size as the pretrained source model. Token embeddings are initialized from the overlap between the source and target vocabularies, and all remaining weights are reused from the source LLM. This approach outperforms conventional cross-lingual transfer techniques in training efficiency and can reduce the number of training steps by up to 80%.
Methodology and Assumptions
CLP-Transfer rests on two assumptions about shared vocabulary and token-embedding similarity across model sizes: that the tokenizers of the source and target languages share a substantial fraction of their vocabulary, and that the relative positions of tokens in the embedding space remain comparable across model sizes. These assumptions are what make it possible to reuse pretrained weights across languages and sizes.
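The first assumption can be sanity-checked by measuring the vocabulary overlap between the two tokenizers directly. Below is a minimal sketch using Hugging Face tokenizers; the target-language checkpoint name is a placeholder, not one taken from the paper.

```python
from transformers import AutoTokenizer

# Placeholder checkpoints: "gpt2" stands in for the source-language (English)
# tokenizer, "my-org/german-gpt2" is a hypothetical target-language tokenizer.
source_tok = AutoTokenizer.from_pretrained("gpt2")
target_tok = AutoTokenizer.from_pretrained("my-org/german-gpt2")

source_vocab = set(source_tok.get_vocab())
target_vocab = set(target_tok.get_vocab())

overlap = source_vocab & target_vocab
print(f"Shared tokens: {len(overlap)} "
      f"({len(overlap) / len(target_vocab):.1%} of the target vocabulary)")
```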
For tokens that occur in both the source and target vocabularies, the method initializes the target token embeddings directly from the source model. For target tokens without a source counterpart, it initializes the embedding as a weighted average over the embeddings of overlapping tokens. All Transformer-layer weights are inherited unchanged from the source LLM. This strategy preserves the structural properties of the large source model while adapting the embedding layer to the linguistic characteristics of the target language.
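The sketch below illustrates how such an initialization could look with Hugging Face models. The checkpoint names are placeholders, and the choice to weight overlapping tokens by softmax-normalized cosine similarities computed in a small target-language helper model is one plausible instantiation rather than the paper's exact weighting scheme.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: a large source-language LLM and a small
# target-language "helper" model, both hypothetical names.
source_model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
helper_model = AutoModelForCausalLM.from_pretrained("my-org/german-gpt2")
source_tok = AutoTokenizer.from_pretrained("gpt2-xl")
target_tok = AutoTokenizer.from_pretrained("my-org/german-gpt2")

source_emb = source_model.get_input_embeddings().weight.detach()  # (|V_src|, d_large)
helper_emb = helper_model.get_input_embeddings().weight.detach()  # (|V_tgt|, d_small)

source_vocab = source_tok.get_vocab()  # token -> id in the source vocabulary
target_vocab = target_tok.get_vocab()  # token -> id in the target vocabulary

overlap = [t for t in target_vocab if t in source_vocab]
missing = [t for t in target_vocab if t not in source_vocab]

new_emb = torch.empty(len(target_vocab), source_emb.size(1))

# 1) Overlapping tokens: copy the source model's embedding directly.
for tok in overlap:
    new_emb[target_vocab[tok]] = source_emb[source_vocab[tok]]

# 2) Non-overlapping tokens: weighted average of the source embeddings of
#    overlapping tokens, with weights derived from similarities in the helper
#    model's embedding space (softmax normalization is an illustrative choice).
overlap_ids_tgt = torch.tensor([target_vocab[t] for t in overlap])
overlap_src = torch.stack([source_emb[source_vocab[t]] for t in overlap])  # (|O|, d_large)
helper_overlap = helper_emb[overlap_ids_tgt]                               # (|O|, d_small)

for tok in missing:
    h = helper_emb[target_vocab[tok]]  # helper embedding of the new token
    sims = torch.cosine_similarity(h.unsqueeze(0), helper_overlap, dim=-1)
    weights = torch.softmax(sims, dim=0)
    new_emb[target_vocab[tok]] = weights @ overlap_src

# 3) All Transformer-layer weights stay as they are in the source model; only
#    the embedding matrix is replaced before continued pretraining on
#    target-language data.
source_model.resize_token_embeddings(len(target_vocab))
source_model.get_input_embeddings().weight.data.copy_(new_emb)
```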
Experimental Evaluation and Results
The efficiency and efficacy of CLP-Transfer were demonstrated in experiments on German-language models. Models based on the GPT2 and BLOOM architectures were scaled from smaller bases with CLP-Transfer and compared against baselines such as WECHSEL and from-scratch training. CLP-Transfer outperformed both baselines, reaching competitive perplexity with significantly less compute.
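For context, the perplexity used in such comparisons is the exponential of the mean token-level cross-entropy on held-out target-language text. A minimal sketch with a Hugging Face causal LM follows; the checkpoint name and evaluation text are placeholders, not the paper's actual setup.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical CLP-Transfer checkpoint; substitute a real model and a proper
# held-out evaluation corpus.
model = AutoModelForCausalLM.from_pretrained("my-org/german-gpt2-clp")
tok = AutoTokenizer.from_pretrained("my-org/german-gpt2-clp")
model.eval()

text = "Ein Beispieltext in der Zielsprache zur Evaluierung."  # placeholder sample
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy per token.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```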
Implications and Future Directions
This paper's implications are significant for the ongoing development of language technologies across diverse linguistic contexts. CLP-Transfer opens avenues for producing efficient and performant models in low-resource scenarios, which are critical for broadening AI's applicability globally. Future investigations may explore extending CLP-Transfer to more complex model architectures, broader language families, and further optimization of transfer mechanisms to enhance generalization and downstream task performance.
While the primary focus is on efficiency gains, the paper also points to the need for empirical assessment across varied linguistic datasets to validate its assumptions about vocabulary overlap and embedding alignment. The authors' contributions lay the groundwork for scalable and adaptive solutions that extend LLM development to underrepresented languages worldwide.