Empirical Comparison of Vocabulary Expansion and Initialization Approaches for LLMs
The paper "An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for LLMs" explores how to adapt pre-trained LLMs (LMs) like RoBERTa and LLaMA2 to accommodate multiple languages by expanding their vocabulary and initializing the new embeddings effectively. This paper addresses two primary challenges: enhancing the tokenizer’s vocabulary to better capture different languages and effectively initializing the embeddings for these new vocabulary items.
Problem Context and Objectives
Although highly effective for English, LLMs often underperform on other languages because their tokenizers offer limited vocabulary coverage. The authors tackle this by adding new vocabulary items and investigating initialization strategies for the corresponding embeddings. Prevailing initialization techniques that rely on cross-lingual embeddings lack a strong theoretical basis and have rarely been compared against simpler methods. The paper aims to bridge this gap by characterizing what makes a good initialization in theory and by introducing Constrained Word2Vec (CW2V), a novel method that forgoes the need for cross-lingual embeddings.
Methodological Approach
The authors establish theoretically that initializing new embeddings within the convex hull of the existing embeddings is a good starting point. They introduce CW2V, which enforces this constraint without requiring cross-lingual data: the target embeddings are learned as combinations of the source embeddings through a weight matrix whose rows are non-negative and sum to one, which keeps the new embeddings inside the convex hull by construction. The paper compares CW2V against five initialization strategies, namely OFA, Multivariate, Univariate, Mean, and Random initialization, across two models (RoBERTa and LLaMA2), four languages (Hindi, Tamil, Russian, German), and five tasks (XNLI, NER, QA, MT, XLSUM).
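The paper's exact training recipe is not reproduced here; the following is a minimal PyTorch sketch of the core CW2V constraint, in which each new embedding is parameterized as a softmax-weighted (hence convex) combination of frozen source embeddings. The class name, shapes, and usage are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ConvexHullEmbeddings(nn.Module):
    """New-token embeddings constrained to the convex hull of source embeddings.

    Each new embedding is e_new = softmax(logits) @ E_src, so the mixing
    weights are non-negative and sum to one by construction.
    """

    def __init__(self, source_embeddings: torch.Tensor, num_new_tokens: int):
        super().__init__()
        # The original-vocabulary embeddings stay frozen.
        self.register_buffer("source", source_embeddings)  # (V_src, d)
        self.logits = nn.Parameter(torch.zeros(num_new_tokens, source_embeddings.size(0)))

    def forward(self) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=-1)  # rows lie on the probability simplex
        return weights @ self.source                  # (V_new, d), inside the convex hull

# Toy usage: the produced rows would be trained with a word2vec-style objective
# on target-language text, then copied into the expanded embedding matrix.
emb = ConvexHullEmbeddings(source_embeddings=torch.randn(32_000, 768), num_new_tokens=1_000)
new_rows = emb()  # differentiable with respect to the mixing logits
```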
Experimental Setup
For tokenizer expansion, the authors build a unified target tokenizer with SentencePiece and merge its subwords with those of the LLaMA2 tokenizer, which substantially reduces fertility (subword tokens per word) for the target languages and thus yields better representations. CW2V and the baseline initializations are evaluated on the downstream tasks both before and after continual pre-training (CPT), using task-appropriate metrics.
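As a rough illustration of the expansion step and the fertility metric, the sketch below uses the Hugging Face tokenizer API rather than the paper's SentencePiece merging procedure; the sentences and added pieces are hypothetical stand-ins, and the checkpoint is the usual public LLaMA2 identifier.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / max(n_words, 1)

# Hypothetical target-language sentences and pieces taken from a SentencePiece
# model trained on that language.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
hindi_sentences = ["नमस्ते दुनिया", "यह एक उदाहरण वाक्य है"]
target_pieces = ["नमस्ते", "उदाहरण", "वाक्य"]

print("fertility before:", fertility(tokenizer, hindi_sentences))
new_pieces = [p for p in target_pieces if p not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_pieces)
print("fertility after:", fertility(tokenizer, hindi_sentences))

# The model's embedding matrix is then expanded to match, e.g. via
# model.resize_token_embeddings(len(tokenizer)), and the new rows are
# initialized with one of the strategies compared in the paper.
```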
Results
Initial findings indicate that CW2V preserves pre-expansion performance on English tasks better than other methods for both RoBERTa and LLaMA2. Specifically, for RoBERTa, CW2V is on par with or slightly inferior to OFA on multilingual tasks but significantly better at preserving English performance. For LLaMA2, CW2V outperforms OFA across most multilingual and generative tasks while showing comparable performance on English tasks.
With CPT, both RoBERTa and LLaMA2 models initialized via CW2V converge quickly, matching and sometimes surpassing more sophisticated methods such as OFA. Interestingly, simpler methods such as Multivariate initialization, whose sampled embeddings fall within the convex hull with high probability, also perform competitively, suggesting that a strong initialization strategy does not require complex machinery.
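As a rough sketch of what a Multivariate initialization can look like (one common variant, not necessarily the paper's exact formulation), one can fit a Gaussian to the existing embedding rows and sample the new rows from it:

```python
import numpy as np

def multivariate_init(source_emb: np.ndarray, num_new: int, seed: int = 0) -> np.ndarray:
    """Sample new embedding rows from a Gaussian fitted to the existing rows.

    Samples share the mean and covariance of the source embeddings, which
    (per the paper's analysis) places them inside their convex hull with
    high probability.
    """
    rng = np.random.default_rng(seed)
    mean = source_emb.mean(axis=0)
    cov = np.cov(source_emb, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=num_new)

# Toy usage with random stand-ins for a real embedding matrix.
source = np.random.randn(5000, 64)
new_rows = multivariate_init(source, num_new=100)
print(new_rows.shape)  # (100, 64)
```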
Implications
The theoretical findings on the role of the convex hull in embedding initialization suggest that simpler, computationally cheaper methods can be equally effective for multilingual adaptation. Practically, methods such as CW2V can significantly improve a pre-trained LM's adaptation to new languages without extensive cross-lingual resources. However, the early phases of CPT tend to hurt performance on English tasks, and prolonged training is needed to recover it, pointing to a delicate balance between multilingual adaptation and preservation of source-language capabilities.
Future Directions
This paper lays the groundwork for further research into more efficient and theoretically grounded initialization strategies. Future work could explore broader sets of languages and tasks, finer adjustments to CPT that better balance source- and target-language performance, and the integration of these findings with dynamic vocabulary adaptation methods.
By addressing the inherent challenges in vocabulary expansion and providing empirical evidence for the efficacy of simple, yet theoretically sound methods, this paper contributes significantly to the ongoing development of robust, multilingual LLMs.