
SambaLingo: Teaching Large Language Models New Languages (2404.05829v2)

Published 8 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

Comprehensive Study on Adapting LLMs to New Languages

Introduction to LLM Adaptation

The adaptation of pre-trained LLMs to new languages has emerged as a promising avenue for leveraging existing computational and data resources to extend the utility of these models across diverse linguistic landscapes. This paper presents an extensive study of strategies for adapting LLMs to nine typologically diverse languages: Arabic, Bulgarian, Hungarian, Japanese, Russian, Serbian, Slovenian, Thai, and Turkish. The research explores vocabulary extension, continual pre-training, and methods for aligning models with human preferences in low-resource languages. Through careful experimentation, the paper sets new performance benchmarks, outperforming previously published baselines in these languages across several dimensions.

Key Findings on LLM Adaptation

Vocabulary Expansion and Embedding Initialization

The paper highlights the importance of expanding the model's vocabulary with tokens from the target language. Although vocabulary expansion did not substantially improve downstream task accuracy, it improved tokenizer efficiency and inference speed in the target languages. Different strategies for initializing the new token embeddings were explored; initializing each new embedding as the average of its constituent sub-word embeddings accelerated convergence during training with minimal impact on final accuracy.
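
To make the averaging strategy concrete, here is a minimal sketch using the Hugging Face `transformers` API: new target-language tokens are added to the tokenizer, and each new embedding (input and output) is initialized as the mean of the embeddings of the sub-words it decomposes into under the original vocabulary. The base checkpoint and the example tokens are placeholders for illustration, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"        # placeholder base checkpoint
new_tokens = ["szerint", "köszönöm"]     # placeholder target-language tokens

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Decompose each new token with the ORIGINAL vocabulary before extending it.
sub_ids = [tokenizer(t, add_special_tokens=False).input_ids for t in new_tokens]

orig_vocab = len(tokenizer)
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

in_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight

with torch.no_grad():
    for i, ids in enumerate(sub_ids):
        # New token embedding = mean of its original sub-word embeddings.
        in_emb[orig_vocab + i] = in_emb[ids].mean(dim=0)
        out_emb[orig_vocab + i] = out_emb[ids].mean(dim=0)
```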

Continual Pre-training on Mixed-Language Data

The effectiveness of continual pre-training is demonstrated by training on a mixture of English and target-language web data. The experiments indicate that a higher proportion of target-language data yields faster convergence and better performance in the target language, underscoring the importance of a balanced, thoughtfully curated training corpus.
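
As an illustration of how such a mixture might be constructed, the following hedged sketch uses the `datasets` library to interleave English and target-language web corpora at a chosen ratio. The corpus names, the Hungarian language code, and the 0.75 mixing ratio are assumptions for illustration rather than the paper's exact data pipeline.

```python
from datasets import load_dataset, interleave_datasets

TARGET_RATIO = 0.75  # assumed placeholder; this proportion is the object of the ablation

# Example web corpora; keep only the shared "text" column so they can be interleaved.
english = load_dataset("allenai/c4", "en", split="train", streaming=True).select_columns(["text"])
target = load_dataset("uonlp/CulturaX", "hu", split="train", streaming=True).select_columns(["text"])

mixed = interleave_datasets(
    [target, english],
    probabilities=[TARGET_RATIO, 1.0 - TARGET_RATIO],
    seed=42,
)

# `mixed` can then be tokenized with the extended tokenizer and streamed into a
# standard causal-LM training loop for continual pre-training.
```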

Human Preference Alignment with Limited Data

A notable aspect of this paper is its approach to aligning models with human preferences using minimal target-language alignment data. The findings suggest that a mixture consisting largely of machine-translated alignment data can be nearly as effective as data written natively in the target language, mitigating the data scarcity problem for human alignment in low-resource languages.
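
For reference, the direct preference optimization objective used in this alignment step can be written as a small function; the following is the standard DPO formulation in PyTorch, sketched as a reading aid rather than a claim about the paper's training code, with `beta=0.1` as a commonly used default. In practice the preference pairs would be drawn from the translated-plus-native mixture described above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument holds the summed log-probabilities of the chosen or rejected
    responses under the trainable policy or the frozen reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    reference_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```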

Quantitative Benchmarks and Evaluation

The adapted models were benchmarked against a suite of established multilingual and language-specific evaluations and outperformed previous state-of-the-art models, with improvements in perplexity, translation quality, text classification, and natural language understanding across all target languages. These results validate the proposed adaptation methodology and underscore its potential as a scalable way to broaden the accessibility and utility of LLMs across a wider array of languages.
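
As one example of such an evaluation, the sketch below estimates held-out perplexity in a target language with `transformers`. The checkpoint name and the corpus are placeholders (the name follows the released SambaLingo checkpoints on Hugging Face); the authors' published evaluation code should be used to reproduce the reported numbers.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/SambaLingo-Hungarian-Base"  # assumed released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

texts = ["..."]  # placeholder: held-out target-language documents

nll, n_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1) - 1   # loss is averaged over shifted target positions
        nll += out.loss.item() * n
        n_tokens += n

print(f"perplexity: {math.exp(nll / n_tokens):.2f}")
```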

Future Directions in LLM Adaptation

This study not only advances our understanding of the process of adapting LLMs to new languages but also sets the stage for future research in this area. The open sourcing of code and checkpoints should stimulate further development, enabling researchers to build on this work. Future work may delve deeper into language-specific tuning, extend the approach to additional languages and scripts, and refine human preference alignment techniques to reflect diverse cultural and regional nuances.

Conclusion

In conclusion, this paper contributes significantly to the field of computational linguistics by providing a detailed protocol for adapting LLMs to new languages, supported by empirical evidence of its efficacy across a wide range of linguistic tasks. By addressing key challenges such as vocabulary extension, training data scarcity, and alignment with human preferences, this work paves the way for more accessible, efficient, and versatile LLMs, democratizing the benefits of AI across linguistic boundaries.

Authors (10)
  1. Zoltan Csaki
  2. Bo Li
  3. Jonathan Li
  4. Qiantong Xu
  5. Pian Pawakapan
  6. Leon Zhang
  7. Yun Du
  8. Hengyu Zhao
  9. Changran Hu
  10. Urmish Thakker