From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages (2410.18836v1)
Abstract: In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base LLMs that support English and any target language. The method comprises vocabulary expansion, initialization of new embeddings, model training, and evaluation. We performed our experiments on three languages, each written in a non-Latin script: Ukrainian, Arabic, and Georgian. Our approach improves target-language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics for evaluating language quality, which reveal that vocabulary size significantly affects the quality of generated text.
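The abstract names two concrete steps: vocabulary expansion and initialization of the new embeddings. Below is a minimal sketch of what these steps can look like with HuggingFace Transformers; the base model name, the example tokens, and the mean-of-subwords initialization heuristic are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: expand an English-centric tokenizer with target-language
# tokens and warm-start their embeddings. All names below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed English-centric base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical target-language subwords, e.g. learned with SentencePiece
# on a Ukrainian corpus.
new_tokens = ["привіт", "мова", "розмовляти"]

# Record how the ORIGINAL tokenizer splits each new token, before expansion.
old_pieces = {
    t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens
}

# Expand the vocabulary and grow the embedding (and, if tied, output)
# matrices to match the new vocabulary size.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Warm-start each new embedding as the mean of its old subword embeddings,
# a common alternative to random (e.g. Glorot-style) initialization.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(t)
        pieces = old_pieces[t]
        if pieces:
            emb[new_id] = emb[torch.tensor(pieces)].mean(dim=0)

# The expanded model would then be continually pre-trained on a bilingual
# (English + target-language) corpus before evaluation.
```

After expansion, target-language text tokenizes into fewer pieces, which is the mechanism behind the reduced computational cost and the mitigated per-token penalization of underrepresented languages that the abstract describes.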