Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models (2402.14714v1)
Abstract: This report introduces \texttt{EEVE-Korean-v1.0}, a Korean adaptation of large language models that exhibits remarkable capabilities in both English and Korean text understanding. Recent highly capable but English-centric LLMs, such as SOLAR-10.7B and Phi-2, process non-English text inefficiently because of their English-centric tokenizers. Building on these models, we present an efficient and effective vocabulary expansion (EEVE) method that combines parameter freezing and subword initialization. In contrast to previous work suggesting that new embeddings require trillions of training tokens to learn, we show that our method can significantly boost non-English proficiency within just 2 billion tokens. As of January 2024, our model \texttt{EEVE-Korean-10.8B-v1.0} surpasses most instruction-tuned LLMs on the Open Ko-LLM Leaderboard and ranks as the leading Korean pre-trained model in the open-source community, according to Hugging Face's leaderboard. We open-source our models on Hugging Face to empower the open research community in various languages.
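The abstract describes the EEVE recipe only at a high level. The sketch below illustrates one plausible reading of its two ingredients, subword initialization and parameter freezing, using the Hugging Face transformers API. The base checkpoint name, the toy token list, and the mean-of-subword-embeddings initialization are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of vocabulary expansion with subword initialization.
# Assumptions: a Llama-style base checkpoint and a tiny toy token list;
# the real EEVE vocabulary and training schedule are much larger/finer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "upstage/SOLAR-10.7B-v1.0"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Toy list of new Korean tokens to add to the vocabulary.
new_tokens = ["안녕하세요", "감사합니다"]

# Record how the *original* tokenizer splits each new token, before expansion.
old_subword_ids = {
    tok: tokenizer.encode(tok, add_special_tokens=False) for tok in new_tokens
}

# Expand the vocabulary and grow the embedding matrices accordingly.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Subword initialization: set each new embedding row to the mean of the
# embeddings of the subword pieces the old tokenizer produced for that token.
input_emb = model.get_input_embeddings().weight
output_emb = model.get_output_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        input_emb[new_id] = input_emb[old_subword_ids[tok]].mean(dim=0)
        output_emb[new_id] = output_emb[old_subword_ids[tok]].mean(dim=0)

# Coarse sketch of parameter freezing: train only the embedding matrices and
# keep the transformer body frozen. (The paper's staged schedule is
# finer-grained, e.g. updating only the newly added rows at first.)
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)
```

Compared with random initialization, averaging the old subword embeddings keeps the new rows close to the distribution the frozen transformer body already expects, which is why only a modest token budget is needed to adapt them.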
Authors: Seungduk Kim, Seungtaek Choi, Myeongho Jeong