
Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training (2109.07306v1)

Published 15 Sep 2021 in cs.CL

Abstract: Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
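
The abstract gives only a high-level description of k-NN-based target sampling, so the following is a rough NumPy sketch of one plausible reading, not the authors' implementation. It assumes the softmax is restricted to the target tokens in the current batch plus each target's k nearest neighbours in embedding space, with inner product as the similarity measure. The function names `knn_target_subset` and `sampled_softmax_loss` are hypothetical.

```python
import numpy as np

def knn_target_subset(embeddings, target_ids, k):
    """Collect the batch targets plus each target's k nearest neighbours.

    Assumption: similarity is the inner product between output embeddings.
    Returns a sorted array of unique vocabulary ids.
    """
    targets = np.unique(target_ids)
    scores = embeddings[targets] @ embeddings.T            # (T, V)
    # Unordered indices of the k highest-scoring tokens for every target.
    neighbours = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # union1d sorts and deduplicates, and guarantees the targets are included.
    return np.union1d(targets, neighbours.ravel())

def sampled_softmax_loss(hidden, embeddings, target_ids, subset):
    """Cross-entropy computed over `subset` instead of the full vocabulary."""
    logits = hidden @ embeddings[subset].T                 # (B, S)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # subset is sorted and contains every target, so searchsorted finds its column.
    columns = np.searchsorted(subset, target_ids)
    return -log_probs[np.arange(len(target_ids)), columns].mean()
```

With a vocabulary of size V and a batch of B targets, the logits matrix shrinks from B × V to at most B × (B·k + B) columns, which is where the speed-up would come from under these assumptions.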

Authors (8)
  1. Bo Zheng (205 papers)
  2. Li Dong (154 papers)
  3. Shaohan Huang (79 papers)
  4. Saksham Singhal (14 papers)
  5. Wanxiang Che (152 papers)
  6. Ting Liu (329 papers)
  7. Xia Song (38 papers)
  8. Furu Wei (291 papers)
Citations (21)