Improving Pre-Trained Multilingual Models with Vocabulary Expansion (1909.12440v1)

Published 26 Sep 2019 in cs.CL

Abstract: Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on a pre-trained multilingual model, BERT, for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.
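
The abstract names the two strategies but does not spell out their mechanics. Below is a minimal sketch of how a "mixture mapping" could be realized: an OOV word's embedding is built as a similarity-weighted mixture of embeddings already present in the multilingual BERT vocabulary, with neighbours found in an external word-embedding space. All function names, the top-k softmax weighting, and the use of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of mixture mapping for an OOV word (assumptions, not the paper's code).
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between vector `a` and each row of matrix `b`."""
    a_n = a / (np.linalg.norm(a) + 1e-12)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return b_n @ a_n

def mixture_map_oov(oov_ext_vec: np.ndarray,
                    vocab_ext_vecs: np.ndarray,
                    vocab_bert_embs: np.ndarray,
                    top_k: int = 5) -> np.ndarray:
    """Build a BERT-space embedding for an OOV word as a softmax-weighted
    mixture of the BERT embeddings of its nearest in-vocabulary neighbours,
    where neighbours are found in an external (e.g. fastText-style) space."""
    sims = cosine_sim(oov_ext_vec, vocab_ext_vecs)            # similarity in external space
    top = np.argsort(sims)[-top_k:]                           # k most similar in-vocab words
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()     # normalise to mixture weights
    return weights @ vocab_bert_embs[top]                     # weighted average in BERT space

# Toy usage: 100 in-vocabulary words, external dim 300, BERT dim 768.
rng = np.random.default_rng(0)
vocab_ext = rng.normal(size=(100, 300))
vocab_bert = rng.normal(size=(100, 768))
oov_vec = rng.normal(size=300)
print(mixture_map_oov(oov_vec, vocab_ext, vocab_bert).shape)  # (768,)
```

By contrast, joint mapping would learn a single transformation from the external embedding space into BERT's embedding space and apply it to every OOV vector; per the abstract, the mixture approach proved more promising in the authors' experiments.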

Authors (5)
  1. Hai Wang (98 papers)
  2. Dian Yu (78 papers)
  3. Kai Sun (317 papers)
  4. Jianshu Chen (1 paper)
  5. Dong Yu (328 papers)
Citations (38)