Training Multilingual Pre-trained Language Model with Byte-level Subwords (2101.09469v2)

Published 23 Jan 2021 in cs.CL

Abstract: Pre-trained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text by pre-training on large-scale corpora. One of the fundamental components of a pre-trained language model is the vocabulary, especially when training multilingual models on many different languages. In this technical report, we present our practices for training multilingual pre-trained language models with BBPE: Byte-Level BPE (i.e., Byte Pair Encoding). In our experiments, we adopted the architecture of NEZHA as the underlying pre-trained language model, and the results show that NEZHA trained with byte-level subwords consistently outperforms Google multilingual BERT and vanilla NEZHA by a notable margin on several multilingual NLU tasks. We release the source code of our byte-level vocabulary building tools and the multilingual pre-trained language models.
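
The key idea behind BBPE is to run BPE merge learning over UTF-8 bytes rather than characters, so every language shares the same 256-symbol base alphabet and out-of-vocabulary characters cannot occur. The sketch below is a minimal illustration of byte-level BPE merge learning, assuming a toy corpus, a small merge count, and a hypothetical function name `build_bbpe_vocab`; it is not the authors' released vocabulary-building tool.

```python
from collections import Counter

def build_bbpe_vocab(corpus, num_merges=10):
    """Learn BPE merges over UTF-8 bytes instead of characters (illustrative sketch)."""
    # Represent each word as a tuple of its UTF-8 byte values, so any script
    # (Latin, CJK, etc.) starts from the same 256-symbol base vocabulary.
    words = Counter(tuple(word.encode("utf-8")) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with a merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(best)  # merged symbol, kept as a nested tuple
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

if __name__ == "__main__":
    # Mixed-script toy corpus: Chinese words span multiple UTF-8 bytes,
    # which is exactly the case byte-level subwords are designed to handle.
    print(build_bbpe_vocab("低资源 语言 low resource language language", num_merges=5))
```

In practice, tools such as the authors' released vocabulary builder learn tens of thousands of merges over large multilingual corpora; the resulting byte-level subword vocabulary is then used to tokenize the pre-training data for the NEZHA model.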

Authors (4)
  1. Junqiu Wei (4 papers)
  2. Qun Liu (230 papers)
  3. Yinpeng Guo (6 papers)
  4. Xin Jiang (242 papers)
Citations (18)