MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining (2011.08539v1)
Abstract: Although the development of pre-trained language models (PLMs) has significantly raised the performance of various Chinese NLP tasks, the vocabulary for these Chinese PLMs remains the one provided by Google's Chinese BERT \cite{devlin2018bert}, which is based on Chinese characters. Second, the masked language model pre-training is based on a single vocabulary, which limits its downstream task performance. In this work, we first propose a novel method, \emph{seg_tok}, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization. Then we propose three versions of multi-vocabulary pretraining (MVP) to improve the model's expressiveness. Experiments show that: (a) compared with a char-based vocabulary, \emph{seg_tok} not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency; (b) MVP improves PLMs' downstream performance; in particular, it improves \emph{seg_tok}'s performance on sequence labeling tasks.
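The abstract only names the ingredients of \emph{seg_tok}: word segmentation first, then subword tokenization over the segmented words. A minimal sketch of that general recipe, not the authors' implementation, is shown below; it assumes `jieba` for CWS and the HuggingFace `tokenizers` library for subword (BPE) training, and the corpus, vocabulary size, and special tokens are illustrative placeholders.

```python
# Sketch of a seg_tok-style vocabulary: segment Chinese text into words,
# then learn a subword vocabulary over the segmented corpus so that
# frequent words become whole tokens and rare words fall back to subwords.
import jieba
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def segmented_corpus(lines):
    # Join CWS output with spaces so the subword trainer treats each
    # Chinese word as a pre-tokenized unit.
    for line in lines:
        yield " ".join(jieba.lcut(line.strip()))

# Toy corpus; in practice this would be the full pretraining corpus.
corpus = ["人工智能正在改变自然语言处理。", "预训练语言模型提升了下游任务表现。"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=30000,  # placeholder; the paper's actual size may differ
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(segmented_corpus(corpus), trainer)

# Frequent words are kept intact; unseen words decompose into subwords.
print(tokenizer.encode("预训练语言模型").tokens)
```

Compared with the character-based vocabulary of the original Chinese BERT, a word-plus-subword vocabulary of this kind yields shorter input sequences, which is the source of the efficiency gain the abstract mentions.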