MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining (2011.08539v1)

Published 17 Nov 2020 in cs.CL

Abstract: Although the development of pre-trained language models (PLMs) has significantly raised the performance of various Chinese NLP tasks, the vocabulary for these Chinese PLMs remains the one provided by Google's Chinese BERT \cite{devlin2018bert}, which is based on Chinese characters. In addition, the masked language model pre-training is based on a single vocabulary, which limits downstream task performance. In this work, we first propose a novel method, \emph{seg_tok}, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization. Then we propose three versions of multi-vocabulary pretraining (MVP) to improve the model's expressiveness. Experiments show that: (a) compared with a character-based vocabulary, \emph{seg_tok} not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency; (b) MVP improves PLMs' downstream performance; in particular, it improves \emph{seg_tok}'s performance on sequence labeling tasks.
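
The core idea of \emph{seg_tok}, as stated in the abstract, is to tokenize from word-segmented text rather than raw characters: a sentence is first split into words by a CWS tool, and each word is then broken into subword pieces against the vocabulary. The sketch below illustrates that two-stage tokenization; the choice of jieba as the segmenter, the greedy longest-match subword routine, and the toy vocabulary are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of the seg_tok tokenization described in the abstract:
# (1) split a Chinese sentence into words with a CWS tool, then
# (2) break each word into subword pieces against a fixed vocabulary.
# jieba and the greedy longest-match routine are illustrative stand-ins;
# the paper does not commit to these exact components in the abstract.
import jieba  # off-the-shelf Chinese word segmenter (assumed choice)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation-piece convention
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:  # word cannot be covered by the vocabulary
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def seg_tok(sentence, vocab):
    """CWS first, then subword tokenization of every segmented word."""
    tokens = []
    for word in jieba.cut(sentence):
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Toy usage with a hypothetical vocabulary mixing word-level entries and
# subword fallbacks; a real vocabulary would be learned from a corpus.
vocab = {"自然", "语言", "##语言", "处理", "##处理", "很", "有", "有趣", "##趣"}
print(seg_tok("自然语言处理很有趣", vocab))
```

Under such a word-first vocabulary, frequent words stay as single tokens (shortening input sequences, which is the efficiency gain the abstract mentions), while rare words fall back to subword pieces instead of an out-of-vocabulary token.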

Citations (6)

