Pre-Training with Whole Word Masking for Chinese BERT (1906.08101v3)

Published 19 Jun 2019 in cs.CL and cs.LG

Abstract: Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and its consecutive variants have been proposed to further improve the performance of the pre-trained language models. In this paper, we aim to first introduce the whole word masking (wwm) strategy for Chinese BERT, along with a series of Chinese pre-trained language models. Then we also propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways. Especially, we propose a new masking strategy called MLM as correction (Mac). To demonstrate the effectiveness of these models, we create a series of Chinese pre-trained language models as our baselines, including BERT, RoBERTa, ELECTRA, RBT, etc. We carried out extensive experiments on ten Chinese NLP tasks to evaluate the created Chinese pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT could achieve state-of-the-art performances on many NLP tasks, and we also ablate details with several findings that may help future research. We open-source our pre-trained language models for further facilitating our research community. Resources are available: https://github.com/ymcui/Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT

The paper explores pre-trained language models tailored to Chinese, focusing on adapting and improving BERT with whole word masking. It systematically assesses existing pre-training methodologies and proposes advancements, culminating in a new model, MacBERT.

Improvements in Chinese BERT Models

The initial exploration involves implementing whole word masking (WWM) for Chinese BERT. Unlike token-level masking, WWM masks every token belonging to a word, so the model must predict whole words rather than fragments, a harder task that strengthens its contextual learning. Because Chinese BERT tokenizes text into individual characters, word boundaries are obtained with a Chinese word segmenter (the authors use LTP), as sketched below.
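
To make the contrast with token-level masking concrete, here is a minimal sketch of whole word masking (not the authors' implementation): it assumes character-level tokens and word spans from a segmenter such as LTP, and the helper name and 15% ratio are illustrative. Token-level masking could mask "语" alone; WWM masks both "语" and "言" of the word "语言".

    import random

    def whole_word_mask(tokens, word_spans, mask_ratio=0.15, mask_token="[MASK]"):
        # Mask every character of a selected word so the model must recover
        # whole words, not fragments. `word_spans` are (start, end) index
        # pairs produced by a Chinese word segmenter (e.g., LTP).
        masked = list(tokens)
        budget = max(1, int(round(len(tokens) * mask_ratio)))
        spans = list(word_spans)
        random.shuffle(spans)
        covered = 0
        for start, end in spans:
            if covered >= budget:
                break
            for i in range(start, end):
                masked[i] = mask_token
            covered += end - start
        return masked

    # "使用语言模型" segmented as 使用 / 语言 / 模型 (character-level tokens)
    tokens = ["使", "用", "语", "言", "模", "型"]
    spans = [(0, 2), (2, 4), (4, 6)]
    print(whole_word_mask(tokens, spans))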

The researchers introduced RoBERTa with WWM and further extended the training corpus with additional Chinese data, yielding the 'ext' versions of these models. The paper highlights that extending the training corpus is crucial for obtaining better models, given the relatively small size of Chinese Wikipedia compared to its English counterpart.

Introduction of MacBERT

MacBERT's main innovation is modifying the masked language model (MLM) task to better match downstream use. It reformulates MLM as an "MLM as correction" (Mac) task, narrowing the discrepancy between pre-training and fine-tuning: instead of inserting artificial [MASK] tokens, MacBERT replaces the selected words with similar words, so the input resembles text containing spelling or word-choice errors that the model learns to correct.
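
As a hedged sketch of this corruption step (not the authors' exact code): the toy SIMILAR table below stands in for the word-embedding-based similarity toolkit the paper uses to pick replacement words, and the 15% ratio and n-gram masking details are simplified away.

    import random

    # Toy similar-word table; MacBERT obtains similar words from a
    # word-embedding-based similarity toolkit (this table is illustrative).
    SIMILAR = {"语言": ["言语"], "模型": ["模式"]}

    def mac_corrupt(words, replace_ratio=0.15):
        # "MLM as correction": corrupt selected words with similar words
        # instead of inserting [MASK]; the model is trained to restore
        # the original words.
        corrupted = list(words)
        indices = list(range(len(words)))
        random.shuffle(indices)
        budget = max(1, int(round(len(words) * replace_ratio)))
        for i in indices[:budget]:
            candidates = SIMILAR.get(words[i])
            if candidates:
                corrupted[i] = random.choice(candidates)
            # the paper falls back to a random word when no similar word
            # exists; that branch is omitted in this sketch
        return corrupted

    print(mac_corrupt(["使用", "语言", "模型"]))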

Additionally, MacBERT adopts the sentence order prediction (SOP) task in place of next sentence prediction (NSP) to model inter-sentence relationships; SOP proves more effective at capturing natural sentence order and improves downstream task performance.
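
For reference, a SOP training pair can be built roughly as follows (an illustrative sketch; the labels and sampling are not taken from the paper's code): two consecutive segments in their original order form a positive example, and swapping them forms a negative one.

    import random

    def make_sop_pair(segment_a, segment_b):
        # Sentence Order Prediction: consecutive segments in original order
        # are positive (label 1); swapped order is negative (label 0).
        if random.random() < 0.5:
            return (segment_a, segment_b), 1
        return (segment_b, segment_a), 0

    pair, label = make_sop_pair("今天天气很好。", "我们决定去公园散步。")
    print(pair, label)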

Experimental Evaluation

Extensive experiments on a variety of Chinese NLP tasks demonstrate the effectiveness of MacBERT. The evaluation covers ten different tasks, including machine reading comprehension, sentiment analysis, and sentence pair classification. MacBERT consistently outperforms baseline models across these tasks, achieving state-of-the-art results.

The work also compares small models, revealing that narrower, deeper architectures (e.g., the six-layer RBT6) generally perform better than wider, shallower ones (e.g., the three-layer RBTL3, which uses larger hidden sizes) under similar parameter budgets. This insight is particularly useful for deploying models in resource-constrained environments.

Implications and Future Directions

The implications of this work are twofold. Practically, it provides robust pre-trained models for a wide range of Chinese NLP applications, with open-access resources to facilitate further research. Theoretically, it presents compelling evidence for pre-training strategies tailored to non-English languages and for realistic pre-training tasks that mirror downstream applications.

Future research directions could focus on adopting similar strategies for other languages, adjusting masking strategies dynamically, and improving the alignment of pre-training with diverse downstream tasks. Moreover, exploring the integration of domain-specific knowledge and continual learning techniques may further enhance model adaptability and performance.

Conclusion

Overall, this paper makes substantial contributions to Chinese NLP by refining existing pre-training strategies and introducing innovative approaches like MacBERT. The research effectively bridges the gap between pre-training techniques and practical application requirements, setting a precedent for future developments in language model pre-training.

Authors (5)
  1. Yiming Cui (80 papers)
  2. Wanxiang Che (152 papers)
  3. Ting Liu (329 papers)
  4. Bing Qin (186 papers)
  5. Ziqing Yang (29 papers)
Citations (145)