Analyzing Masking Rates in Masked Language Modeling
The paper "Should You Mask 15% in Masked Language Modeling?" by Wettig et al. challenges conventional assumptions about the optimal choice of masking rate in masked language models (MLMs). The practice of masking 15% of tokens has been pervasive across model sizes and masking strategies, largely based on the belief that masking more would leave too little context to learn effective representations, while masking less would reduce training efficiency. This paper revisits that norm and provides evidence that larger models benefit from higher masking rates, contrary to the prevalent strategy.
The authors conducted an array of experiments with BERT-large models, fine-tuning on standard benchmarks such as GLUE and SQuAD, and found that a masking rate of 40% improved downstream performance over the 15% baseline. Remarkably, even an extreme masking rate of 80% retained about 95% of downstream task performance, suggesting that the optimal masking rate scales with model capacity rather than being fixed at a universal value. This observation undercuts the traditional justification for limiting mask rates, implying that recent advances in model architectures and capacity allow learning under far more challenging corruption regimes than previously assumed.
Delving deeper into masking strategies, the paper establishes that different strategies have different optimal rates. Uniform masking, the simplest and most common approach, benefits from higher masking rates more than advanced strategies like span or PMI masking do. At higher rates, uniformly chosen mask positions are more likely to cover whole spans and n-grams by chance, inadvertently approximating the effects of the more sophisticated strategies with far less machinery.
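The span-coverage argument can be checked with a quick simulation (an illustrative sketch, not code from the paper): count how often uniformly sampled mask positions happen to land on adjacent tokens, i.e. how many 2-grams are fully masked, at 15% versus 40% masking.

```python
import random

def uniform_mask(num_tokens, mask_rate, rng):
    """Select mask positions uniformly at random, without replacement."""
    k = int(num_tokens * mask_rate)
    return set(rng.sample(range(num_tokens), k))

def avg_masked_bigrams(seq_len, rate, trials, rng):
    """Average number of adjacent masked pairs (fully masked 2-grams) per sequence."""
    total = 0
    for _ in range(trials):
        pos = uniform_mask(seq_len, rate, rng)
        total += sum(1 for p in pos if p + 1 in pos)
    return total / trials

rng = random.Random(0)
bigrams_15 = avg_masked_bigrams(512, 0.15, 200, rng)
bigrams_40 = avg_masked_bigrams(512, 0.40, 200, rng)
print(f"15% masking: ~{bigrams_15:.1f} masked bigrams per 512-token sequence")
print(f"40% masking: ~{bigrams_40:.1f} masked bigrams per 512-token sequence")
```

With k masked positions out of n, the expected count is roughly k(k-1)/n, so the 40% setting covers several times as many contiguous pairs as the 15% setting, which is the intuition behind uniform masking mimicking span masking at high rates.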
Another pivotal contribution of the paper is the conceptual disentangling of masking into corruption and prediction rates. The corruption rate is the fraction of context tokens altered or removed, while the prediction rate is the fraction of tokens the model is trained to predict from the corrupted context. Through ablation experiments, the paper shows that higher prediction rates are advantageous because they generate more learning signal per sequence, whereas higher corruption rates make the task harder by leaving less intact context for each prediction. This analysis refines our understanding of the masking mechanism, revealing a nuanced interaction in which the benefits of more prediction can outweigh the costs of more corruption.
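A minimal sketch of the decoupling, under assumed helper names (the paper's implementation differs): corrupt one fraction of tokens, then compute the loss on only a subset of the corrupted positions, so the prediction rate can be varied independently below the corruption rate.

```python
import random

MASK = "[MASK]"

def corrupt_and_select(tokens, corrupt_rate, predict_rate, rng):
    """Decouple the two rates: corrupt `corrupt_rate` of the tokens,
    but return loss targets for only `predict_rate` of them.
    Standard MLM is the special case predict_rate == corrupt_rate."""
    assert predict_rate <= corrupt_rate
    n = len(tokens)
    corrupted = rng.sample(range(n), max(1, int(n * corrupt_rate)))
    targets = rng.sample(corrupted, max(1, int(n * predict_rate)))
    inputs = [MASK if i in set(corrupted) else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in targets}  # positions the loss is computed on
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(1)
inputs, labels = corrupt_and_select(tokens, corrupt_rate=0.4, predict_rate=0.2, rng=rng)
```

Every loss target is a corrupted position, but not every corrupted position contributes a loss term, which is exactly the regime the ablations explore.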
The authors also critically reevaluate the 80-10-10 corruption strategy introduced in BERT, wherein a portion of the selected tokens are kept as the original token or replaced with random tokens during pre-training rather than masked. Empirically, they found that this strategy does not outperform training with [MASK] tokens alone, suggesting that same-token substitutions and random corruptions may not be necessary when MLMs are fine-tuned on complete, corruption-free inputs.
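The two schemes can be contrasted in a few lines (an illustrative sketch; the function names and `vocab` argument are assumptions, not the paper's code):

```python
import random

MASK = "[MASK]"

def corrupt_80_10_10(tokens, positions, vocab, rng):
    """BERT's 80-10-10 rule over the selected positions."""
    out = list(tokens)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            out[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token (same-token prediction)
    return out

def corrupt_all_mask(tokens, positions):
    """The simpler alternative the paper finds works at least as well:
    replace every selected position with [MASK]."""
    out = list(tokens)
    for i in positions:
        out[i] = MASK
    return out

tokens = "the quick brown fox jumps over the lazy dog".split()
positions = [1, 3, 6]
masked = corrupt_all_mask(tokens, positions)
```

In both cases the loss is still computed at every selected position; only the input-side corruption differs.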
In summary, this paper provides compelling insights into MLM pre-training, showing that higher masking rates can be beneficial, particularly for larger models, and that the trade-off between corruption and prediction rates should be considered when optimizing MLM training. These findings have practical implications for training efficiency and theoretical implications for our understanding of how language models learn. Future work could continue to probe the limits of masking, potentially yielding models that exploit higher masking rates without sacrificing downstream performance.