Analyzing Masking Rates in Masked Language Modeling
The paper "Should You Mask 15% in Masked Language Modeling?" by Wettig et al. challenges conventional assumptions about the optimal choice of masking rate in masked language models (MLMs). The practice of masking 15% of tokens has been pervasive across model sizes and masking strategies, largely based on the belief that masking more would leave too little context to learn effective representations, while masking less would reduce training efficiency. This paper revisits that norm and provides evidence that larger models benefit from higher masking rates, contrary to the prevalent strategy.
The authors conducted an array of experiments with BERT-large models, fine-tuning on standard benchmarks such as GLUE and SQuAD, and found that a masking rate of 40% improved downstream performance over the 15% baseline. Remarkably, even an extreme masking rate of 80% retained about 95% of downstream task performance, suggesting that the optimal masking rate scales with model capacity rather than being fixed at a universal value. This observation undercuts the traditional justification for limiting mask rates, implying that recent advances in model architectures and capacity allow learning under far more challenging corruption regimes than previously assumed.
Delving deeper into masking strategies, the paper establishes that different strategies have different optimal rates. Uniform masking, the simplest and most common approach, benefits from higher masking rates more than advanced strategies like span or PMI masking do. At higher rates, uniformly chosen mask positions are more likely to cover whole spans and n-grams by chance, inadvertently approximating the effects of the more sophisticated strategies with far less machinery.
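The span-coverage argument can be checked with a quick simulation (an illustrative sketch, not code from the paper): count how often uniformly sampled mask positions happen to land on adjacent tokens, i.e. how many 2-grams are fully masked, at 15% versus 40% masking.

```python
import random

def uniform_mask(num_tokens, mask_rate, rng):
    """Select mask positions uniformly at random, without replacement."""
    k = int(num_tokens * mask_rate)
    return set(rng.sample(range(num_tokens), k))

def avg_masked_bigrams(seq_len, rate, trials, rng):
    """Average number of adjacent masked pairs (fully masked 2-grams) per sequence."""
    total = 0
    for _ in range(trials):
        pos = uniform_mask(seq_len, rate, rng)
        total += sum(1 for p in pos if p + 1 in pos)
    return total / trials

rng = random.Random(0)
bigrams_15 = avg_masked_bigrams(512, 0.15, 200, rng)
bigrams_40 = avg_masked_bigrams(512, 0.40, 200, rng)
print(f"15% masking: ~{bigrams_15:.1f} masked bigrams per 512-token sequence")
print(f"40% masking: ~{bigrams_40:.1f} masked bigrams per 512-token sequence")
```

With k masked positions out of n, the expected count is roughly k(k-1)/n, so the 40% setting covers several times as many contiguous pairs as the 15% setting, which is the intuition behind uniform masking mimicking span masking at high rates.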
Another pivotal contribution of the paper is the conceptual disentangling of masking into corruption and prediction rates. The corruption rate is the fraction of context tokens altered or removed, while the prediction rate is the fraction of tokens the model is trained to predict from the corrupted context. Through ablation experiments, the paper shows that higher prediction rates are advantageous because they generate more learning signal per sequence, whereas higher corruption rates make the task harder by leaving less intact context for each prediction. This analysis refines our understanding of the masking mechanism, revealing a nuanced interaction in which the benefits of more prediction can outweigh the costs of more corruption.
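A minimal sketch of the decoupling, under assumed helper names (the paper's implementation differs): corrupt one fraction of tokens, then compute the loss on only a subset of the corrupted positions, so the prediction rate can be varied independently below the corruption rate.

```python
import random

MASK = "[MASK]"

def corrupt_and_select(tokens, corrupt_rate, predict_rate, rng):
    """Decouple the two rates: corrupt `corrupt_rate` of the tokens,
    but return loss targets for only `predict_rate` of them.
    Standard MLM is the special case predict_rate == corrupt_rate."""
    assert predict_rate <= corrupt_rate
    n = len(tokens)
    corrupted = rng.sample(range(n), max(1, int(n * corrupt_rate)))
    targets = rng.sample(corrupted, max(1, int(n * predict_rate)))
    inputs = [MASK if i in set(corrupted) else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in targets}  # positions the loss is computed on
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(1)
inputs, labels = corrupt_and_select(tokens, corrupt_rate=0.4, predict_rate=0.2, rng=rng)
```

Every loss target is a corrupted position, but not every corrupted position contributes a loss term, which is exactly the regime the ablations explore.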
The authors also critically reevaluate the 80-10-10 corruption strategy introduced in BERT, wherein a portion of the selected tokens are kept as the original token or replaced with random tokens during pre-training rather than masked. Empirically, they found that this strategy does not outperform training with [MASK] tokens alone, suggesting that same-token substitutions and random corruptions may not be necessary when MLMs are fine-tuned on complete, corruption-free inputs.
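The two schemes can be contrasted in a few lines (an illustrative sketch; the function names and `vocab` argument are assumptions, not the paper's code):

```python
import random

MASK = "[MASK]"

def corrupt_80_10_10(tokens, positions, vocab, rng):
    """BERT's 80-10-10 rule over the selected positions."""
    out = list(tokens)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            out[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token (same-token prediction)
    return out

def corrupt_all_mask(tokens, positions):
    """The simpler alternative the paper finds works at least as well:
    replace every selected position with [MASK]."""
    out = list(tokens)
    for i in positions:
        out[i] = MASK
    return out

tokens = "the quick brown fox jumps over the lazy dog".split()
positions = [1, 3, 6]
masked = corrupt_all_mask(tokens, positions)
```

In both cases the loss is still computed at every selected position; only the input-side corruption differs.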
In summary, this paper provides compelling insights into MLM pre-training, showing that higher masking rates can be beneficial, particularly for larger models, and that the trade-off between corruption and prediction rates should be considered when optimizing MLM training. These findings have practical implications for training efficiency and theoretical implications for our understanding of how language models learn. Future work could continue to probe the limits of masking, potentially yielding models that exploit higher masking rates without sacrificing downstream performance.