Dynamic Masking Rate Schedules for MLM Pretraining (2305.15096v3)

Published 24 May 2023 in cs.CL and cs.AI

Abstract: Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.
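
A minimal sketch of how such a linearly decaying masking rate schedule could be wired into MLM pretraining. The 0.30 → 0.15 endpoints, the function names, and the simplified masking step (no 80/10/10 replacement split) are illustrative assumptions, not the paper's exact recipe:

```python
import torch


def masking_rate(step: int, total_steps: int,
                 start_rate: float = 0.30, end_rate: float = 0.15) -> float:
    """Linearly decay the MLM masking rate from start_rate to end_rate.

    The endpoint values here are assumptions for illustration, not the
    paper's reported configuration.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return start_rate + frac * (end_rate - start_rate)


def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, rate: float,
                special_tokens_mask: torch.Tensor):
    """Mask roughly `rate` of non-special tokens and build MLM labels.

    For brevity every selected position becomes [MASK]; BERT's 80/10/10
    mask/random/keep split is omitted.
    """
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, rate)
    probs.masked_fill_(special_tokens_mask.bool(), 0.0)  # never mask [CLS]/[SEP]/pad
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                 # ignore unmasked positions in the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels
```

At each training step, a trainer would call masking_rate(step, total_steps) and pass the result to mask_tokens before the forward pass, so early steps see a higher corruption rate than late steps.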

Authors (5)
  1. Zachary Ankner (10 papers)
  2. Naomi Saphra (34 papers)
  3. Davis Blalock (10 papers)
  4. Jonathan Frankle (37 papers)
  5. Matthew L. Leavitt (9 papers)
Citations (2)
