PMI-Masking: Principled masking of correlated spans (2010.01825v1)

Published 5 Oct 2020 in cs.LG, cs.CL, and stat.ML

Abstract: Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.
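
For intuition, here is a minimal sketch of the core idea in the bigram case: rank bigrams by their corpus PMI and mask the top-ranked ones as whole spans rather than as independent tokens. This is an illustrative assumption, not the authors' implementation; the paper generalizes PMI to longer n-grams and samples masked spans to a fixed masking budget during pretraining.

```python
# Illustrative sketch (hypothetical, not the paper's code): estimate bigram PMI
# from corpus counts and mask high-PMI bigrams jointly.
import math
from collections import Counter

corpus = [
    "the united states of america".split(),
    "she lives in the united states".split(),
    "news coverage about the united states".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(tuple(sent[i:i + 2]) for sent in corpus for i in range(len(sent) - 1))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    # PMI(w1, w2) = log p(w1, w2) / (p(w1) * p(w2)), estimated from counts.
    p_joint = bigrams[(w1, w2)] / n_bi
    return math.log(p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

# Treat the highest-PMI bigrams as a vocabulary of correlated spans to mask.
top_spans = set(sorted(bigrams, key=lambda bg: pmi(*bg), reverse=True)[:2])

def mask_sentence(tokens, mask_token="[MASK]"):
    # Mask whole high-PMI bigrams jointly; in actual pretraining, spans would
    # be sampled up to a masking budget (e.g. ~15% of tokens) rather than
    # masked exhaustively as done here for illustration.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in top_spans:
            out += [mask_token, mask_token]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(mask_sentence("she lives in the united states".split()))
```

Masking the whole collocation ("united states") denies the model the shortcut of predicting one masked token from its unmasked, highly correlated neighbor, which is the inefficiency the abstract attributes to uniform random masking.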

Authors (7)
  1. Yoav Levine (24 papers)
  2. Barak Lenz (8 papers)
  3. Opher Lieber (5 papers)
  4. Omri Abend (75 papers)
  5. Kevin Leyton-Brown (57 papers)
  6. Moshe Tennenholtz (97 papers)
  7. Yoav Shoham (22 papers)
Citations (68)