Masked Language Models (MLMs)

Updated 24 December 2025
  • Masked Language Models (MLMs) are bidirectional Transformer-based pretraining methods that predict masked tokens to build rich, contextual representations for various applications.
  • They employ a masking strategy in which selected tokens are replaced with a [MASK] token and reconstructed from both left and right context, with metrics such as pseudo-log-likelihood (PLL) used to evaluate performance.
  • MLMs drive advances in data augmentation, text generation, and multimodal fusion, while ongoing research addresses challenges like semantic ambiguity, bias, and representation deficiencies.

Masked language models (MLMs) are a pretraining paradigm in which a neural network, typically a bidirectional Transformer encoder, is trained to reconstruct masked-out tokens in input sequences by conditioning on their surrounding context. Since their introduction, MLMs have underpinned major advances in self-supervised representation learning for natural language, code, and biological sequences. This approach has enabled deep models to acquire powerful, generalizable contextual representations, supporting state-of-the-art results across a wide spectrum of natural-language understanding tasks and beyond.

1. Formal Definition and Training Objective

Given an input sequence $X = (x_1, \dots, x_n)$, a random subset of positions $M \subset \{1, \dots, n\}$ is selected. Each $x_i$ with $i \in M$ is substituted by a special [MASK] token to form the corrupted input $\tilde{X}$. The MLM objective is to maximize the conditional log-likelihood of the original tokens at the masked positions, i.e., to minimize

$$\mathcal{L}_{\text{MLM}}(\theta) = -\mathbb{E}_{X, M}\left[\frac{1}{|M|} \sum_{i \in M} \log P_\theta\!\left(x_i \mid \tilde{X}\right)\right]$$

where $\theta$ denotes the model parameters. This task compels the encoder to integrate both left and right context, yielding deep, bidirectional, context-dependent token representations (Ma, 2023).

Standard masking strategies in text models such as BERT use $|M|/n \approx 0.15$. Advanced variants adapt the masking structure for domain specificity, span prediction, or semantics-driven masking in cross-modal contexts (Bitton et al., 2021).
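
As a concrete illustration, below is a minimal PyTorch sketch of this corruption-and-reconstruction objective. The 80/10/10 split among [MASK], random, and unchanged replacements follows the original BERT recipe; the vocabulary constants and masking rate are illustrative.

```python
import torch
import torch.nn.functional as F

MASK_ID, VOCAB_SIZE, IGNORE = 103, 30522, -100   # illustrative BERT-base values

def mask_tokens(token_ids, mask_prob=0.15):
    """BERT-style corruption: select ~15% of positions, replace 80% of them with
    [MASK], 10% with a random token, and leave 10% unchanged."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob
    labels[~selected] = IGNORE                                # loss only on masked positions

    corrupted = token_ids.clone()
    replace = selected & (torch.rand(token_ids.shape) < 0.8)
    corrupted[replace] = MASK_ID
    randomize = selected & ~replace & (torch.rand(token_ids.shape) < 0.5)
    corrupted[randomize] = torch.randint(VOCAB_SIZE, token_ids.shape)[randomize]
    return corrupted, labels

def mlm_loss(logits, labels):
    """Cross-entropy over masked positions only, i.e. a Monte Carlo estimate of L_MLM."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=IGNORE)
```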

2. Probabilistic Structure and Scoring Metrics

Unlike autoregressive models, MLMs do not explicitly define a consistent joint distribution over complete sequences. Instead, each masked-token conditional $P_\theta(x_i \mid \tilde{X})$ is modeled given all unmasked context, but these conditionals can be mutually incompatible across masking patterns (Young et al., 2022, Hennigen et al., 2023). This has motivated specialized scoring metrics:

| Metric | Mathematical Definition | Key Properties |
| --- | --- | --- |
| PLL | $\sum_{i=1}^{n} \log P_\theta(x_i \mid X_{[i=\text{[MASK]}]})$ | Sums conditionals obtained by masking each token in turn |
| PPPL | $\exp\!\left(-\frac{1}{N}\sum_{x\in \mathcal{W}} \text{PLL}(x)\right)$ | Pseudo-perplexity over a dataset $\mathcal{W}$ with $N$ tokens |
| PLL-word-l2r | $\sum_{w} \sum_{t} \log P_\theta\!\left(s_{w,t} \mid S_{\setminus \{s_{w,t'} \mid t' \geq t\}}\right)$ | Masks the current and all rightward subtokens within each word |
| AUL/AULA | $\frac{1}{\lvert S\rvert} \sum_{i=1}^{\lvert S\rvert} \alpha_i \log P_\theta(w_i \mid S)$ | Predicts all tokens of the unmasked input; AULA weights tokens by attention ($\alpha_i$) |

PLL, as shown by Salazar et al. (2019), closely approximates unsupervised acceptability and outperforms autoregressive alternatives for grammaticality judgments, rescoring, and pseudo-probability applications. Modifications such as PLL-word-l2r correct the overestimated plausibility of multi-token and out-of-vocabulary words (Kauf et al., 2023). The incompatibility of MLM conditionals with a single true joint distribution, the so-called “inconsistency phenomenon”, has been verified empirically and affects confidence estimation and inference (Young et al., 2022, Hennigen et al., 2023).
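
As an illustration of PLL scoring, the following hedged sketch masks each token in turn with a Hugging Face masked language model; the checkpoint name is illustrative, and a real implementation would batch the n masked copies rather than loop over them.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")          # illustrative checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """PLL(X): sum of log P(x_i | X with position i masked) over all tokens."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):                          # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tok.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Higher (less negative) PLL roughly corresponds to higher acceptability.
print(pseudo_log_likelihood("The cat sat on the mat."))
```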

3. Model Architectures and Representation Dynamics

MLMs are typically realized as encoder-only bidirectional Transformers. During pretraining, models process the corrupted sequence and learn to predict masked tokens. This scheme creates a unique [MASK] embedding not encountered in downstream tasks, inducing a subspace mismatch between pretraining and finetuning (Meng et al., 2023, Zheng et al., 23 Jan 2025).

Two principal deficiencies have emerged:

  • Representation deficiency: MLM pretraining allocates embedding subspace to [MASK], effectively reducing dimensionality available to real tokens and yielding suboptimal expressivity (Meng et al., 2023).
  • Corrupted semantics: Masking can render the context ambiguous, supporting multiple plausible token completions and increasing prediction entropy. The “repeated-MLM” experiment, for example, demonstrates that semantic corruption, not just the rate of [MASK] tokens, drives prediction uncertainty and degrades downstream accuracy (Zheng et al., 23 Jan 2025).

Architectural innovations have mitigated these issues. MAE-LM eliminates [MASK] from encoder inputs entirely, using a separate prediction decoder, thus restoring full dimension utilization for real tokens (Meng et al., 2023). ExLM augments each masked token to multiple latent states, increasing context capacity and directly reducing semantic multimodality, demonstrated by lower reconstruction entropy and improved downstream task performance (Zheng et al., 23 Jan 2025).
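
The following is a deliberately simplified sketch of the idea of keeping [MASK] out of the encoder, in the spirit of MAE-LM but not the published architecture: the encoder sees only visible tokens at their original positions, and a small decoder reconstructs the masked tokens from positional queries. Dimensions, layer counts, and the positional scheme are placeholder choices.

```python
import torch
import torch.nn as nn

class MaskFreeEncoderMLM(nn.Module):
    """Sketch: encode only unmasked tokens; decode masked positions from position queries."""
    def __init__(self, vocab=30522, dim=256, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, token_ids, visible_idx, masked_idx):
        # Encoder input: visible tokens only, so no [MASK] embedding occupies its subspace.
        vis = self.tok_emb(token_ids.gather(1, visible_idx)) + self.pos_emb(visible_idx)
        memory = self.encoder(vis)
        # Decoder queries: positional embeddings of the masked slots attend over `memory`.
        queries = self.pos_emb(masked_idx)
        return self.lm_head(self.decoder(queries, memory))        # logits at masked positions
```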

4. Practical Variants and Extensions

Numerous enhancements and alternatives have been proposed:

  • Efficient pretraining: Delaying the insertion of [MASK] tokens until later layers, together with separate encoder/decoder stages and high masking rates, yields computational savings of up to 32% with no loss on GLUE (Liao et al., 2022).
  • Noise-robust pretraining: Warped Language Models (WLMs) introduce INSERT and DROP operations to simulate ASR noise (insertions, deletions) in addition to MASK, KEEP, and RAND, thereby increasing SLU robustness under noisy transcriptions (Namazifar et al., 2020).
  • Contrastive objectives: Mirror-BERT leverages self-supervised contrastive identity pairing, dramatically increasing semantic similarity performance at both lexical and sentential levels without manual annotation (Liu et al., 2021); a minimal sketch of this idea appears after this list.
  • Auxiliary and alternative objectives: “Manipulated word detection” (SHUFFLE+RANDOM), token type prediction, and first-character prediction offer computational advantages and competitive results, especially in reduced-parameter regimes (Yamaguchi et al., 2021).
  • Representation learning extensions: TACO introduces a global sentence-level contrastive term, shown to accelerate convergence and improve contextual integration (Fu et al., 2022).
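
Below is a minimal sketch of contrastive identity pairing in the spirit of Mirror-BERT, using only dropout to create two views of the identical input (the original method additionally applies random span masking); the checkpoint and [CLS] pooling are illustrative choices.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")          # illustrative checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased").train()  # keep dropout active

def contrastive_identity_loss(sentences, temperature=0.05):
    """Encode each sentence twice (different dropout masks give two views of the same
    input) and pull the two views together with InfoNCE against in-batch negatives."""
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    z1 = encoder(**batch).last_hidden_state[:, 0]                 # [CLS] view 1
    z2 = encoder(**batch).last_hidden_state[:, 0]                 # [CLS] view 2
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                              # cosine similarities
    return F.cross_entropy(logits, torch.arange(len(sentences)))

loss = contrastive_identity_loss(["a photo of a cat", "stock prices fell sharply"])
loss.backward()
```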

5. Applications: Generation, Data Augmentation, Cross-Modal Fusion

MLMs have broad utility beyond classical language understanding:

  • Data augmentation: Mask-based augmentation leverages pretrained MLMs to generate synthetic training data by masking and refilling labeled examples, improving accuracy and generalization in data-scarce regimes across classification, NER, QA, and summarization (Ma, 2023); a hedged sketch appears after this list.
  • Text generation: Iterative “mask-predict” decoding enables flexible token generation anywhere in the sequence, often producing more coherent and human-acceptable outputs than left-to-right autoregressors for medical, narrative, and authorship-verification domains. Such generation is found superior across BLEU, ROUGE, METEOR, and BERTScore metrics, and is robust in downstream transfer tasks (Micheletti et al., 21 May 2024).
  • Vision+Language modeling: In LXMERT and variants, semantics-aware masking strategies (object tokens, high-concreteness words) drive more effective multimodal pretraining, maximizing utility from the visual context, especially in low-resource settings (Bitton et al., 2021).
  • Unsupervised editing and style transfer: Padded-MLMs handle variable-length infills and enable unsupervised transformations between textual domains using domain-conditioned likelihood disagreement (Malmi et al., 2020).

6. Bias, Fairness, and Intrinsic Evaluation

MLMs encode and may amplify real-world social biases absorbed from pretraining corpora (Liu et al., 2023). Early metrics quantified bias using only masked pseudo-likelihood, but recent work advocates holistic measures that assess distributions over log-likelihoods for stereotype and anti-stereotype samples, modeling them as Gaussians and comparing via Kullback–Leibler and Jensen–Shannon divergences (KLDivS, JSDivS) (Liu et al., 2023). These provide interpretable, stable, and variance-sensitive bias assessments.
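
For reference, when the stereotype and anti-stereotype score distributions are modeled as univariate Gaussians $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$, the Kullback–Leibler term admits the standard closed form

$$D_{\mathrm{KL}}(P \,\|\, Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2},$$

whereas the Jensen–Shannon divergence between Gaussians has no closed form and is typically estimated numerically.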

In multilingual contexts, the Multilingual Bias Evaluation (MBE) score is a scalable method requiring only English attribute lists and parallel corpora, enabling cross-lingual bias detection without manual annotation (Kaneko et al., 2022). For intrinsic model auditing, metrics such as AUL (All Unmasked Likelihood) and AULA (attention-weighted variant) offer more robust, frequency-independent bias measurements than prior masked-only approaches (Kaneko et al., 2021).

7. Limitations and Ongoing Developments

MLMs provide incomplete joint sequence models: the set of conditionals cannot in general be derived from a true probability distribution, leading to practical consequences such as inconsistent predictions for different masking patterns (Young et al., 2022, Hennigen et al., 2023). The “Ensemble of Conditionals” (EOC) strategy aggregates predictions from multiple mask configurations, recovering up to 3% accuracy improvements with minimal inference overhead.
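
As an illustration of the general ensemble-of-conditionals idea (not necessarily the cited procedure), one can average the model's conditional at a target position over several masking patterns that each additionally hide a few random context positions; the checkpoint, number of views, and number of extra masks below are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")          # illustrative checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def eoc_distribution(sentence: str, target_pos: int, n_views: int = 4, extra_masks: int = 2):
    """Average P(x_target | context) over several mask configurations that hide the
    target plus a few extra random context positions."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    probs = torch.zeros(model.config.vocab_size)
    with torch.no_grad():
        for _ in range(n_views):
            masked = ids.clone()
            masked[target_pos] = tok.mask_token_id
            extra = torch.randperm(len(ids) - 2)[:extra_masks] + 1   # avoid [CLS]/[SEP]
            masked[extra] = tok.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, target_pos]
            probs += torch.softmax(logits, dim=-1)
    return probs / n_views

dist = eoc_distribution("The capital of France is Paris .", target_pos=6)
print(tok.convert_ids_to_tokens(dist.topk(3).indices.tolist()))
```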

Other limitations include masking-induced context corruption, especially at high masking rates; the inefficiency of token-by-token PLL scoring, where each token requires a full inference pass; and the discrepancy introduced by a [MASK] embedding that is absent during downstream use. Hybrid architectures, span-based masking (as in T5), and explicit context enhancement (as in ExLM) are active areas of research.


These developments collectively demonstrate that masked language modeling is not merely a “fill-in-the-blank” pretext, but a deeply influential methodological framework. It continues to evolve as its weaknesses are understood and as new challenges—including efficient pretraining, generative modeling, robustness, and fairness—are addressed via advanced architectural modifications and objective function design.
