Whole-Word Masking (WWM) in NLP
- Whole-Word Masking (WWM) is a technique that masks complete words to maintain semantic integrity, preventing models from exploiting partial word cues during training.
- WWM is particularly impactful in languages such as Chinese, where word boundaries are not explicitly marked, supporting models such as BERT-wwm and MacBERT through explicit word segmentation.
- Empirical evaluations show that WWM improves performance on NLP tasks, yielding higher Exact Match and F1 scores on benchmarks like CMRC 2018 and DRCD.
Whole-Word Masking (WWM) is a masked language modeling technique that, rather than masking subword units independently, masks all tokens belonging to a word whenever any of its constituent tokens is selected. This method was designed to address the limitations of subword-level masking in pre-training LLMs, particularly for languages like Chinese where a "word" may span multiple characters and the demarcation of word boundaries is non-trivial. The adoption of WWM in Chinese BERT and its variants, as well as in domain-specific adaptations, has demonstrated measurable improvements in capturing semantic cohesion and enhancing downstream performance across a range of NLP tasks.
1. Core Principles and Motivation
Whole-Word Masking emerged in response to the observation that traditional BERT pre-training, which randomly masks subword units from a WordPiece vocabulary, enables the model to exploit partial information about words. For example, if only a suffix token in a split word is masked, the model may “cheat” by leveraging visible root or prefix tokens, facilitating easier recovery and weakening the semantic learning signal. WWM addresses this by enforcing that, if any subword belonging to a word (as defined by the tokenizer or, in Chinese, by word segmentation tools) is selected for masking, all subwords of that word are masked together. In English, this prevents predictions based solely on morpheme context; in Chinese, WWM is operationalized following pre-segmentation of the sentences, so multi-character words are masked as a unit, preserving their semantic integrity during pre-training (Cui et al., 2019).
This strategy aligns the masking process with genuine linguistic units, enabling the model to better capture compositional semantics, particularly important in morphologically rich or non-segmented languages.
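To make the masking rule concrete, here is a minimal sketch assuming standard WordPiece conventions in which a "##" prefix marks a continuation piece; `group_wordpieces` and `whole_word_mask` are illustrative names rather than functions from the cited implementations, and BERT's 80/10/10 mask/random/keep refinement is omitted for brevity.

```python
import random

def group_wordpieces(tokens):
    """Group WordPiece tokens into whole-word units.

    A token starting with "##" continues the previous word; anything else
    starts a new word. Returns a list of token-index lists, one per word.
    """
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words

def whole_word_mask(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Select whole words until ~mask_ratio of tokens are covered,
    then mask every WordPiece belonging to each selected word."""
    rng = random.Random(seed)
    words = group_wordpieces(tokens)
    rng.shuffle(words)

    budget = max(1, int(round(len(tokens) * mask_ratio)))
    masked = list(tokens)
    labels = [None] * len(tokens)   # original tokens at masked positions
    covered = 0
    for word in words:
        if covered >= budget:
            break
        for i in word:              # mask the whole word, not a single piece
            labels[i] = masked[i]
            masked[i] = mask_token
        covered += len(word)
    return masked, labels

# "philosophy" split by WordPiece as "phil ##oso ##phy": all three pieces
# are masked together once the word is selected.
tokens = ["the", "phil", "##oso", "##phy", "of", "language", "models"]
print(whole_word_mask(tokens, mask_ratio=0.3))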
2. Implementation in Chinese NLP and Model Variants
Implementing WWM in Chinese presents unique challenges due to the lack of whitespace-delimited word boundaries. The standard workflow involves the following steps:
- Chinese Word Segmentation: Apply a segmentation tool (e.g., LTP or TexSmart) to identify word boundaries in raw Chinese text.
- Tokenization Compatibility: Retain WordPiece or BPE tokenization for model compatibility, but associate masks based on segmented word boundaries.
- WWM Application: At each masking iteration, sample a set of words (not individual tokens) according to the masking ratio (commonly 15%); for each sampled word, mask all constituent tokens (a minimal sketch of this workflow follows the list).
- Extended Strategies: Enhanced versions (e.g., MacBERT) supplement WWM with N-gram masking, distributing the masking probability across unigrams, bigrams, and trigrams (e.g., 40%/30%/30%) to increase linguistic variance in the pre-training data (Cui et al., 2019).
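As referenced above, the following is a minimal sketch of the Chinese word-level workflow, assuming the sentence has already been segmented (e.g., by LTP) and that the tokenizer splits each word into its individual characters; `align_words_to_tokens` and `chinese_whole_word_mask` are hypothetical helpers, and production pipelines would instead align words via the tokenizer's offset mapping and apply BERT's 80/10/10 replacement scheme.

```python
import random

def align_words_to_tokens(seg_words):
    """Map segmented Chinese words to spans of character-level token indices.

    Assumes one token per character (the common case for Chinese BERT
    vocabularies); real pipelines align via tokenizer offset mappings.
    """
    spans, pos = [], 0
    for word in seg_words:
        n = len(word)
        spans.append(list(range(pos, pos + n)))
        pos += n
    return spans

def chinese_whole_word_mask(seg_words, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Sample *words* (not characters) until ~mask_ratio of tokens are masked."""
    rng = random.Random(seed)
    tokens = [ch for word in seg_words for ch in word]
    spans = align_words_to_tokens(seg_words)
    rng.shuffle(spans)

    budget = max(1, int(round(len(tokens) * mask_ratio)))
    masked, covered = list(tokens), 0
    for span in spans:
        if covered >= budget:
            break
        for i in span:                     # mask every character of the word
            masked[i] = mask_token
        covered += len(span)
    return masked

# Pre-segmented sentence (e.g., from LTP): 使用 / 语言 / 模型
print(chinese_whole_word_mask(["使用", "语言", "模型"], mask_ratio=0.3))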
In the transformer forward pass, the input sequence passes through the embedding and transformer layers, and the masked positions contribute to the MLM cross-entropy loss:

$$\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log p_\theta\left(x_i \mid \tilde{x}\right),$$

where $\mathcal{M}$ is the set of masked positions, $|\mathcal{M}|$ is the number of masked positions, $x_i$ is the true token at position $i$, $\tilde{x}$ is the masked input sequence, and $p_\theta(\cdot \mid \tilde{x})$ is the predicted distribution over the vocabulary.
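In frameworks such as PyTorch, this loss is typically computed by assigning an ignore value to unmasked positions so that only masked positions contribute; the snippet below is a minimal sketch using the common -100 labeling convention, not code from the cited models.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Average cross-entropy over masked positions only.

    logits: (batch, seq_len, vocab_size) scores from the prediction head.
    labels: (batch, seq_len) original token ids at masked positions,
            -100 everywhere else (the conventional "ignore" value).
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.view(-1, vocab_size),   # flatten batch and sequence dims
        labels.view(-1),
        ignore_index=-100,             # unmasked positions do not contribute
    )

# Toy shapes: batch of 2, sequence of 4, vocabulary of 10.
logits = torch.randn(2, 4, 10)
labels = torch.full((2, 4), -100)
labels[0, 1] = 3                       # only two positions are masked
labels[1, 2] = 7
print(mlm_loss(logits, labels).item())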
The WWM approach was operationalized in a series of open-source Chinese models, including BERT-wwm, RoBERTa-wwm, ELECTRA, RBT, and notably MacBERT, which replaces masked tokens with similar words so that pre-training more closely resembles a correction task.
3. Empirical Impact and Comparative Evaluation
Empirical evaluations demonstrate that WWM generally yields improvements over subword- or character-masked models across numerous Chinese NLP tasks, including machine reading comprehension (MRC), classification, and sentence pair tasks (Cui et al., 2019). On datasets like CMRC 2018 and DRCD, WWM-enhanced models (e.g., BERT-wwm-ext, RoBERTa-wwm-ext) showed higher Exact Match (EM) and F1 scores compared to their non-WWM counterparts, and these gains persist in robustly optimized architectures such as MacBERT.
However, the significance of these improvements is task-dependent. For instance, sentence pair classification tasks displayed more moderate gains, likely due to their reliance on longer context and discourse structure, where word-level masking exerts less direct influence on the semantic objective. Nonetheless, on tasks sensitive to fine-grained word meaning or semantic compositionality, WWM delivers consistent and measurable gains.
4. Extensions and Adaptations in Specialized Domains
Domain-specific variations of WWM further emphasize its adaptability:
- Biomedical Text Mining: WWM is extended to "whole entity masking" and "whole span masking," in which the units masked correspond to named biomedical entities or key multi-token phrases extracted with entity recognition and knowledge graphs. This proves especially beneficial for long-tail or rare terminology, as forcing the model to predict entire scientific entities improves performance in domain-specific language understanding and named entity recognition (e.g., MC-BERT on ChineseBLUE) (Zhang et al., 2020).
- Sentiment Analysis: In domain-specific applications such as car review sentiment classification, WWM ensures that product names and terminology (often multi-character) are masked as a unit. Empirical results show improved classification accuracy and macro-F1, especially when coupled with further techniques such as adversarial training (Liu et al., 2022).
These results underscore that the core advantage of WWM lies in enforcing the integrity of semantic units relevant to the domain or downstream task, whether linguistically or contextually defined.
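As a concrete illustration of the entity-level extension, the sketch below masks complete entity spans before spending any remaining budget on ordinary tokens; `whole_entity_mask` and the toy entity spans are hypothetical and do not reproduce the MC-BERT implementation.

```python
import random

def whole_entity_mask(tokens, entity_spans, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Mask complete entity spans first, then fall back to single tokens.

    tokens:       list of (sub)word tokens.
    entity_spans: list of (start, end) token-index pairs, end exclusive,
                  e.g. from an NER model or a knowledge-graph lookup.
    """
    rng = random.Random(seed)
    budget = max(1, int(round(len(tokens) * mask_ratio)))
    masked = list(tokens)
    covered = set()

    # 1) Prefer entire entities so the model must reconstruct full terms.
    spans = list(entity_spans)
    rng.shuffle(spans)
    for start, end in spans:
        if len(covered) >= budget:
            break
        for i in range(start, end):
            masked[i] = mask_token
            covered.add(i)

    # 2) Spend any remaining budget on ordinary tokens outside entities.
    rest = [i for i in range(len(tokens)) if i not in covered]
    rng.shuffle(rest)
    for i in rest:
        if len(covered) >= budget:
            break
        masked[i] = mask_token
        covered.add(i)
    return masked

tokens = ["patients", "with", "type", "2", "diabetes", "received", "met", "##formin"]
# Entity spans (token indices): "type 2 diabetes" and "met ##formin".
print(whole_entity_mask(tokens, [(2, 5), (6, 8)], mask_ratio=0.3))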
5. Masking Strategies: Comparisons and Design Choices
WWM is one of several sequence masking strategies that control the masking granularity in MLM objectives:
| Strategy | Masking Unit | Application Context |
|---|---|---|
| Token Masking | Individual token/subword | Default BERT, minimal context |
| Whole-Word Masking | All tokens in a word | BERT-wwm, RoBERTa-wwm, MacBERT |
| Entity Masking | Named entity | Biomedical/NER models |
| Phrase Masking | Multi-token phrases | Specialized semantic tasks |
| Span Masking | Contiguous token spans | SpanBERT/Specialized contexts |
| PMI-Masking | Correlated n-grams (PMI) | Data-driven, adaptive masking |
- WWM sits between simple token masking and more contextually or statistically driven approaches (e.g., PMI-Masking (Levine et al., 2020)), unifying the benefits of masking coherent units while maintaining computational simplicity.
- PMI-Masking extends WWM by masking any correlated span (not just words), identifying collocations via pointwise mutual information and achieving comparable or better performance in fewer training steps (a toy PMI-scoring sketch follows this list).
- Studies on masking length further highlight that downstream performance improves when the distribution of masked-span lengths during MLM training matches the distribution of target answer lengths in MRC datasets (e.g., short spans for extraction tasks), with WWM being particularly effective for short-span or word-level answers (Zeng et al., 2021).
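The toy sketch below scores adjacent token pairs by pointwise mutual information, the statistic PMI-Masking uses to identify collocations; the actual method builds a masking vocabulary of high-PMI n-grams over a large corpus, whereas this illustration only ranks bigrams in a tiny sample.

```python
import math
from collections import Counter

def bigram_pmi(corpus_tokens, min_count=2):
    """Score adjacent token pairs by pointwise mutual information:
    PMI(x, y) = log P(x, y) / (P(x) * P(y)).
    High-PMI bigrams are collocations that PMI-Masking treats as mask units.
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue                      # ignore rare, noisy pairs
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

corpus = ("machine learning models use machine learning methods "
          "while machine translation differs").split()
for pair, pmi in sorted(bigram_pmi(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 3))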
6. Limitations, Contingencies, and Hybridization
While WWM provides benefits on word-level semantic tasks, its appropriateness is not universal:
- In character-based languages like Chinese, masking entire words risks losing fine-grained information needed for tasks such as single-character correction or insertion. Comparative probing studies show that character-level masking (CLM) outperforms WWM in single-character error correction, while WWM excels when corrections span multi-character words. Hybrid objectives (combining CLM and WWM) can offer a more robust solution across error scales (Dai et al., 2022).
- For sequence-level or sentence-level downstream tasks, the differences between WWM and token-level masking become less pronounced after fine-tuning (Dai et al., 2022).
These findings suggest that the optimal masking granularity is task-dependent, with further room for adaptive strategies or curriculum-based approaches.
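One simple way to hybridize the two objectives, not necessarily the scheme used in the cited study, is to choose the masking granularity per training sequence; `hybrid_mask` and the probability `p_wwm` below are illustrative assumptions.

```python
import random

def hybrid_mask(seg_words, p_wwm=0.5, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Per sequence, flip between whole-word masking (multi-character units)
    and character-level masking (single characters), so the model sees
    both error scales during pre-training."""
    rng = random.Random(seed)
    tokens = [ch for word in seg_words for ch in word]
    budget = max(1, int(round(len(tokens) * mask_ratio)))
    masked = list(tokens)

    if rng.random() < p_wwm:
        # Whole-word branch: units are the segmented words.
        spans, pos = [], 0
        for word in seg_words:
            spans.append(list(range(pos, pos + len(word))))
            pos += len(word)
    else:
        # Character-level branch: each character is its own unit.
        spans = [[i] for i in range(len(tokens))]

    rng.shuffle(spans)
    covered = 0
    for span in spans:
        if covered >= budget:
            break
        for i in span:
            masked[i] = mask_token
        covered += len(span)
    return masked

print(hybrid_mask(["拼写", "纠错", "需要", "字", "级", "信息"], mask_ratio=0.3))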
7. Practical Considerations, Dynamic Masking, and Resources
Recent advancements, such as Chinese ModernBERT (Zhao et al., 2025), implement WWM with a hardware-aware BPE vocabulary that covers frequent Chinese affixes and compounds. The tokenization scheme marks subword continuations (e.g., via "##" prefixes), so word boundaries can be recovered accurately for WWM. The model also introduces a dynamic masking curriculum: the masking rate is set to 30% during early training to present the model with harder prediction tasks, then decays to 15% as training shifts toward refinement. This schedule aligns task difficulty with learning progress and helps balance global and local contextual prediction.
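A masking-rate curriculum of this kind can be expressed as a simple schedule function; the sketch below assumes a hold-then-linear-decay shape, with `warm_frac` and the decay form chosen for illustration rather than taken from the Chinese ModernBERT recipe.

```python
def masking_rate(step, total_steps, high=0.30, low=0.15, warm_frac=0.5):
    """Masking-rate curriculum in the spirit described above: hold a high
    rate for the first part of training, then decay linearly to the final
    rate. The split point and linear decay are illustrative choices."""
    warm_steps = int(total_steps * warm_frac)
    if step < warm_steps:
        return high
    progress = (step - warm_steps) / max(1, total_steps - warm_steps)
    return high - (high - low) * min(1.0, progress)

# Sketch of how a training loop would consume the schedule.
total = 100_000
for step in (0, 25_000, 50_000, 75_000, 100_000):
    print(step, round(masking_rate(step, total), 3))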
An open-source ecosystem of WWM-enabled pre-trained models is available for direct application and further research (e.g., https://github.com/ymcui/Chinese-BERT-wwm (Cui et al., 2019)).
8. Directions for Future Research
Findings from ablation studies highlight several areas for further exploration:
- Systematic optimization of the masking ratio and masking granularity, moving beyond fixed heuristic settings.
- Adaptive or learned masking strategies (e.g., gradient-based masking (Abdurrahman et al., 2023)), which dynamically select mask units according to criteria such as information content or downstream task relevance.
- Integration with more sophisticated lexical representations, including the blending of word- and character-level semantics, ensemble segmentation to mitigate tokenization errors (Li et al., 2022), and extension to entity- or phrase-level masking in specialized domains (Zhang et al., 2020).
These directions reflect growing consensus that masking strategies—not just model scale and architecture—remain a significant factor in optimizing LLM pre-training for diverse languages and tasks.