Keyword-Centric Masking
- Keyword-centric masking is a method that prioritizes statistically and contextually important tokens to focus neural network training on key domain-specific features.
- It integrates techniques such as TF–IDF, embedding-based relevance, transformer attention, and reinforcement learning to strategically select tokens for enhanced model performance.
- Empirical results show gains in domain adaptation, robust text classification, and speech enhancement metrics, demonstrating practical benefits over random masking.
Keyword-centric masking refers to a family of methods in which masking, weighting, or selection mechanisms within neural network training are strategically centered on “keywords”—tokens or features regarded as pivotal for discriminative, generative, or denoising tasks. In contrast to conventional random masking, these approaches select or prioritize the masking of tokens identified via their statistical, contextual, or task-specific salience, thereby sharpening model focus on domain- or context-relevant representations. This paradigm has seen successful applications in domain-adaptive pre-training, robust text classification, reinforcement learning-based model adaptation, and targeted speech separation and beamforming.
1. Approaches to Keyword Identification and Mask Selection
Keyword-centric masking begins with defining what constitutes a “keyword” within the target task or domain. Strategies vary by modality:
- Statistical metrics: Classical approaches use term frequency–inverse document frequency (TF–IDF), which computes, for each token $t$ in document $d$,
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)},$$
where $\mathrm{df}(t)$ is the document frequency of $t$, $\mathrm{tf}(t, d)$ is its count within $d$, and $N$ is the corpus size (Belfathi et al., 19 Feb 2024).
- Contextual or embedding-based relevance: For domain adaptation, KeyBERT is leveraged to extract the top-$k$ unigrams per document based on cosine similarity between token and document embeddings, with Maximal Marginal Relevance (MMR) for diversity control (Golchin et al., 2023); a scoring sketch covering both this route and TF–IDF follows this list.
- Transformer attention and class relevance: In “MASKER,” attention patterns and class-conditional frequencies are aggregated to select high-impact tokens for classification robustness (Moon et al., 2020).
- Policy networks and RL: Neural Mask Generator uses a learned Transformer-style policy network that directly scores token-level importance, optimizing mask selection through reinforcement learning with validation task performance as the reward signal (Kang et al., 2020).
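To illustrate the two unsupervised scoring routes above, the following sketch computes per-token TF–IDF scores with scikit-learn and extracts top-$k$ unigrams with KeyBERT under MMR; the toy corpus, `top_k`, and `diversity` values are illustrative choices rather than settings prescribed by the cited papers.

```python
# Sketch: unsupervised keyword scoring for mask selection.
# Assumes scikit-learn and keybert are installed; corpus and top_k are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from keybert import KeyBERT

corpus = [
    "The court granted the motion to dismiss the complaint.",
    "The defendant appealed the judgment to the circuit court.",
]

# --- Route 1: TF-IDF scores per token, per document ---
vectorizer = TfidfVectorizer(lowercase=True)
tfidf = vectorizer.fit_transform(corpus)            # shape: (n_docs, vocab_size)
vocab = vectorizer.get_feature_names_out()

def tfidf_keywords(doc_idx: int, top_k: int = 5):
    """Return the top_k highest TF-IDF tokens for one document."""
    row = tfidf[doc_idx].toarray().ravel()
    top = row.argsort()[::-1][:top_k]
    return [(vocab[i], float(row[i])) for i in top if row[i] > 0]

# --- Route 2: embedding-based keywords via KeyBERT + MMR ---
kw_model = KeyBERT()                                # default sentence-transformer backbone

def keybert_keywords(doc: str, top_k: int = 5):
    """Top_k unigrams ranked by embedding similarity, diversified with MMR."""
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 1),               # unigrams only
        use_mmr=True, diversity=0.5,                # Maximal Marginal Relevance
        top_n=top_k,
    )

print(tfidf_keywords(0))
print(keybert_keywords(corpus[0]))
```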
The mask selection procedure typically replaces standard uniform random token selection with a mechanism that either always chooses the top-scoring tokens (“TopN”) or samples in proportion to the keyword scores (“Rand”); the table below summarizes these strategies, and a selection sketch follows it:
| Method | Scoring/Selection Strategy | Typical Application |
|---|---|---|
| TF–IDF TopN | Highest TF–IDF per doc/corpus | Domain adaptation |
| Embedding-based | KeyBERT + MMR | In-domain pre-training |
| Attention-based | Transformer self-attention patterns | Text classification |
| RL policy | Learned by reward via task metric | Task-specific MLM |
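A minimal sketch of the two selection modes, assuming a per-token score array has already been produced by one of the scorers above; the masking budget and example scores are arbitrary illustrations, not values from the cited work.

```python
import numpy as np

def select_mask_positions(scores: np.ndarray, budget: int,
                          mode: str = "topn", rng=None) -> np.ndarray:
    """Pick `budget` token positions to mask from per-token keyword scores.

    mode="topn": deterministically take the highest-scoring positions.
    mode="rand": sample positions with probability proportional to score,
                 which re-randomizes the mask set each epoch.
    """
    rng = rng or np.random.default_rng()
    if mode == "topn":
        return np.argsort(scores)[::-1][:budget]
    probs = scores / scores.sum()                   # assumes non-negative scores
    return rng.choice(len(scores), size=budget, replace=False, p=probs)

# Example: mask 3 of 10 positions under each strategy.
scores = np.array([0.1, 0.9, 0.05, 0.4, 0.0, 0.7, 0.2, 0.3, 0.05, 0.6])
print(select_mask_positions(scores, budget=3, mode="topn"))
print(select_mask_positions(scores, budget=3, mode="rand"))
```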
2. Integration into Model Training Objectives
In masked language modeling (MLM), the selection of masked tokens determines which parts of the input the model is forced to reconstruct. Keyword-centric masking replaces the random mask selection in the standard loss
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta\!\left(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\right)$$
by focusing the masked set $\mathcal{M}$ on tokens scored highest by the selected keyword identification scheme.
For in-domain adaptation, masking indicators are sampled for positions $i \in \mathcal{K}$, where $\mathcal{K}$ is the auto-identified keyword set, typically under BERT's standard masking budget; the replacement policy defaults to BERT's (80% [MASK], 10% random, 10% unchanged) (Golchin et al., 2023). No changes are made to the optimizer or base architecture; only the mask selection logic is altered.
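Continuing in the same spirit, the sketch below applies BERT's 80/10/10 corruption to a set of keyword-selected positions; the helper `apply_bert_corruption`, the toy token ids, and the label convention are generic assumptions about how such a pipeline is typically wired, not the exact implementation of the cited work.

```python
import torch

def apply_bert_corruption(input_ids: torch.Tensor, mask_positions: torch.Tensor,
                          mask_token_id: int, vocab_size: int, generator=None):
    """Corrupt keyword-selected positions with BERT's 80/10/10 policy.

    Returns (corrupted_ids, labels) where labels are -100 everywhere except
    the masked positions, matching the usual MLM loss convention.
    """
    corrupted = input_ids.clone()
    labels = torch.full_like(input_ids, -100)
    labels[mask_positions] = input_ids[mask_positions]

    roll = torch.rand(mask_positions.shape, generator=generator)
    mask_as_mask = mask_positions[roll < 0.8]                     # 80% -> [MASK]
    mask_as_rand = mask_positions[(roll >= 0.8) & (roll < 0.9)]   # 10% -> random token
    # remaining 10% are left unchanged

    corrupted[mask_as_mask] = mask_token_id
    corrupted[mask_as_rand] = torch.randint(
        vocab_size, (mask_as_rand.numel(),), generator=generator)
    return corrupted, labels

# Example with toy ids: positions 1 and 4 were chosen as keywords.
ids = torch.tensor([101, 2034, 2003, 1037, 7099, 102])
corrupted, labels = apply_bert_corruption(
    ids, torch.tensor([1, 4]), mask_token_id=103, vocab_size=30522)
```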
In fine-tuning for robust classification, auxiliary losses such as masked keyword reconstruction (MKR) and masked entropy regularization (MER) are used:
$$\mathcal{L}_{\mathrm{MKR}} = -\sum_{i \in \mathcal{K}} \log p_\theta\!\left(x_i \mid \tilde{\mathbf{x}}\right), \qquad \mathcal{L}_{\mathrm{MER}} = D_{\mathrm{KL}}\!\left(\mathcal{U}(y) \,\middle\|\, p_\theta\!\left(y \mid \hat{\mathbf{x}}\right)\right),$$
where $\mathcal{K}$ and $\tilde{\mathbf{x}}$ are the masked keyword positions and the keyword-masked input variant, and $\hat{\mathbf{x}}$ is the low-context (keyword-only) variant. The final loss is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{MKR}} \mathcal{L}_{\mathrm{MKR}} + \lambda_{\mathrm{MER}} \mathcal{L}_{\mathrm{MER}}$ (Moon et al., 2020).
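A compact PyTorch reading of these two regularizers, assuming a classifier that exposes per-token reconstruction logits and sequence-level class logits; the KL-to-uniform form of MER and the weighting coefficients are illustrative choices, not values taken from the cited paper.

```python
import torch
import torch.nn.functional as F

def mkr_loss(token_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Masked keyword reconstruction: cross-entropy only at keyword positions.

    token_logits: (batch, seq_len, vocab_size) from the keyword-masked input.
    labels:       (batch, seq_len), original ids at masked keywords, -100 elsewhere.
    """
    return F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)), labels.view(-1),
        ignore_index=-100)

def mer_loss(class_logits: torch.Tensor) -> torch.Tensor:
    """Masked entropy regularization: push predictions on the low-context
    (keyword-only) input toward the uniform distribution over classes."""
    log_probs = F.log_softmax(class_logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / class_logits.size(-1))
    return F.kl_div(log_probs, uniform, reduction="batchmean")

def total_loss(ce, mkr, mer, lam_mkr=0.1, lam_mer=0.001):
    """Final objective: standard classification loss plus the two regularizers."""
    return ce + lam_mkr * mkr + lam_mer * mer
```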
Policy- and game-theoretic variants cast mask selection as a Markov decision process (MDP): a policy network assigns probability mass over token positions, reward is computed from downstream validation performance, and reinforcement-learning algorithms update the mask generation policy accordingly (Kang et al., 2020).
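The control flow of such a policy-gradient loop can be sketched as follows; the toy scoring network, REINFORCE-style update, and abstract `reward_fn` are placeholders for exposition and do not reproduce the published Neural Mask Generator algorithm.

```python
import torch
import torch.nn as nn

class MaskPolicy(nn.Module):
    """Toy policy: scores each token embedding and samples mask positions."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, token_embeddings: torch.Tensor, budget: int):
        # token_embeddings: (seq_len, hidden_dim)
        logits = self.scorer(token_embeddings).squeeze(-1)      # (seq_len,)
        dist = torch.distributions.Categorical(logits=logits)
        positions = dist.sample((budget,))                      # sampled mask set
        log_prob = dist.log_prob(positions).sum()
        return positions, log_prob

def reinforce_step(policy, optimizer, token_embeddings, budget, reward_fn):
    """One REINFORCE update: reward is the downstream validation performance
    obtained after adapting with the sampled mask (abstracted as reward_fn)."""
    positions, log_prob = policy(token_embeddings, budget)
    reward = reward_fn(positions)            # e.g., dev-set F1 of the adapted model
    loss = -reward * log_prob                # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```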
3. Empirical Performance and Applications
Keyword-centric masking has shown consistent empirical gains across modalities:
- LLM domain adaptation: Selective masking based on TF–IDF or KeyBERT outperforms random masking in continual pre-training on domain-specific corpora. On LegalGLUE, TF–IDF–Rand masking yields up to +2.5 m-F1 over random masking for BERT and LegalBERT (Belfathi et al., 19 Feb 2024). On IMDB and Amazon reviews, keyword masking improves accuracy by up to +0.8 pp and macro-F1 by +0.9 pp, with statistical significance (Golchin et al., 2023).
- Text classification under domain shift and OOD: MASKER improves OOD AUROC by 2–14 points across numerous benchmark shifts, often reducing cross-domain generalization gaps by 40–60% compared to vanilla BERT fine-tuning, with no degradation in in-domain accuracy (Moon et al., 2020).
- Self-supervised RL adaptation: Neural Mask Generator delivers small but consistent gains (e.g., +0.37 F1 on NewsQA, +0.66 F1 on emrQA), outperforming entity- and span-based maskers by targeting “answer-bearing” tokens (Kang et al., 2020).
- Speech enhancement and ASR: In the context of speaker-selective beamforming, DNNs trained to estimate time–frequency masks focusing on “wakeup” keywords significantly improve beamformer spatial filter estimation, reducing ASR CER by 9.7–23.6% on real audio compared to standard blind beamforming (Kida et al., 2018).
4. Limitations and Computational Considerations
Keyword-centric masking methods are subject to several constraints:
- Keyword identification: Supervised strategies (e.g., attention-based) require a trained or partially trained model, while unsupervised strategies (e.g., TF–IDF, KeyBERT) depend on reliable corpus-level statistics and may not generalize to out-of-domain data without adjustment (Golchin et al., 2023; Belfathi et al., 19 Feb 2024).
- Overfitting and mask diversity: Static mask sets risk overfitting. Randomized (“Rand”) sampling per epoch mitigates this but may reduce masking focus (Belfathi et al., 19 Feb 2024).
- Computation: Keyword extraction (e.g., KeyBERT) incurs up to 15% pre-training time overhead for BERT-Large but diminishes with longer training schedules (Golchin et al., 2023). RL-based approaches are computationally intensive due to environment sampling and replay (Kang et al., 2020).
- Data constraints: Methods relying on genre- or domain-homogeneous corpora for statistics may underperform in heterogeneous or resource-poor regimes (Belfathi et al., 19 Feb 2024).
5. Analysis, Variations, and Future Directions
Empirical analyses reveal several key points:
- Calibration and decision confidence: Keyword-centric masking (e.g., MASKER) improves OOD calibration, sharply delineating in- and out-domain examples in the embedding space and lowering overconfidence on low-context or nonsensical inputs (Moon et al., 2020).
- POS and semantic focus: RL/policy-based masking strongly favors content words—nouns, proper nouns, and verbs—over stop-words, resulting in more discriminative or answer-bearing masking (Kang et al., 2020).
- Strategy variants: Both “TopN” and “Rand” selection schemes have merit; TopN concentrates the loss on consistently high-scoring tokens, while Rand yields broader coverage and regularizes mask learning (Belfathi et al., 19 Feb 2024).
- Integration with external signals: Extensions could involve hybrid masking strategies using external gazetteers, entity detection, or dynamic loss shaping as training proceeds, as suggested in legal and scientific domain adaptation settings (Belfathi et al., 19 Feb 2024).
- End-to-end learning: Moving from heuristic to learned mask scoring—via small auxiliary networks or meta-learning—remains a plausible direction and is suggested as a way to further optimize which tokens are treated as “hard” during adaptation (Belfathi et al., 19 Feb 2024).
6. Cross-modal and Application-specific Instances
Keyword-centric masking is not restricted to language modeling:
- Speech enhancement: By temporally aligning mask estimation to externally detected “wakeup” keywords, speaker beamformer systems avoid permutation ambiguity in source separation, yielding superior estimation of spatial covariance and robust target speech extraction (Kida et al., 2018); a beamforming sketch follows this list.
- Sequence classification with low-context regularization: Masking that targets keywords (and, symmetrically, removes non-keywords at high rates) regularizes model confidence and trains the classifier to express uncertainty when presented with insufficient input (Moon et al., 2020).
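To make the speech instance concrete, the following numpy sketch shows mask-driven beamforming: a keyword-aligned time–frequency mask weights the spatial covariance estimates, from which an MVDR filter is formed per frequency bin. The array shapes, the principal-eigenvector steering estimate, and the diagonal loading are generic assumptions, not the specific system of the cited work.

```python
import numpy as np

def mvdr_from_mask(stft: np.ndarray, speech_mask: np.ndarray) -> np.ndarray:
    """Mask-driven MVDR beamforming.

    stft:        (freq, time, mics) complex multichannel spectrogram.
    speech_mask: (freq, time) values in [0, 1], high where the wakeup keyword
                 (and hence the target speaker) is active.
    Returns the beamformed single-channel spectrogram of shape (freq, time).
    """
    noise_mask = 1.0 - speech_mask
    n_freq, n_time, n_mics = stft.shape
    out = np.zeros((n_freq, n_time), dtype=complex)
    for f in range(n_freq):
        X = stft[f]                                   # (time, mics)
        # Mask-weighted spatial covariance matrices for speech and noise.
        Phi_s = (X.T * speech_mask[f]) @ X.conj() / max(speech_mask[f].sum(), 1e-8)
        Phi_n = (X.T * noise_mask[f]) @ X.conj() / max(noise_mask[f].sum(), 1e-8)
        Phi_n += 1e-6 * np.eye(n_mics)                # diagonal loading
        # Steering vector: principal eigenvector of the speech covariance.
        eigvals, eigvecs = np.linalg.eigh(Phi_s)
        d = eigvecs[:, -1]
        w = np.linalg.solve(Phi_n, d)
        w /= (d.conj() @ w)                           # MVDR normalization
        out[f] = X @ w.conj()                         # y(t) = w^H x(t)
    return out
```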
7. Summary Table: Methodological Landscape
| Domain | Masking Strategy | Mask Score Type | Outcome/Metric |
|---|---|---|---|
| NLP (Adaptation) | TF–IDF / KeyBERT (TopN/Rand) | Corpus statistics/semantic | +0.5–2.5 m-F1 (LegalGLUE) |
| NLP (Classification) | MASKER, OOD-focused | Attention/class stats | +2–14 OOD AUROC points |
| LM Training | RL policy (NMG) | Reward via task metric | +0.4–0.7 F1 (QA) |
| Speech | DNN-masked time–freq filtering | Keyword vs. background | 9.7–23.6% relative CER reduction (ASR) |
Keyword-centric masking thus comprises a diverse yet unified set of approaches that structurally re-weight or focus masking for improved domain adaptation, robustness, or separation quality, with well-characterized empirical and methodological trade-offs.