Segment-Based Masking with Morpheme Boundaries

Updated 6 April 2026

The paper demonstrates that incorporating morpheme boundaries in segment-based masking improves language modeling accuracy compared to standard character-level masking.
It employs a self-supervised Transformer model that masks 25% of segments, aligning with morpheme boundaries to enhance reconstruction of linguistically meaningful subword units.
Empirical results show a consistent +0.58 percentage point gain in low-resource settings, highlighting its practical utility for morphologically rich languages.

Segment-based masking with morpheme boundaries is a self-supervised learning approach that leverages explicit morphological structure in token-level modeling tasks. Unlike uniform character-level masking, segment-based masking exploits available morpheme boundary annotations to mask and reconstruct contiguous morpheme-aligned spans, thus injecting inductive bias toward linguistically meaningful subword units. This approach is empirically validated to yield small but consistent improvements in low-resource morphological inflection and segmentation, providing a principled mechanism for modeling morphologically complex languages in data-scarce scenarios (Wiemerslage et al., 5 Jun 2025).

1. Formal Definition of Segment-Based Masking with Morpheme Boundaries

Let $x = x_1 x_2 \ldots x_n$ be a sequence of characters, and let $B = \{b_0 = 0, b_1, \ldots, b_k = n\}$ denote the indices of gold morpheme boundaries, which define $k$ morpheme segments $S_i = x_{b_{i-1}+1} \ldots x_{b_i}$ for $i = 1, \ldots, k$. In segment-based masking, a subset $M \subseteq \{1, \ldots, k\}$ of exactly $\lceil 0.25 \cdot k\rceil$ segments is selected uniformly at random. For each segment $S_i$ with $i \in M$, each constituent character is replaced according to the standard CMLM masking recipe: with 80% probability by a [MASK] symbol, with 10% probability by a random character, and with 10% probability it is left unchanged.
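As a hypothetical illustration of the notation (the word and its segmentation are chosen here for convenience and are not taken from the paper), consider $x = \text{walkers}$ with $n = 7$ and surface segmentation walk|er|s:

$B = \{0, 4, 6, 7\}, \quad k = 3, \quad S_1 = \text{walk}, \; S_2 = \text{er}, \; S_3 = \text{s}, \quad |M| = \lceil 0.25 \cdot 3 \rceil = 1$

Exactly one of the three segments is therefore selected. If $M = \{2\}$, only the characters of "er" (positions 5 and 6) are subject to the 80/10/10 replacement, and all other characters are passed through unchanged.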

The model is trained to reconstruct the masked characters, with loss

$L_{\rm CMLM} = -\sum_{i \in M}\sum_{j = b_{i-1}+1}^{b_i} \log P(x_j \mid x_{-M})$

where $x_{-M}$ is the input with all characters in the masked segments replaced as described. This formulation differs from standard character-level masking by restricting mask spans to coincide strictly with morpheme boundaries, encoding morphological structure in the masking process (Wiemerslage et al., 5 Jun 2025).
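The loss above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `cmlm_loss` and the `position_probs` argument (the probability a model assigns to the gold character at each position) are assumptions introduced here, not part of the paper's code.

```python
import math

def cmlm_loss(position_probs, boundaries, masked_segments):
    """Sum of -log P(x_j | x_{-M}) over all characters in masked segments.

    position_probs:  position_probs[j] is the (hypothetical) model probability
                     of the gold character at 0-based position j
    boundaries:      [b_0 = 0, b_1, ..., b_k = n]
    masked_segments: 1-based segment indices i in M
    """
    loss = 0.0
    for i in masked_segments:
        # 1-based positions b_{i-1}+1 .. b_i are the 0-based range [b_{i-1}, b_i).
        for j in range(boundaries[i - 1], boundaries[i]):
            loss -= math.log(position_probs[j])
    return loss

# Illustrative call: a 7-character word, segment 2 masked, uniform 0.5 probabilities.
print(cmlm_loss([0.5] * 7, [0, 4, 6, 7], [2]))
```

Only positions inside masked segments contribute; unmasked characters are conditioning context and add nothing to the sum.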

2. Algorithmic Procedures and Model Integration

The segment-based masking procedure for each word with annotated morpheme boundaries proceeds as follows:

Parse Input: Identify segments $S_1, \ldots, S_k$ from boundary indices $B$.
Sample Masked Segments: Compute mask count $|M| = \lceil 0.25 \cdot k\rceil$ and choose $|M|$ segments uniformly at random.
Apply Masking: For each position $j$ lying within a selected segment:
- 80% chance to replace $x_j$ with [MASK]
- 10% with a random character
- 10% unchanged
Model Ingestion: Input the masked sequence (optionally prepended with a special supervision tag) into an encoder-decoder Transformer (4 layers, 256-dim embeddings, 1024-d FFN, 4 heads).
Loss Computation: Compute the character-level masked language modeling loss as above; sum with any supervised inflection loss in multitask settings.
Optimization: Gradients from both supervised and self-supervised objectives are summed before parameter updates.

No curriculum beyond the fixed 25% mask-ratio is employed. Supervised and self-supervised examples are mixed within each minibatch (Wiemerslage et al., 5 Jun 2025).
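The parsing, sampling, and masking steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `segment_mask`, the `alphabet` argument used for random replacements, and the returned list of target positions are all introduced here for illustration.

```python
import math
import random

MASK = "[MASK]"

def segment_mask(chars, boundaries, alphabet, mask_ratio=0.25, rng=None):
    """Mask whole morpheme segments of one word.

    chars:      list of characters x_1 .. x_n
    boundaries: sorted indices [b_0 = 0, b_1, ..., b_k = n]
    alphabet:   characters to draw random replacements from (assumption)
    Returns (masked token list, sorted 0-based target positions).
    """
    rng = rng or random.Random()
    k = len(boundaries) - 1
    if k < 1 or boundaries[0] != 0 or boundaries[-1] != len(chars):
        raise ValueError("boundaries must run from 0 to len(chars)")

    # Sample ceil(mask_ratio * k) segments uniformly, without replacement.
    n_masked = math.ceil(mask_ratio * k)
    chosen = rng.sample(range(k), n_masked)

    masked = list(chars)
    targets = []
    for i in chosen:
        # 0-based slice [b_i, b_{i+1}) is segment i+1 in the 1-based notation.
        for j in range(boundaries[i], boundaries[i + 1]):
            targets.append(j)
            r = rng.random()
            if r < 0.8:
                masked[j] = MASK                  # 80%: mask symbol
            elif r < 0.9:
                masked[j] = rng.choice(alphabet)  # 10%: random character
            # remaining 10%: leave the character unchanged
    return masked, sorted(targets)

# Illustrative call on a hypothetical segmentation walk|er|s.
masked, targets = segment_mask(list("walkers"), [0, 4, 6, 7],
                               "abcdefghijklmnopqrstuvwxyz",
                               rng=random.Random(0))
print(masked, targets)
```

The target positions are returned alongside the masked sequence so that the reconstruction loss can be restricted to characters inside the selected segments, including those left unchanged by the 10% rule.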

3. Implementation and Practical Considerations

Boundary Annotation Source: For English, Hungarian, Italian, Russian, and Spanish, surface-segmented morpheme boundaries are derived from SIGMORPHON 2022 canonical-segmentation via Levenshtein alignment, implemented in Pynini.
Model Architecture: The technique is instantiated within a Transformer sequence-to-sequence model sharing parameters across masked language modeling and inflection objectives.
Resource Adaptation: Segment-based masking is applied exclusively to languages with available gold boundaries; in other cases, masking defaults to standard character-level sampling.
Low-Resource Suitability: Best applied when (a) gold or high-quality predicted morpheme boundaries exist and (b) the available corpus contains at least 5,000 unique types. Under extreme data scarcity, vanilla autoencoding objectives tend to outperform segment-based masking because they induce a stronger copy bias (Wiemerslage et al., 5 Jun 2025).
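The guidance in this section can be summarized as a small decision helper. The sketch below is hypothetical: the function name, the returned labels, and in particular the reading of "fewer than 5,000 unique types" as the point at which autoencoding is preferred are assumptions made here for illustration, not rules stated by the paper.

```python
def choose_objective(has_boundaries, n_unique_types, min_types=5000):
    """Pick a self-supervised objective from the available resources.

    has_boundaries: True if gold or high-quality predicted morpheme
                    boundaries are available for the language
    n_unique_types: number of unique word types in the unlabeled corpus
    min_types:      assumed cut-off below which data counts as very scarce
    """
    if n_unique_types < min_types:
        return "autoencoding"   # stronger copy bias under extreme scarcity
    if has_boundaries:
        return "cmlm-seg"       # segment-based masking
    return "cmlm-char"          # fall back to character-level masking

print(choose_objective(True, 8000))
print(choose_objective(False, 8000))
print(choose_objective(True, 1200))
```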

4. Empirical Results and Comparative Performance

The method achieves small but consistent improvements over character-level masking on morphological inflection. On the five languages with reliable segmentation, mean accuracy improves from 69.98% (CMLM-iid) to 70.56% (CMLM-seg-iid), a +0.58 pp absolute gain (Wiemerslage et al., 5 Jun 2025). The figure is an average over five languages and five random seeds; statistical significance tests are not reported, so the size of the effect should be interpreted with caution.
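To put the reported gain in perspective, the two mean accuracies can be turned into relative figures. The snippet below uses only the two numbers quoted above; the relative-gain and error-reduction measures are derived here for illustration and do not appear in the source.

```python
baseline = 69.98   # mean accuracy of CMLM-iid, in percent
segment = 70.56    # mean accuracy of CMLM-seg-iid, in percent

abs_gain = segment - baseline                       # percentage points
rel_gain = abs_gain / baseline * 100                # relative to baseline accuracy
err_reduction = abs_gain / (100 - baseline) * 100   # share of baseline errors removed

print(f"absolute gain:   {abs_gain:.2f} pp")        # 0.58 pp
print(f"relative gain:   {rel_gain:.2f} %")         # about 0.83 %
print(f"error reduction: {err_reduction:.2f} %")    # about 1.93 %
```

In other words, the improvement removes roughly 2% of the baseline's errors, which is consistent with describing it as small.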

By contrast, adaptation of segment masking to T5-style span denoising models does not yield further gains, and can underperform relative to character-level baselines. These results indicate the benefit of segment-based masking is currently clearest in CMLM within an encoder-decoder setup under multitask training with self-supervised and supervised examples (Wiemerslage et al., 5 Jun 2025).

5. Inductive Biases, Trade-offs, and Limitations

Inductive Bias: The approach implements a moderate inductive bias toward morpheme-aligned spans, which is beneficial in languages characterized by rich concatenative suffixation, provided a sufficiently diverse set of masked segments is observed during training.
Data Sufficiency: Too little unlabelled data can lead to overfitting on a small set of segment types.
Boundary Supervision: Effectiveness presupposes supervised or high-quality predicted segmentation. Model performance in the absence of such supervision defaults to character-level CMLM, or relies upon unsupervised morpheme boundary induction (such as span-masked transformer models or unsupervised tree tokenizers) (Wiemerslage et al., 5 Jun 2025, Downey et al., 2021, Zhu et al., 2024).
Transition to Fully Unsupervised Masking: Future developments may substitute gold boundaries with boundaries predicted by models such as Masked Segmental Language Models or Morphological Tree Tokenizers (Downey et al., 2021, Zhu et al., 2024), potentially enabling fully unsupervised segment-based masking schedules.
Best Practices: Segment masking is most effective when gold boundaries are reliably available and the training corpus is of moderate size; in ultra-low-resource regimes, conventional auto-encoding should be preferred for its stronger copy bias (Wiemerslage et al., 5 Jun 2025).

6. Relation to Unsupervised Segmentation and Morphological Modeling

Segment-based masking belongs to a broader class of approaches that exploit segmental structure in language modeling. Unsupervised segmental models include Masked Segmental Language Models, which use span-masked transformer encoders and dynamic programming for segmentation (Downey et al., 2021), and Morphological Tree Tokenizers, which induce trees with a mechanism (MorphOverriding) that explicitly prevents morphemes from being decomposed (Zhu et al., 2024). Unlike these models, segment-based masking requires gold or high-quality predicted boundaries. Empirical findings show that methods emphasizing morpheme boundary preservation outperform standard tokenizers (e.g., BPE, WordPiece) on segmentation and language modeling metrics (Zhu et al., 2024), which supports the view that aligning masking or segmentation units with linguistic morphology is important for both interpretability and performance in morphologically complex languages.

A plausible implication is that future segment-based masking variants may benefit from tighter integration with unsupervised or semi-supervised segmentation methodologies, yielding enhanced morphological generalization even in the absence of large-scale annotated resources.