
AntLM LTG-BERT: Alternating CLM & MLM

  • The paper introduces an alternating training regime that leverages CLM for rapid syntactic convergence and MLM for robust semantic learning.
  • The methodology utilizes epoch-wise alternation with causal masks in initial/final phases and bidirectional masks during core training to enhance performance.
  • Experimental results from the BabyLM Challenge 2024 show a +2.2 macro-averaged improvement over baseline, confirming the efficacy of the hybrid objective approach.

AntLM$_\text{LTG-BERT}$ is a language modeling paradigm designed to synthesize the strengths of both Causal Language Modeling (CLM) and Masked Language Modeling (MLM) within the LTG-BERT encoder-only architecture. In this framework, model parameters and architecture remain fixed, while the training objective and attention mask are alternated epoch-wise to leverage rapid syntactic convergence from CLM and robust semantic learning from MLM. AntLM$_\text{LTG-BERT}$ was evaluated in the context of the BabyLM Challenge 2024, demonstrating superior macro-averaged performance compared to pure CLM or MLM approaches and confirming the practical benefits of mixing autoregressive and bidirectional objectives (Yu et al., 4 Dec 2024).

1. Architectural Foundations

AntLM$_\text{LTG-BERT}$ builds on LTG-BERT, a data-efficient, encoder-only Transformer optimized for linguistic benchmarks and trained on the strictly curated British National Corpus (BNC) (Samuel et al., 2023). Key architectural features are:

  • Transformer Encoder Stack: 12 layers, each with hidden size 768 and 12 attention heads.
  • Feed-forward Blocks: Intermediate size is 2048, with NormFormer layer normalization and GLU/GEGLU activations for improved training stability and nonlinearity.
  • Relative Position Embeddings: Disentangled position embeddings bucketed into 32 groups, following DeBERTa style for robust language structure representation.
  • Weight Initialization and Scaling: Initialization as $\mathcal{N}(0, 1/\sqrt{d})$; feed-forward layer scaling by $1/\sqrt{l}$.

AntLM retains all LTG-BERT parameters and components. The only modification is the alternation between causal and bidirectional attention masks corresponding to CLM and MLM, respectively.
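
For reference, the sketch below collects these hyperparameters into a configuration object and applies the stated initialization rule. The `LTGBertConfig` dataclass, the `init_linear` helper, and the reading of $l$ as the 1-based layer index are illustrative assumptions, not the released LTG-BERT code.

```python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class LTGBertConfig:
    """Hyperparameters listed above (the dataclass itself is illustrative)."""
    num_layers: int = 12
    hidden_size: int = 768          # d
    num_heads: int = 12
    intermediate_size: int = 2048
    position_buckets: int = 32      # DeBERTa-style relative position buckets
    vocab_size: int = 16_384        # 16k subword vocabulary (see Section 4)


def init_linear(layer: nn.Linear, config: LTGBertConfig,
                ff_output: bool = False, layer_index: int = 1) -> None:
    """N(0, 1/sqrt(d)) initialization; FF output weights additionally scaled by 1/sqrt(l)."""
    nn.init.normal_(layer.weight, mean=0.0, std=1.0 / math.sqrt(config.hidden_size))
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
    if ff_output:
        with torch.no_grad():
            layer.weight.mul_(1.0 / math.sqrt(layer_index))
```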

2. Training Objectives and Losses

The alternated regime fundamental to AntLM$_\text{LTG-BERT}$ is realized as follows:

  • Causal LM (CLM): Predict next tokens in left-to-right order without input masking and with a lower-triangular attention mask.

$$L_{CLM}(\theta) = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{1:t-1})$$

  • Masked LM (MLM): Predict only masked tokens (15% selected, 80% replaced by [MASK], 10% replaced randomly, 10% unchanged), under full bidirectional attention.

$$L_{MLM}(\theta) = - \sum_{i \in M} \log P_\theta(x_i \mid \hat{y})$$

where $M$ is the set of masked positions and $\hat{y}$ is the corrupted input.

  • Combined Loss Schedule:

Alternation is epoch-wise with:

$$L_{AntLM}(\theta; t) = \alpha(t)\, L_{CLM}(\theta) + (1 - \alpha(t))\, L_{MLM}(\theta)$$

Optimal $\alpha(t)$ schedule:

  • Epochs 1–6: CLM
  • Epochs 7–66: MLM
  • Epochs 67–72: CLM

This strategy eschews a per-step mixture in favor of strict epoch alternation, leveraging fast structural learning of CLM at the boundaries and the context sensitivity of MLM in the core training period (Yu et al., 4 Dec 2024).
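
A minimal PyTorch sketch of this epoch-wise regime is given below. The `model(tokens, causal=...)` interface, the mask-token id, and the vocabulary constant are placeholders assumed for illustration, not the released implementation's API.

```python
import torch
import torch.nn.functional as F

MASK_ID, VOCAB_SIZE, IGNORE_INDEX = 4, 16_384, -100  # illustrative constants


def alpha(epoch: int) -> float:
    """Epoch-wise schedule: CLM for epochs 1-6 and 67-72, MLM for epochs 7-66."""
    return 1.0 if epoch <= 6 or epoch >= 67 else 0.0


def clm_loss(model, tokens):
    """Next-token prediction under a causal attention mask, with no input corruption."""
    logits = model(tokens, causal=True)                                  # (B, T, V)
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB_SIZE),
                           tokens[:, 1:].reshape(-1))


def mlm_loss(model, tokens):
    """15% of positions selected; of those, 80% -> [MASK], 10% random token, 10% unchanged."""
    corrupted, labels = tokens.clone(), tokens.clone()
    selected = torch.rand(tokens.shape, device=tokens.device) < 0.15
    labels[~selected] = IGNORE_INDEX                                     # loss only on selected positions
    roll = torch.rand(tokens.shape, device=tokens.device)
    corrupted[selected & (roll < 0.8)] = MASK_ID
    swap = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[swap] = torch.randint(VOCAB_SIZE, tokens.shape, device=tokens.device)[swap]
    logits = model(corrupted, causal=False)                              # full bidirectional attention
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE),
                           labels.reshape(-1), ignore_index=IGNORE_INDEX)


def train_epoch(model, batches, optimizer, epoch: int):
    """With alpha(t) restricted to {0, 1}, exactly one objective is active per epoch."""
    use_clm = alpha(epoch) == 1.0
    for tokens in batches:                                               # tokens: LongTensor (B, T)
        loss = clm_loss(model, tokens) if use_clm else mlm_loss(model, tokens)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```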

3. Attention Masking and Alternation Mechanism

Formally, the two mask types are:

  • Causal Mask $M^C$: $M^C_{i,j} = 0$ if $j \leq i$, $-\infty$ otherwise; restricts each token's attention to itself and its past.
  • Bidirectional Mask $M^B$: $M^B_{i,j} = 0$ for all $(i,j)$; allows full attention across all tokens.
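
A short sketch of the two masks as additive attention biases (the helper names are illustrative):

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """M^C: 0 where j <= i, -inf elsewhere, i.e. each token attends only to itself and the past."""
    full = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(full, diagonal=1)       # zeros on and below the diagonal


def bidirectional_mask(seq_len: int) -> torch.Tensor:
    """M^B: all zeros, i.e. unrestricted attention over the whole sequence."""
    return torch.zeros(seq_len, seq_len)
```

Because the biases are simply added to the attention logits before the softmax, switching between CLM and MLM epochs amounts to swapping which mask is passed to the otherwise unchanged encoder.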

The alternation is performed at the epoch level, not per batch:

| Phase | Epochs | Attention Mask | Input Masking | Objective |
|---|---|---|---|---|
| Initial CLM | 1–6 | Causal ($M^C$) | None | CLM |
| MLM Core | 7–66 | Bidirectional ($M^B$) | 15% masked | MLM |
| Final CLM | 67–72 | Causal ($M^C$) | None | CLM |

This approach utilizes autoregressive learning to rapidly encode syntactic/structural information at the start and end, with bidirectional context modeling dominating the main training phase, leading to improved downstream performance (Yu et al., 4 Dec 2024).

4. Training Setup and Evaluation

Experiments were conducted on the BabyLM 2024 “strict-small” track (≈10M words, BootBERT pipeline) with the following setup:

  • Vocabulary Size: 16k subwords (LTG-BERT tokenizer)
  • Batch Size: 1024 sequences
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay = 0.01)
  • Precision: bfloat16
  • Learning Rate: $5 \times 10^{-4}$ initial, cosine schedule with restarts
  • Epochs: 72 (6 CLM + 60 MLM + 6 CLM)
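
A hedged configuration sketch of this setup in PyTorch follows; the stand-in `model` and the restart period `T_0` are assumptions, since only the initial learning rate and the schedule family are specified above.

```python
import torch

# Stand-in module; in practice this would be the LTG-BERT encoder.
model = torch.nn.Linear(768, 16_384)

# Optimizer settings listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,                      # 5e-4 initial learning rate
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

# One concrete realization of "cosine with restarts"; T_0 (steps per cycle) is assumed.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10_000)

# bfloat16 mixed precision for the forward/backward passes.
autocast_ctx = torch.autocast(device_type="cuda", dtype=torch.bfloat16)
```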

Evaluation suites include BLiMP (syntactic minimal pair judgments), BLiMP_Supp, EWoK, and GLUE. Macro-averaged scores and ablation results illustrate the superiority of the alternated regime:

| Model | BLiMP | BLiMP_Supp | EWoK | GLUE | Macro-Avg |
|---|---|---|---|---|---|
| LTG-BERT (baseline) | 62.6 | 65.4 | 62.3 | 64.9 | 63.8 |
| AntLM$_\text{LTG-BERT}$ (6+60+6) | 72.3 | 62.6 | 63.0 | 66.0 | 66.0 |

The largest gain is observed on BLiMP (+9.7), with a macro-average improvement of +2.2 over the baseline, confirming that the gains stem from the alternated training objective rather than from any architectural change (Yu et al., 4 Dec 2024).
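
As a quick sanity check, the macro-averages in the table follow directly from the four per-suite scores:

```python
# Per-suite scores from the table above: BLiMP, BLiMP_Supp, EWoK, GLUE.
baseline = [62.6, 65.4, 62.3, 64.9]
antlm = [72.3, 62.6, 63.0, 66.0]

def macro_avg(scores):
    return round(sum(scores) / len(scores), 1)

print(macro_avg(baseline))                                # 63.8
print(macro_avg(antlm))                                   # 66.0
print(round(macro_avg(antlm) - macro_avg(baseline), 1))   # 2.2
```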

5. Ablation Studies and Analytical Insights

Comparative ablation demonstrates critical distinctions:

| Training Mode | BLiMP | BLiMP_Supp | EWoK | Avg |
|---|---|---|---|---|
| 12_CLM only | 69.9 | 56.4 | 50.8 | 59.0 |
| 60_MLM only | 62.8 | 63.5 | 64.2 | 63.5 |
| 72_CLM only | 70.0 | 57.2 | 51.9 | 57.9 |
| 72_MLM only | 69.4 | 61.1 | 64.5 | 65.0 |
| 6_CLM+60_MLM+6_CLM | 72.3 | 62.6 | 63.0 | 66.0 |

Key findings:

  • CLM converges rapidly per epoch (predicts all tokens), advantageous for structural/syntactic representation.
  • MLM, updating only masked tokens, better captures distributed contextual semantics.
  • Placing CLM objectives at both boundaries of training is essential for maximizing both syntactic and sequential predictive capabilities.
  • The epoch-wise alternation yields superior macro-averaged benchmark scores relative to pure regimes (Yu et al., 4 Dec 2024).

6. Broader Implications and Future Research

Application of AntLM's alternating regime is not limited to LTG-BERT. The paradigm is compatible with encoder-only LMs such as RoBERTa, DeBERTa, and SpanBERT, and potentially with encoder-decoder hybrids (e.g., T5) by flexible switching of attention masks and objectives. Further, finer-grained alternation strategies (e.g., batch-level or adaptive $\alpha(t)$) are plausible routes for future exploration. Extension to multilingual or multimodal contexts may similarly benefit from objective alternation. The synergy between CLM and MLM indicates that no singular objective is optimal for all linguistic phenomena; thus, hybrid approaches are emerging as a prominent method for data-efficient and robust pretraining (Yu et al., 4 Dec 2024).
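
As one hedged illustration of such a finer-grained strategy (not a method evaluated in the paper), a batch-level schedule could be as simple as the sketch below; the function name and the `clm_ratio` parameter are hypothetical.

```python
def batch_level_objective(step: int, clm_ratio: float = 0.1) -> str:
    """Hypothetical batch-level alternation: roughly `clm_ratio` of batches use CLM."""
    period = max(1, round(1 / clm_ratio))
    return "clm" if step % period == 0 else "mlm"


# With clm_ratio=0.1, every 10th batch uses the CLM objective (causal mask,
# no input corruption) and the remaining batches use MLM.
print([batch_level_objective(s) for s in range(12)])
# ['clm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'clm', 'mlm']
```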
