
AntLM LTG-BERT: Alternating CLM & MLM

  • The paper introduces an alternating training regime that leverages CLM for rapid syntactic convergence and MLM for robust semantic learning.
  • The methodology utilizes epoch-wise alternation with causal masks in initial/final phases and bidirectional masks during core training to enhance performance.
  • Experimental results from the BabyLM Challenge 2024 show a +2.2 macro-averaged improvement over baseline, confirming the efficacy of the hybrid objective approach.

AntLM$_\text{LTG-BERT}$ is a language modeling paradigm designed to synthesize the strengths of both Causal Language Modeling (CLM) and Masked Language Modeling (MLM) within the LTG-BERT encoder-only architecture. In this framework, model parameters and architecture remain fixed, while the training objective and attention mask are alternated epoch-wise to leverage rapid syntactic convergence from CLM and robust semantic learning from MLM. AntLM$_\text{LTG-BERT}$ was evaluated in the context of the BabyLM Challenge 2024, demonstrating superior macro-averaged performance compared to pure CLM or MLM approaches and confirming the practical benefits of mixing autoregressive and bidirectional objectives (Yu et al., 4 Dec 2024).

1. Architectural Foundations

AntLM$_\text{LTG-BERT}$ builds on LTG-BERT, a data-efficient, encoder-only Transformer optimized for linguistic benchmarks and trained on the strictly curated British National Corpus (BNC) (Samuel et al., 2023). Key architectural features are:

  • Transformer Encoder Stack: 12 layers, each with hidden size 768 and 12 attention heads.
  • Feed-forward Blocks: Intermediate size is 2048, with NormFormer layer normalization and GLU/GEGLU activations for improved training stability and nonlinearity.
  • Relative Position Embeddings: Disentangled position embeddings bucketed into 32 groups, following DeBERTa style for robust language structure representation.
  • Weight Initialization and Scaling: Initialization as $\mathcal{N}(0, 1/\sqrt{d})$; feed-forward layer scaling by $1/\sqrt{l}$.

AntLM retains all LTG-BERT parameters and components. The only modification is the alternation between causal and bidirectional attention masks corresponding to CLM and MLM, respectively.
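
For reference, the sketch below collects these hyperparameters into a configuration object and applies the stated initialization rule. The `LTGBertConfig` dataclass, the `init_linear` helper, and the reading of $l$ as the 1-based layer index are illustrative assumptions, not the released LTG-BERT code.

```python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class LTGBertConfig:
    """Hyperparameters listed above (the dataclass itself is illustrative)."""
    num_layers: int = 12
    hidden_size: int = 768          # d
    num_heads: int = 12
    intermediate_size: int = 2048
    position_buckets: int = 32      # DeBERTa-style relative position buckets
    vocab_size: int = 16_384        # 16k subword vocabulary (see Section 4)


def init_linear(layer: nn.Linear, config: LTGBertConfig,
                ff_output: bool = False, layer_index: int = 1) -> None:
    """N(0, 1/sqrt(d)) initialization; FF output weights additionally scaled by 1/sqrt(l)."""
    nn.init.normal_(layer.weight, mean=0.0, std=1.0 / math.sqrt(config.hidden_size))
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
    if ff_output:
        with torch.no_grad():
            layer.weight.mul_(1.0 / math.sqrt(layer_index))
```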

2. Training Objectives and Losses

The alternated regime fundamental to AntLM$_\text{LTG-BERT}$ is realized as follows:

  • Causal LM (CLM): Predict next tokens in left-to-right order without input masking and with a lower-triangular attention mask.

$$L_{CLM}(\theta) = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{1:t-1})$$

  • Masked LM (MLM): Predict only masked tokens (15% selected, 80% replaced by [MASK], 10% replaced randomly, 10% unchanged), under full bidirectional attention.

$$L_{MLM}(\theta) = - \sum_{i \in M} \log P_\theta(x_i \mid \hat{y})$$

where $M$ is the set of masked positions and $\hat{y}$ is the corrupted input.

  • Combined Loss Schedule:

Alternation is epoch-wise with:

$$L_{AntLM}(\theta; t) = \alpha(t)\, L_{CLM}(\theta) + (1 - \alpha(t))\, L_{MLM}(\theta)$$

Optimal $\alpha(t)$ schedule:

  • Epochs 1–6: CLM
  • Epochs 7–66: MLM
  • Epochs 67–72: CLM

This strategy eschews a per-step mixture in favor of strict epoch alternation, leveraging fast structural learning of CLM at the boundaries and the context sensitivity of MLM in the core training period (Yu et al., 4 Dec 2024).
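
A minimal PyTorch sketch of this epoch-wise regime is given below. The `model(tokens, causal=...)` interface, the mask-token id, and the vocabulary constant are placeholders assumed for illustration, not the released implementation's API.

```python
import torch
import torch.nn.functional as F

MASK_ID, VOCAB_SIZE, IGNORE_INDEX = 4, 16_384, -100  # illustrative constants


def alpha(epoch: int) -> float:
    """Epoch-wise schedule: CLM for epochs 1-6 and 67-72, MLM for epochs 7-66."""
    return 1.0 if epoch <= 6 or epoch >= 67 else 0.0


def clm_loss(model, tokens):
    """Next-token prediction under a causal attention mask, with no input corruption."""
    logits = model(tokens, causal=True)                                  # (B, T, V)
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB_SIZE),
                           tokens[:, 1:].reshape(-1))


def mlm_loss(model, tokens):
    """15% of positions selected; of those, 80% -> [MASK], 10% random token, 10% unchanged."""
    corrupted, labels = tokens.clone(), tokens.clone()
    selected = torch.rand(tokens.shape, device=tokens.device) < 0.15
    labels[~selected] = IGNORE_INDEX                                     # loss only on selected positions
    roll = torch.rand(tokens.shape, device=tokens.device)
    corrupted[selected & (roll < 0.8)] = MASK_ID
    swap = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[swap] = torch.randint(VOCAB_SIZE, tokens.shape, device=tokens.device)[swap]
    logits = model(corrupted, causal=False)                              # full bidirectional attention
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE),
                           labels.reshape(-1), ignore_index=IGNORE_INDEX)


def train_epoch(model, batches, optimizer, epoch: int):
    """With alpha(t) restricted to {0, 1}, exactly one objective is active per epoch."""
    use_clm = alpha(epoch) == 1.0
    for tokens in batches:                                               # tokens: LongTensor (B, T)
        loss = clm_loss(model, tokens) if use_clm else mlm_loss(model, tokens)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```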

3. Attention Masking and Alternation Mechanism

Formally, the two mask types are:

  • Causal Mask $M^C$: $M^C_{i,j} = 0$ if $j \leq i$, $-\infty$ otherwise; restricts each token's attention to itself and its past.
  • Bidirectional Mask $M^B$: $M^B_{i,j} = 0$ for all $(i,j)$; allows full attention across all tokens.
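
A short sketch of the two masks as additive attention biases (the helper names are illustrative):

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """M^C: 0 where j <= i, -inf elsewhere, i.e. each token attends only to itself and the past."""
    full = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(full, diagonal=1)       # zeros on and below the diagonal


def bidirectional_mask(seq_len: int) -> torch.Tensor:
    """M^B: all zeros, i.e. unrestricted attention over the whole sequence."""
    return torch.zeros(seq_len, seq_len)
```

Because the biases are simply added to the attention logits before the softmax, switching between CLM and MLM epochs amounts to swapping which mask is passed to the otherwise unchanged encoder.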

The alternation is performed at the epoch level, not per batch:

| Phase | Epochs | Attention Mask | Input Masking | Objective |
|---|---|---|---|---|
| Initial CLM | 1–6 | Causal ($M^C$) | None | CLM |
| MLM Core | 7–66 | Bidirectional ($M^B$) | 15% masked | MLM |
| Final CLM | 67–72 | Causal ($M^C$) | None | CLM |

This approach utilizes autoregressive learning to rapidly encode syntactic/structural information at the start and end, with bidirectional context modeling dominating the main training phase, leading to improved downstream performance (Yu et al., 4 Dec 2024).

4. Training Setup and Evaluation

Experiments were conducted on the BabyLM 2024 “strict-small” track (≈10M words, BootBERT pipeline) with the following setup:

  • Vocabulary Size: 16k subwords (LTG-BERT tokenizer)
  • Batch Size: 1024 sequences
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay = 0.01)
  • Precision: bfloat16
  • Learning Rate: $5 \times 10^{-4}$ initial, cosine schedule with restarts
  • Epochs: 72 (6 CLM + 60 MLM + 6 CLM)
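
A hedged configuration sketch of this setup in PyTorch follows; the stand-in `model` and the restart period `T_0` are assumptions, since only the initial learning rate and the schedule family are specified above.

```python
import torch

# Stand-in module; in practice this would be the LTG-BERT encoder.
model = torch.nn.Linear(768, 16_384)

# Optimizer settings listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,                      # 5e-4 initial learning rate
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

# One concrete realization of "cosine with restarts"; T_0 (steps per cycle) is assumed.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10_000)

# bfloat16 mixed precision for the forward/backward passes.
autocast_ctx = torch.autocast(device_type="cuda", dtype=torch.bfloat16)
```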

Evaluation suites include BLiMP (syntactic minimal pair judgments), BLiMP_Supp, EWoK, and GLUE. Macro-averaged scores and ablation results illustrate the superiority of the alternated regime:

| Model | BLiMP | BLiMP_Supp | EWoK | GLUE | Macro-Avg |
|---|---|---|---|---|---|
| LTG-BERT (baseline) | 62.6 | 65.4 | 62.3 | 64.9 | 63.8 |
| AntLM$_\text{LTG-BERT}$ (6+60+6) | 72.3 | 62.6 | 63.0 | 66.0 | 66.0 |

The largest gain is observed on BLiMP (+9.7), with a macro-average improvement of +2.2 over the baseline, confirming that the gains stem from the alternated training objective rather than from any architectural change (Yu et al., 4 Dec 2024).
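
As a quick sanity check, the macro-averages in the table follow directly from the four per-suite scores:

```python
# Per-suite scores from the table above: BLiMP, BLiMP_Supp, EWoK, GLUE.
baseline = [62.6, 65.4, 62.3, 64.9]
antlm = [72.3, 62.6, 63.0, 66.0]

def macro_avg(scores):
    return round(sum(scores) / len(scores), 1)

print(macro_avg(baseline))                                # 63.8
print(macro_avg(antlm))                                   # 66.0
print(round(macro_avg(antlm) - macro_avg(baseline), 1))   # 2.2
```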

5. Ablation Studies and Analytical Insights

Comparative ablation demonstrates critical distinctions:

| Training Mode | BLiMP | BLiMP_Supp | EWoK | Avg |
|---|---|---|---|---|
| 12_CLM only | 69.9 | 56.4 | 50.8 | 59.0 |
| 60_MLM only | 62.8 | 63.5 | 64.2 | 63.5 |
| 72_CLM only | 70.0 | 57.2 | 51.9 | 57.9 |
| 72_MLM only | 69.4 | 61.1 | 64.5 | 65.0 |
| 6_CLM+60_MLM+6_CLM | 72.3 | 62.6 | 63.0 | 66.0 |

Key findings:

  • CLM converges rapidly per epoch (predicts all tokens), advantageous for structural/syntactic representation.
  • MLM, updating only masked tokens, better captures distributed contextual semantics.
  • Placing CLM objectives at both boundaries of training is essential for maximizing both syntactic and sequential predictive capabilities.
  • The epoch-wise alternation yields superior macro-averaged benchmark scores relative to pure regimes (Yu et al., 4 Dec 2024).

6. Broader Implications and Future Research

Application of AntLM's alternating regime is not limited to LTG-BERT. The paradigm is compatible with encoder-only LMs such as RoBERTa, DeBERTa, and SpanBERT, and potentially with encoder-decoder hybrids (e.g., T5) by flexible switching of attention masks and objectives. Further, finer-grained alternation strategies (e.g., batch-level or adaptive $\alpha(t)$) are plausible routes for future exploration. Extension to multilingual or multimodal contexts may similarly benefit from objective alternation. The synergy between CLM and MLM indicates that no singular objective is optimal for all linguistic phenomena; thus, hybrid approaches are emerging as a prominent method for data-efficient and robust pretraining (Yu et al., 4 Dec 2024).
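
As one hedged illustration of such a finer-grained strategy (not a method evaluated in the paper), a batch-level schedule could be as simple as the sketch below; the function name and the `clm_ratio` parameter are hypothetical.

```python
def batch_level_objective(step: int, clm_ratio: float = 0.1) -> str:
    """Hypothetical batch-level alternation: roughly `clm_ratio` of batches use CLM."""
    period = max(1, round(1 / clm_ratio))
    return "clm" if step % period == 0 else "mlm"


# With clm_ratio=0.1, every 10th batch uses the CLM objective (causal mask,
# no input corruption) and the remaining batches use MLM.
print([batch_level_objective(s) for s in range(12)])
# ['clm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'mlm', 'clm', 'mlm']
```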
