AntLM: Alternating CLM and MLM
- AntLM is a composite language modeling paradigm that alternates between Causal Language Modeling (CLM) and Masked Language Modeling (MLM), aiming to combine the fast convergence of CLM with the semantic gains of MLM.
- It employs an epoch-level alternation schedule with distinct attention masks, enabling seamless integration into standard Transformer architectures without modifications.
- Empirical evaluations on the BabyLM Challenge reveal balanced performance improvements across benchmarks using models like BabyLlama and LTG-BERT.
AntLM is a composite language modeling paradigm designed to bridge the strengths of Causal Language Modeling (CLM) and Masked Language Modeling (MLM) on Transformer architectures. Unlike approaches that mix CLM and MLM losses simultaneously, AntLM alternates them at the epoch level, toggling self-attention masks accordingly. Leveraging this alternation, AntLM aims to combine the rapid convergence characteristic of CLM with the semantic gains of MLM. This framework was evaluated on the BabyLM Challenge 2024 strict-small track (10M words of child-accessible text) using two architectures: BabyLlama (CLM, decoder-only) and LTG-BERT (MLM, encoder-only). Across various linguistic benchmarks, AntLM produced improved macro-average scores relative to single-paradigm baselines while remaining compatible as a drop-in modification for standard Transformer models (Yu et al., 4 Dec 2024).
1. Training Objectives
AntLM combines two canonical unsupervised objectives—CLM and MLM—by alternating them across contiguous blocks of training epochs. The formal definitions are as follows:
- CLM Loss: $\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$, with a lower-triangular (“causal”) attention mask restricting attention to preceding tokens only.
- MLM Loss: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid x_{\setminus \mathcal{M}})$, where $\mathcal{M}$ is a random 15% subset of token positions; selected tokens are replaced by [MASK] (80%), by a random token (10%), or left unchanged (10%). Bidirectional attention is fully enabled.
In AntLM, no additional scalar weighting is introduced; each epoch is assigned entirely to either CLM or MLM. The total objective is the sum over CLM and MLM epochs: $\mathcal{L}_{\text{AntLM}} = \sum_{e \in E_{\text{CLM}}} \mathcal{L}_{\text{CLM}}^{(e)} + \sum_{e \in E_{\text{MLM}}} \mathcal{L}_{\text{MLM}}^{(e)}$, where $E_{\text{CLM}}$ and $E_{\text{MLM}}$ denote the sets of epochs assigned to each objective.
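As a concrete illustration, the sketch below (a minimal PyTorch example; the `mask_token_id` and `vocab_size` arguments and all tensor names are illustrative, not taken from the paper) shows how CLM next-token targets and the 15% / 80-10-10 MLM corruption described above can be constructed for a batch of token ids.

```python
import torch

def clm_inputs(input_ids):
    """CLM: predict token t from tokens < t (used with a causal attention mask)."""
    return input_ids[:, :-1], input_ids[:, 1:]          # inputs, next-token targets

def mlm_inputs(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """MLM: corrupt a random 15% of tokens (80% [MASK], 10% random, 10% unchanged)."""
    labels = input_ids.clone()
    selected = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    labels[~selected] = -100                             # compute loss only on selected positions

    corrupted = input_ids.clone()
    rand = torch.rand_like(input_ids, dtype=torch.float)
    corrupted[selected & (rand < 0.8)] = mask_token_id   # 80% of selected -> [MASK]
    random_ids = torch.randint_like(input_ids, vocab_size)
    replace = selected & (rand >= 0.8) & (rand < 0.9)    # 10% of selected -> random token
    corrupted[replace] = random_ids[replace]
    # the remaining 10% of selected tokens are left unchanged
    return corrupted, labels
```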
2. Epoch Alternation Schedule
AntLM partitions training epochs into blocks dedicated to a single objective and mask. The alternating structure is denoted compactly: e.g., “$n_C + m_M$” indicates $n$ CLM epochs followed by $m$ MLM epochs, and so on. Empirically determined optimal schedules for the BabyLM strict-small track (10M words) are:
| Model | Epoch Schedule | Total Epochs |
|---|---|---|
| BabyLlama | 4_C + 16_M + 4_C | 24 |
| LTG-BERT | 6_C + 60_M + 6_C | 72 |
During a CLM block, the input is uncorrupted and causal masking is used; during an MLM block, 15% of tokens are selected for the cloze task and bidirectional masking is activated. Hyperparameter ablation indicates that alternation order and frequency are relatively stable, but employing short CLM blocks at the beginning and end (“bookends”) produces superior results compared to pure or frequently alternating regimes.
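For concreteness, a small helper (hypothetical; the string format simply mirrors the `4_C + 16_M + 4_C` notation of the table above) can expand such a schedule into a per-epoch list of objectives:

```python
def expand_schedule(schedule: str) -> list[str]:
    """Expand e.g. '4_C + 16_M + 4_C' into ['CLM']*4 + ['MLM']*16 + ['CLM']*4."""
    names = {"C": "CLM", "M": "MLM"}
    epochs = []
    for block in schedule.split("+"):
        count, kind = block.strip().split("_")
        epochs.extend([names[kind]] * int(count))
    return epochs

babyllama_plan = expand_schedule("4_C + 16_M + 4_C")   # 24 epochs
ltgbert_plan   = expand_schedule("6_C + 60_M + 6_C")   # 72 epochs
```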
3. Architectural Implementation
AntLM is architecture-agnostic with respect to the class of Transformer models that support flexible attention masking. The two evaluated models are:
- BabyLlama (Decoder-only, 97M parameters): 12 Transformer layers, 12 attention heads, hidden size 768, intermediate size 2048, vocabulary 16K, sinusoidal positional encoding, GELU activation. No weight modification is required; only the self-attention mask is swapped between causal and full across epochs.
- LTG-BERT (Encoder-only, ~100M parameters): 12 Transformer layers, 12 heads, hidden size 768, intermediate size 2048, vocabulary 16K, 32 relative-position buckets, NormFormer normalization, DeBERTa-style disentangled relative attention, GLU activation. Similarly, the architecture is unchanged between epochs, toggling only the mask and loss objective.
No distillation or additional layers are introduced for AntLM; all changes operate at the attention mask and loss selection level.
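A minimal sketch of the per-epoch mask toggle, assuming an additive attention-mask convention as used in many PyTorch Transformer implementations (the function name is illustrative):

```python
import torch

def attention_mask(seq_len: int, objective: str) -> torch.Tensor:
    """Causal (lower-triangular) mask for CLM epochs, full bidirectional mask for MLM epochs.

    Returns an additive mask: 0 where attention is allowed, -inf where it is blocked.
    """
    if objective == "CLM":
        allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    else:  # "MLM"
        allowed = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.zeros(seq_len, seq_len).masked_fill(~allowed, float("-inf"))
```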
4. Training Setup and Data Preprocessing
Experiments employ the BabyLM 2024 strict-small track, comprising 10M words of child-accessible texts. Preprocessing involves punctuation normalization, sentence re-segmentation, and duplicate removal (from BootBERT). The optimizer is AdamW (bfloat16 precision).
- BabyLlama: Batch size 512. CLM epochs use a cosine-decay learning-rate schedule; MLM epochs use cosine with restarts (a restart every 4 epochs).
- LTG-BERT: Batch size 1024, cosine with restarts (4-epoch cycles) throughout.
Schedules conform to the ablation-optimized blocks described above.
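These schedules map naturally onto standard PyTorch learning-rate schedulers; the sketch below is an assumption-laden illustration (the placeholder model, default optimizer settings, and `steps_per_epoch` value are not from the paper):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, CosineAnnealingWarmRestarts

model = torch.nn.Linear(768, 768)                  # placeholder for the actual Transformer
optimizer = torch.optim.AdamW(model.parameters())  # the paper trains with AdamW in bfloat16
steps_per_epoch = 1000                             # placeholder value

# CLM blocks: plain cosine decay (shown here for a 4-epoch block)
clm_scheduler = CosineAnnealingLR(optimizer, T_max=4 * steps_per_epoch)

# MLM blocks: cosine with restarts, one restart every 4 epochs
mlm_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=4 * steps_per_epoch)
```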
5. Empirical Results
AntLM demonstrates clear performance improvements over baselines trained solely with CLM or MLM objectives across four benchmark suites: BLiMP, BLiMP_Supplement, EWoK, and GLUE. Macro-average scores are as follows:
| Model/Training | Macro Avg | BLiMP | BLiMP_Suppl | EWoK | GLUE |
|---|---|---|---|---|---|
| BabyLlama (baseline) | 61.1 | 68.1 | 60.4 | 50.4 | 65.5 |
| AntLM_BabyLlama | 62.1 | 69.4 | 60.7 | 51.1 | 67.4 |
| LTG-BERT (baseline) | 63.8 | 62.6 | 65.4 | 62.3 | 64.9 |
| AntLM_LTG-BERT | 66.0 | 72.3 | 62.6 | 63.0 | 66.0 |
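The macro average is the unweighted mean of the four benchmark scores, which can be verified directly from the table (BabyLlama rows shown):

```python
# Macro average = unweighted mean over BLiMP, BLiMP_Suppl, EWoK, GLUE
babyllama_baseline = [68.1, 60.4, 50.4, 65.5]
babyllama_antlm    = [69.4, 60.7, 51.1, 67.4]
print(round(sum(babyllama_baseline) / 4, 2))  # 61.1
print(round(sum(babyllama_antlm) / 4, 2))     # 62.15, reported as 62.1
```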
Notable empirical observations include:
- CLM epochs drive rapid early BLiMP gains, indicative of sequence modeling proficiency.
- MLM epochs preferentially enhance EWoK and GLUE scores, reflecting improved semantic comprehension.
- The alternation yields more robust, balanced performance across tasks than either pure paradigm alone.
- No formal statistical significance tests were reported.
6. Analytical Considerations and Practical Implications
CLM and MLM provide complementary training signals. CLM enforces token-by-token autoregressive prediction, favoring rapid convergence; MLM enforces bidirectional context reconstruction and semantic understanding. Alternating the two objectives lets the model "practice" both cloze-style comprehension and open-ended generation.
Ablation studies indicate the method's robustness to fine-grained schedule variants, with “CLM bookends” consistently outperforming single-paradigm configurations. The finding that AntLM yields gains on small datasets (10M words) suggests that joint objective alternation can extract richer representations from limited corpora, a plausible implication for efficient self-supervised language learning in human-like settings.
Practically, AntLM is a drop-in method requiring only runtime switching between causal and bidirectional masks and loss objectives; no retraining or architectural modification is needed. Scaling to larger corpora, mixing further objectives, and investigating paradigm-conferred “linguistic knowledge” remain open directions for future work (Yu et al., 4 Dec 2024).
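Putting the earlier sketches together, an epoch-level loop realizing the drop-in alternation might look as follows (reusing the illustrative helpers defined above; the `model` call signature and the `[MASK]` token id are assumptions, and this is not the authors' code):

```python
import torch.nn.functional as F

def train_antlm(model, data_loader, optimizer, schedule):
    """Alternate CLM and MLM epochs, toggling only the attention mask and the loss."""
    for epoch, objective in enumerate(expand_schedule(schedule)):
        for input_ids in data_loader:                 # data_loader yields batches of token ids
            if objective == "CLM":
                inputs, targets = clm_inputs(input_ids)
            else:
                inputs, targets = mlm_inputs(input_ids,
                                             mask_token_id=4,      # placeholder [MASK] id
                                             vocab_size=16_000)    # 16K vocabulary as in the paper
            mask = attention_mask(inputs.size(1), objective)
            logits = model(inputs, attn_mask=mask)    # model API is assumed
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                   ignore_index=-100) # -100 marks non-loss positions in MLM
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```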