AntLM: Alternating CLM and MLM
- AntLM is a composite language modeling paradigm that alternates between Causal Language Modeling (CLM) and Masked Language Modeling (MLM), aiming to combine the fast convergence of CLM with the semantic gains of MLM.
- It employs an epoch-level alternation schedule with distinct attention masks, enabling seamless integration into standard Transformer architectures without modifications.
- Empirical evaluations on the BabyLM Challenge reveal balanced performance improvements across benchmarks using models like BabyLlama and LTG-BERT.
AntLM is a composite language modeling paradigm designed to bridge the strengths of Causal Language Modeling (CLM) and Masked Language Modeling (MLM) on Transformer architectures. Unlike approaches that mix CLM and MLM losses simultaneously, AntLM alternates them at the epoch level, toggling self-attention masks accordingly. Leveraging this alternation, AntLM aims to combine the rapid convergence characteristic of CLM with the semantic gains of MLM. This framework was evaluated on the BabyLM Challenge 2024 strict-small track (10M words of child-accessible text) using two architectures: BabyLlama (CLM, decoder-only) and LTG-BERT (MLM, encoder-only). Across various linguistic benchmarks, AntLM produced improved macro-average scores relative to single-paradigm baselines while remaining compatible as a drop-in modification for standard Transformer models (Yu et al., 4 Dec 2024).
1. Training Objectives
AntLM combines two canonical unsupervised objectives—CLM and MLM—by alternating them across contiguous blocks of training epochs. The formal definitions are as follows:
- CLM Loss: $\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$, with a lower-triangular (“causal”) attention mask restricting attention to preceding tokens only.
- MLM Loss: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid x_{\setminus \mathcal{M}})$, where $\mathcal{M}$ is a random 15% subset of token positions; selected tokens are replaced by [MASK] (80%), by a random token (10%), or left unchanged (10%). Bidirectional attention is fully enabled.
In AntLM, no additional scalar weighting is introduced; each epoch is assigned entirely to either CLM or MLM. The total objective is the sum over CLM and MLM epochs: $\mathcal{L}_{\text{AntLM}} = \sum_{e \in E_{\text{CLM}}} \mathcal{L}_{\text{CLM}}^{(e)} + \sum_{e \in E_{\text{MLM}}} \mathcal{L}_{\text{MLM}}^{(e)}$, where $E_{\text{CLM}}$ and $E_{\text{MLM}}$ denote the sets of epochs assigned to each objective.
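As a concrete illustration, the sketch below (a minimal PyTorch example; the `mask_token_id` and `vocab_size` arguments and all tensor names are illustrative, not taken from the paper) shows how CLM next-token targets and the 15% / 80-10-10 MLM corruption described above can be constructed for a batch of token ids.

```python
import torch

def clm_inputs(input_ids):
    """CLM: predict token t from tokens < t (used with a causal attention mask)."""
    return input_ids[:, :-1], input_ids[:, 1:]          # inputs, next-token targets

def mlm_inputs(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """MLM: corrupt a random 15% of tokens (80% [MASK], 10% random, 10% unchanged)."""
    labels = input_ids.clone()
    selected = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    labels[~selected] = -100                             # compute loss only on selected positions

    corrupted = input_ids.clone()
    rand = torch.rand_like(input_ids, dtype=torch.float)
    corrupted[selected & (rand < 0.8)] = mask_token_id   # 80% of selected -> [MASK]
    random_ids = torch.randint_like(input_ids, vocab_size)
    replace = selected & (rand >= 0.8) & (rand < 0.9)    # 10% of selected -> random token
    corrupted[replace] = random_ids[replace]
    # the remaining 10% of selected tokens are left unchanged
    return corrupted, labels
```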
2. Epoch Alternation Schedule
AntLM partitions training epochs into blocks dedicated to a single objective and mask. The alternating structure is denoted compactly: e.g., “$n_C + m_M$” indicates $n$ CLM epochs followed by $m$ MLM epochs, and so on. Empirically determined optimal schedules for the BabyLM strict-small track (10M words) are:
| Model | Epoch Schedule | Total Epochs |
|---|---|---|
| BabyLlama | 4_C + 16_M + 4_C | 24 |
| LTG-BERT | 6_C + 60_M + 6_C | 72 |
During a CLM block, the input is uncorrupted and causal masking is used; during an MLM block, 15% of tokens are selected for the cloze task and bidirectional masking is activated. Hyperparameter ablation indicates that alternation order and frequency are relatively stable, but employing short CLM blocks at the beginning and end (“bookends”) produces superior results compared to pure or frequently alternating regimes.
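For concreteness, a small helper (hypothetical; the string format simply mirrors the `4_C + 16_M + 4_C` notation of the table above) can expand such a schedule into a per-epoch list of objectives:

```python
def expand_schedule(schedule: str) -> list[str]:
    """Expand e.g. '4_C + 16_M + 4_C' into ['CLM']*4 + ['MLM']*16 + ['CLM']*4."""
    names = {"C": "CLM", "M": "MLM"}
    epochs = []
    for block in schedule.split("+"):
        count, kind = block.strip().split("_")
        epochs.extend([names[kind]] * int(count))
    return epochs

babyllama_plan = expand_schedule("4_C + 16_M + 4_C")   # 24 epochs
ltgbert_plan   = expand_schedule("6_C + 60_M + 6_C")   # 72 epochs
```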
3. Architectural Implementation
AntLM is architecture-agnostic with respect to the class of Transformer models that support flexible attention masking. The two evaluated models are:
- BabyLlama (Decoder-only, 97M parameters): 12 Transformer layers, 12 attention heads, hidden size 768, intermediate size 2048, vocabulary 16K, sinusoidal positional encoding, GELU activation. No weight modification is required; only the self-attention mask is swapped between causal and full across epochs.
- LTG-BERT (Encoder-only, ~100M parameters): 12 Transformer layers, 12 heads, hidden size 768, intermediate size 2048, vocabulary 16K, 32 relative-position buckets, NormFormer normalization, DeBERTa-style disentangled relative attention, GLU activation. Similarly, the architecture is unchanged between epochs, toggling only the mask and loss objective.
No distillation or additional layers are introduced for AntLM; all changes operate at the attention mask and loss selection level.
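A minimal sketch of the per-epoch mask toggle, assuming an additive attention-mask convention as used in many PyTorch Transformer implementations (the function name is illustrative):

```python
import torch

def attention_mask(seq_len: int, objective: str) -> torch.Tensor:
    """Causal (lower-triangular) mask for CLM epochs, full bidirectional mask for MLM epochs.

    Returns an additive mask: 0 where attention is allowed, -inf where it is blocked.
    """
    if objective == "CLM":
        allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    else:  # "MLM"
        allowed = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.zeros(seq_len, seq_len).masked_fill(~allowed, float("-inf"))
```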
4. Training Setup and Data Preprocessing
Experiments employ the BabyLM 2024 strict-small track, comprising 10M words of child-accessible texts. Preprocessing involves punctuation normalization, sentence re-segmentation, and duplicate removal (from BootBERT). The optimizer is AdamW (bfloat16 precision).
- BabyLlama: Batch size 512. CLM epochs use a cosine-decay learning-rate schedule; MLM epochs use cosine with restarts (a restart every 4 epochs).
- LTG-BERT: Batch size 1024, cosine with restarts (4-epoch cycles) throughout.
Schedules conform to the ablation-optimized blocks described above.
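These schedules map naturally onto standard PyTorch learning-rate schedulers; the sketch below is an assumption-laden illustration (the placeholder model, default optimizer settings, and `steps_per_epoch` value are not from the paper):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, CosineAnnealingWarmRestarts

model = torch.nn.Linear(768, 768)                  # placeholder for the actual Transformer
optimizer = torch.optim.AdamW(model.parameters())  # the paper trains with AdamW in bfloat16
steps_per_epoch = 1000                             # placeholder value

# CLM blocks: plain cosine decay (shown here for a 4-epoch block)
clm_scheduler = CosineAnnealingLR(optimizer, T_max=4 * steps_per_epoch)

# MLM blocks: cosine with restarts, one restart every 4 epochs
mlm_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=4 * steps_per_epoch)
```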
5. Empirical Results
AntLM demonstrates clear performance improvements over baselines trained solely with CLM or MLM objectives across four benchmark suites: BLiMP, BLiMP_Supplement, EWoK, and GLUE. Macro-average scores are as follows:
| Model/Training | Macro Avg | BLiMP | BLiMP_Suppl | EWoK | GLUE |
|---|---|---|---|---|---|
| BabyLlama (baseline) | 61.1 | 68.1 | 60.4 | 50.4 | 65.5 |
| AntLM_BabyLlama | 62.1 | 69.4 | 60.7 | 51.1 | 67.4 |
| LTG-BERT (baseline) | 63.8 | 62.6 | 65.4 | 62.3 | 64.9 |
| AntLM_LTG-BERT | 66.0 | 72.3 | 62.6 | 63.0 | 66.0 |
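The macro average is the unweighted mean of the four benchmark scores, which can be verified directly from the table (BabyLlama rows shown):

```python
# Macro average = unweighted mean over BLiMP, BLiMP_Suppl, EWoK, GLUE
babyllama_baseline = [68.1, 60.4, 50.4, 65.5]
babyllama_antlm    = [69.4, 60.7, 51.1, 67.4]
print(round(sum(babyllama_baseline) / 4, 2))  # 61.1
print(round(sum(babyllama_antlm) / 4, 2))     # 62.15, reported as 62.1
```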
Notable empirical observations include:
- CLM epochs drive rapid early BLiMP gains, indicative of sequence modeling proficiency.
- MLM epochs preferentially enhance EWoK and GLUE scores, reflecting improved semantic comprehension.
- The alternation yields more robust, balanced performance across tasks than either pure paradigm alone.
- No formal statistical significance tests were reported.
6. Analytical Considerations and Practical Implications
CLM and MLM provide complementary training signals. CLM enforces token-by-token autoregressive prediction, favoring rapid convergence; MLM enforces bidirectional context reconstruction and semantic understanding. Alternating the two objectives lets the model "practice" both cloze-style comprehension and open-ended generation.
Ablation studies indicate the method's robustness to fine-grained schedule variants, with “CLM bookends” consistently outperforming single-paradigm configurations. The finding that AntLM yields gains on small datasets (10M words) suggests that joint objective alternation can extract richer representations from limited corpora, a plausible implication for efficient self-supervised language learning in human-like settings.
Practically, AntLM is a drop-in method requiring only runtime switching between causal and bidirectional masks and loss objectives; no retraining or architectural modification is needed. Scaling to larger corpora, mixing further objectives, and investigating paradigm-conferred “linguistic knowledge” remain open directions for future work (Yu et al., 4 Dec 2024).
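Putting the earlier sketches together, an epoch-level loop realizing the drop-in alternation might look as follows (reusing the illustrative helpers defined above; the `model` call signature and the `[MASK]` token id are assumptions, and this is not the authors' code):

```python
import torch.nn.functional as F

def train_antlm(model, data_loader, optimizer, schedule):
    """Alternate CLM and MLM epochs, toggling only the attention mask and the loss."""
    for epoch, objective in enumerate(expand_schedule(schedule)):
        for input_ids in data_loader:                 # data_loader yields batches of token ids
            if objective == "CLM":
                inputs, targets = clm_inputs(input_ids)
            else:
                inputs, targets = mlm_inputs(input_ids,
                                             mask_token_id=4,      # placeholder [MASK] id
                                             vocab_size=16_000)    # 16K vocabulary as in the paper
            mask = attention_mask(inputs.size(1), objective)
            logits = model(inputs, attn_mask=mask)    # model API is assumed
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                   ignore_index=-100) # -100 marks non-loss positions in MLM
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```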