
AntLM BabyLlama: Hybrid Pretraining Model

Updated 19 November 2025
  • AntLM BabyLlama is a hybrid modeling paradigm that alternates between causal and masked language modeling within a single decoder-only Transformer to harness both generative and bidirectional strengths.
  • It employs a piecewise epoch schedule (4 CLM + 16 MLM + 4 CLM) to achieve measurable performance improvements in resource-constrained 10M-token pretraining regimes.
  • The model demonstrates that simple attention mask manipulation can bridge the gap between efficient sequence generation and robust context understanding without modifying the underlying architecture.

AntLM$_{\text{BabyLlama}}$ is a hybrid language modeling paradigm that alternates between causal language modeling (CLM) and masked language modeling (MLM) within a single decoder-only Transformer architecture. Developed in the context of the BabyLM Challenge 2024 strict-small track, AntLM$_{\text{BabyLlama}}$ leverages the BabyLlama model as its base and demonstrates that alternating CLM and MLM objectives (along with corresponding attention masking schemes) yields measurable improvements over pure-CLM pretraining, specifically for resource-constrained (10M token) regimes. The approach is motivated by the complementary properties of the CLM and MLM paradigms, with CLMs excelling in generative pretraining efficiency but limited by their uni-directional context, and MLMs learning robust bidirectional representations but converging more slowly and being ill-suited for text generation (Yu et al., 4 Dec 2024).

1. Theoretical Motivation and Problem Setting

AntLM$_{\text{BabyLlama}}$ is predicated on the empirical observation that CLMs and MLMs possess orthogonal strengths. CLMs, instantiated as autoregressive, decoder-only Transformers (e.g., GPT-style), predict each token $x_t$ using only preceding tokens $x_1, \ldots, x_{t-1}$; they display rapid pretraining convergence and excel at next-token prediction and coherent sequence generation. However, the inability to access right-hand context inherently limits CLM effectiveness on tasks requiring bidirectional comprehension, such as cloze and various classification benchmarks.

In contrast, MLMs (BERT-style, encoder-only) mask a subset $\mathcal{M}$ of tokens in an input $x_1, \ldots, x_T$ and train the model to predict $x_i$ for each $i \in \mathcal{M}$, with access to both left and right contexts. MLMs thus acquire bidirectional representations conducive to understanding tasks but forego efficient sequence generation and exhibit slower convergence on small data. AntLM$_{\text{BabyLlama}}$ aims to integrate these regimes, enabling a decoder-only Transformer to absorb both generative and bidirectional strengths (Yu et al., 4 Dec 2024).
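
As a toy illustration (ours, not drawn from the paper), consider a four-token sequence $x_1 x_2 x_3 x_4$ with masked set $\mathcal{M} = \{3\}$: the causal objective scores every position from its left context only, whereas the masked objective scores only position 3 but conditions on both sides.

$\mathcal{L}_{\text{CLM}} = -\big[\log P(x_1) + \log P(x_2 \mid x_1) + \log P(x_3 \mid x_1, x_2) + \log P(x_4 \mid x_1, x_2, x_3)\big]$
$\mathcal{L}_{\text{MLM}} = -\log P(x_3 \mid x_1, x_2, x_4)$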

2. Model Architecture, Training Objectives, and Scheduling

AntLM$_{\text{BabyLlama}}$ uses the BabyLlama architecture: a decoder-only Transformer with 12 layers, 12 attention heads per layer, hidden size 768, intermediate size 2048, and a fixed vocabulary of 16k. The architectural details (GLU feed-forward activations, positional encoding) match the original BabyLlama specification.
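
For concreteness, a model of this size can be instantiated as follows. This is a minimal sketch using the Hugging Face Transformers LlamaConfig, not the authors' released code; default Llama choices stand in for any BabyLlama details the summary does not specify.

from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical configuration mirroring the reported sizes; unspecified
# settings fall back to LlamaConfig defaults.
config = LlamaConfig(
    vocab_size=16_000,        # fixed 16k vocabulary
    hidden_size=768,          # hidden size 768
    intermediate_size=2048,   # feed-forward (GLU) inner dimension
    num_hidden_layers=12,     # 12 decoder layers
    num_attention_heads=12,   # 12 attention heads per layer
)
model = LlamaForCausalLM(config)  # one decoder-only Transformer, one shared parameter set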

Parameter unification is achieved by sharing all weights for both objectives. During pretraining, AntLM alternates between two regimes:

  • CLM Phase:

    • Attention mask: lower-triangular causal mask ($M_{i,j}=1$ iff $j\leq i$)
    • Loss:

    $\mathcal{L}_{\text{Causal}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$

  • MLM Phase:

    • Attention mask: full (bidirectional, all positions attend to all others)
    • Input: Approximately 15% of tokens masked per BERT protocol ([MASK], random, or unchanged at 80/10/10%)
    • Loss:

    $\mathcal{L}_{\text{Masked}} = -\sum_{i\in \mathcal{M}} \log P(x_i \mid x_{\notin \mathcal{M}})$

The explicit AntLM loss can be written as

$\mathcal{L}_{\text{AntLM}} = \alpha\, \mathcal{L}_{\text{Causal}} + (1-\alpha)\, \mathcal{L}_{\text{Masked}}$

In practice, AntLM$_{\text{BabyLlama}}$ uses a piecewise epoch-based schedule: entire epochs are devoted to one regime ($\alpha=1$ for CLM epochs, $\alpha=0$ for MLM epochs). The empirically optimal curriculum is 4 epochs of CLM, followed by 16 MLM epochs, then a final 4 CLM epochs (4_CLM + 16_MLM + 4_CLM over a fixed 24-epoch, 10M-token regime).
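
A minimal sketch of this piecewise schedule (helper names are ours, not from the paper): each epoch is assigned a single objective, which is equivalent to setting $\alpha$ to 1 or 0 for that epoch.

def build_schedule(n_clm_head=4, n_mlm=16, n_clm_tail=4):
    # 4_CLM + 16_MLM + 4_CLM = 24 epochs in total
    return ["clm"] * n_clm_head + ["mlm"] * n_mlm + ["clm"] * n_clm_tail

schedule = build_schedule()
alphas = [1.0 if phase == "clm" else 0.0 for phase in schedule]  # per-epoch alpha
assert len(schedule) == 24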

3. Attention Mask Manipulation and Implementation Details

Attention masking swaps at epoch boundaries (see the code sketch after this list):

  • Causal Mask (CLM):

Tokens at position $t$ attend to positions $1, \ldots, t$ only.

  • Bidirectional Mask (MLM):

Standard bidirectional Transformer self-attention; masked input tokens are replaced per the BERT corruption protocol, and the loss is computed only at the masked positions.
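
The two masks can be built directly in PyTorch; this is an illustrative sketch, not the released implementation (True marks positions a query token may attend to).

import torch

T = 8  # illustrative sequence length

# CLM phase: lower-triangular (causal) mask, position t attends to positions 1..t
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# MLM phase: full bidirectional mask, every position attends to every other
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)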

Training loop pseudocode:

for epoch in range(1, E_total + 1):                  # E_total = 24 in the reported setup
    if epoch in CLM_epochs:                          # epochs 1-4 and 21-24 under 4_CLM + 16_MLM + 4_CLM
        mask = causal_lower_triangle                 # position t attends only to positions 1..t
        input_tokens = full_sequence                 # no token corruption in the CLM phase
        loss = L_Causal(input_tokens, mask)          # next-token cross-entropy over all positions
    else:                                            # MLM epochs 5-20
        mask = full_matrix                           # bidirectional attention
        input_tokens = apply_bert_masking(sequence)  # ~15% of tokens corrupted (80/10/10)
        loss = L_Masked(input_tokens, mask)          # cross-entropy on masked positions only
    backpropagate(loss)                              # gradients flow into one shared parameter set
    update(AdamW)                                    # AdamW step with the phase-specific LR schedule
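
The apply_bert_masking helper referenced above is not spelled out in the summary; a hedged sketch of the standard 80/10/10 BERT corruption it describes (mask_token_id and vocab_size are placeholders, not values from the released tokenizer) might look like this:

import torch

def apply_bert_masking(tokens, mask_token_id, vocab_size, mask_prob=0.15):
    # Select ~15% of positions as prediction targets.
    tokens = tokens.clone()
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_prob
    labels[~selected] = -100                               # compute the loss only at masked positions
    roll = torch.rand(tokens.shape)
    tokens[selected & (roll < 0.8)] = mask_token_id        # 80%: replace with [MASK]
    random_slot = selected & (roll >= 0.8) & (roll < 0.9)  # 10%: replace with a random token
    tokens[random_slot] = torch.randint(vocab_size, tokens.shape)[random_slot]
    # Remaining 10%: token left unchanged but still predicted.
    return tokens, labels
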
Key hyperparameters: batch size 512, AdamW optimizer (initial LR $7\times10^{-4}$), cosine decay (CLM) or cosine-with-restarts (MLM) learning rate scheduling, data type bfloat16, preprocessing with the BootBERT pipeline (punctuation normalization, deduplication, sentence reconstruction).
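
The optimization setup can be reproduced in outline as below. This is a sketch only: warmup steps, step counts, and the restart cycle count are assumptions rather than reported values, and model refers to the module from the earlier configuration sketch.

import torch
from transformers import (
    get_cosine_schedule_with_warmup,
    get_cosine_with_hard_restarts_schedule_with_warmup,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=7e-4)   # reported initial LR

# Cosine decay for CLM epochs, cosine-with-restarts for MLM epochs (step counts assumed).
clm_scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000)
mlm_scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000, num_cycles=4)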

4. Experimental Evaluation and Benchmark Performance

Evaluation is carried out on BabyLM2024’s four standard benchmarks:

  • BLiMP: zero-shot syntactic generalization
  • BLiMP Supplement: further syntactic control tasks
  • EWoK: world knowledge evaluation
  • GLUE: language understanding, sentence- and sentence-pair-level tasks

Macro-average is defined as the mean of the per-benchmark results.
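
Equivalently, as a formula over the four benchmark scores:

$\text{Macro-Avg} = \tfrac{1}{4}\left(\text{BLiMP} + \text{BLiMP}_{\text{Supplement}} + \text{EWoK} + \text{GLUE}\right)$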

Empirical results under 24 epochs and 10M tokens:

Model                      | BLiMP | BLiMP_S | EWoK | GLUE | Macro-Avg
BabyLlama (baseline)       | 68.1  | 60.4    | 50.4 | 65.5 | 61.1
AntLM$_{\text{BabyLlama}}$ | 69.4  | 60.7    | 51.1 | 67.4 | 62.1

AntLM$_{\text{BabyLlama}}$ yields a consistent +1.0 percentage point improvement in macro-average. BLiMP, EWoK, and GLUE scores all exhibit increases; BLiMP Supplement improves slightly. Learning-curve analysis demonstrates that generative scoring advances rapidly in CLM regimes, while MLM phases contribute to robust bidirectional representations.

Ablation studies confirm that the 4_CLM+16_MLM+4_CLM epoch schedule outperforms pure-CLM, pure-MLM, and other interleavings.

5. Comparative Analysis and Practical Implications

Contrasted with pure-CLM or pure-MLM approaches, AntLM$_{\text{BabyLlama}}$ achieves both efficient convergence and strong cross-benchmark performance. The regime is lightweight: it requires no architectural changes to BabyLlama, only manipulation of the attention mask and loss function during pretraining.

A practical implication is that this procedure enables a single decoder-only Transformer to avoid the tradeoff between generation-ready modeling and representation-rich masked training. The method is particularly advantageous in data-constrained scenarios (e.g., the strict-small 10M-token track) where maximizing representational yield per sample is critical.

6. Open Directions and Extensions

AntLM$_{\text{BabyLlama}}$ opens several avenues for research:

  • Generalizing the epoch slicing into continuous $\alpha\in(0,1)$ mixing during training.
  • Application to larger corpora (e.g., BabyLM 100M track).
  • Integration with curriculum learning or dynamic scheduling of CLM/MLM based on convergence or task signals.
  • Transfer to encoder-decoder (seq2seq) and multilingual architectures.

This suggests that “alternating mask-and-objective” pretraining may have broader applicability beyond small-scale LLM pretraining, especially where resource or deployment constraints favor decoder-only architectures (Yu et al., 4 Dec 2024).

7. Significance and Context Within LLM Research

AntLM$_{\text{BabyLlama}}$ embodies an emerging research direction: integrating multiple self-supervised objectives within a unified architecture by scheduling, rather than ensemble or multitask approaches. Unlike teacher-student distillation (as in BabyLlama-based mode-seeking or mode-averaging via KL divergences (Shi et al., 29 Oct 2024)), AntLM$_{\text{BabyLlama}}$ directly modifies the training regime. The method reaffirms that parameter sharing and judicious scheduling can bridge distinct language modeling paradigms (“two birds, one stone”), with the potential to inform future pretraining paradigms for general language understanding and generation.

