NeoBERT: Next-Gen Transformer Encoder
- NeoBERT is a transformer-based bidirectional encoder with 28 layers, SwiGLU activation, and RoPE positional encoding, and it reports state-of-the-art results on standard encoder benchmarks.
- It leverages a two-stage pre-training regime on a 600B token RefinedWeb corpus, extending context windows up to 4,096 tokens while maintaining a compact parameter footprint.
- Empirical evaluations on GLUE and MTEB demonstrate that NeoBERT outperforms traditional BERT variants, offering improved accuracy and efficiency in language understanding tasks.
NeoBERT is a next-generation transformer-based bidirectional encoder designed to close the gap between recent progress in auto-regressive LLMs and classical masked language encoders. By integrating contemporary architectural choices, modern large-scale filtered web corpora, and optimized pre-training and fine-tuning protocols, NeoBERT establishes new state-of-the-art performance on both the GLUE and MTEB benchmarks with a compact parameter footprint and drop-in compatibility with prevalent BERT-based applications (Breton et al., 26 Feb 2025).
1. Model Architecture and Scaling
NeoBERT adopts a BERT-sized hidden dimension ($768$) and increases depth to 28 transformer layers, yielding a core parameter count of approximately $198$M for the transformer blocks and a total of $250$M after incorporating embeddings, normalization, and classifier heads. The model uses 12 attention heads per layer. The feed-forward intermediate size is set so that switching from GeLU (standard in BERT) to the SwiGLU (Swish-Gated Linear Unit) activation, which requires three projection matrices instead of two, preserves the parameter budget. Normalization is handled via Pre-LayerNorm with RMSNorm, replacing the canonical Post-LayerNorm scheme.
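The parameter budget described above can be checked with a rough count. The sketch below is illustrative only: it ignores biases and normalization weights, and it assumes a SwiGLU intermediate size equal to two thirds of BERT's usual $4 \times 768$, a value chosen here so that three projection matrices cost the same as two. Under that assumption the 28-layer stack comes out close to the $198$M figure quoted for the transformer blocks.

```python
# Rough parameter count for the encoder stack, ignoring biases and norm weights.
# ASSUMPTION: the SwiGLU intermediate size is 2/3 of BERT's 4 * d_model.


def attention_params(d_model: int) -> int:
    # Query, key, value, and output projections: four d_model x d_model matrices.
    return 4 * d_model * d_model


def gelu_ffn_params(d_model: int, d_ff: int) -> int:
    # Classic BERT feed-forward block: one up-projection and one down-projection.
    return 2 * d_model * d_ff


def swiglu_ffn_params(d_model: int, d_ff: int) -> int:
    # SwiGLU uses three matrices: gate, up, and down projections.
    return 3 * d_model * d_ff


D_MODEL = 768
N_LAYERS = 28
BERT_D_FF = 4 * D_MODEL            # 3072, the standard BERT-base value
SWIGLU_D_FF = 2 * BERT_D_FF // 3   # 2048, assumed for illustration

same_budget = gelu_ffn_params(D_MODEL, BERT_D_FF) == swiglu_ffn_params(D_MODEL, SWIGLU_D_FF)
per_layer = attention_params(D_MODEL) + swiglu_ffn_params(D_MODEL, SWIGLU_D_FF)
core_params = N_LAYERS * per_layer

print(same_budget)                    # True
print(f"{core_params / 1e6:.1f}M")    # 198.2M
```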
For positional encoding, NeoBERT employs Rotary Position Embeddings (RoPE) for robust relative position modeling, and is compatible with YaRN context extension mechanisms. The context window is extended up to $4,096$ tokens via a two-stage pre-training regime. Initially, NeoBERT is trained at a maximum length of $1,024$, followed by further training at $4,096$ using long-sequence subset sampling to mitigate distributional shift.
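To illustrate why RoPE yields relative position modeling, here is a minimal pure-Python sketch of the rotation, not NeoBERT's actual implementation. The function names and the base of 10000 are assumptions taken from the common RoPE formulation. The point it shows is that the attention score between a rotated query and a rotated key depends only on the distance between their positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d), as in the common formulation.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors for a single attention head.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are two tokens apart, so the scores agree.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 7), rope(k, 5))
print(abs(score_a - score_b) < 1e-9)
```

Because each rotation preserves vector norms, the encoding changes only the relative angle between query and key, which is part of what makes length-extension schemes such as YaRN applicable.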
2. Pre-Training Regime
NeoBERT's pre-training uses the RefinedWeb corpus, a filtered variant of CommonCrawl comprising approximately $600$B tokens, together with a 30K-token WordPiece vocabulary identical to BERT's. Total pre-training exposure spans about $2.1$T tokens, so the corpus is seen multiple times. The objective is masked language modeling (MLM) with a 20% masking rate, where every masked token is replaced by [MASK].
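The masking scheme described above can be sketched as follows. This is a simplified illustration rather than the released preprocessing code: the function name is hypothetical, tokens are plain strings, each position is masked independently, and special tokens are not treated separately.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_for_mlm(tokens, mask_rate=0.20, seed=None):
    """Return (inputs, labels) for masked language modeling.

    Each token is selected with probability `mask_rate`. Every selected token
    is replaced by [MASK]; there is no random-token or keep-original branch.
    Labels are None at positions that do not contribute to the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK_TOKEN)
            labels.append(tok)      # the model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)     # position ignored by the loss
    return inputs, labels


sentence = "the encoder reads the whole sentence in both directions".split()
masked, targets = mask_for_mlm(sentence, seed=0)
print(masked)
print(targets)
```

This differs from the original BERT recipe, which masks 15% of tokens and replaces only 80% of those with [MASK].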
Optimization relies on AdamW with weight decay 0.1 and gradient clipping at norm 1.0. The learning rate follows a $2$K-step linear warmup to its peak value, then a cosine decay to 10% of the peak over 90% of the training steps, and is held constant thereafter.
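The schedule can be sketched as a small function. This is one reading of the description, assuming the cosine phase ends at 90% of the total steps and the rate then stays at the 10% floor. The function name and parameters are hypothetical, and the peak learning rate is left as an argument because its value is not given here.

```python
import math


def learning_rate(step, peak_lr, total_steps,
                  warmup_steps=2_000, decay_fraction=0.9, floor_ratio=0.1):
    """Linear warmup, cosine decay to a floor, then constant."""
    floor_lr = floor_ratio * peak_lr
    decay_end = round(decay_fraction * total_steps)  # assumed end of the cosine phase
    if step < warmup_steps:
        # Linear ramp from 0 to the peak.
        return peak_lr * step / warmup_steps
    if step < decay_end:
        # Cosine interpolation from the peak down to the floor.
        progress = (step - warmup_steps) / (decay_end - warmup_steps)
        return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))
    # Constant tail at the floor.
    return floor_lr


TOTAL_STEPS = 1_050_000  # step count reported for the full training run
for s in (0, 1_000, 2_000, 500_000, 945_000, 1_050_000):
    # With peak_lr=1.0 the output is the multiplier applied to the peak rate.
    print(s, round(learning_rate(s, peak_lr=1.0, total_steps=TOTAL_STEPS), 4))
```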
Resource efficiency comes from DeepSpeed ZeRO for distributed training, FlashAttention for memory-efficient attention computation, and xFormers fused kernels for small tensor operations, with tensor shapes chosen as multiples of 64.
3. Empirical Evaluation and Ablation
Ablation studies on GLUE with ten models, each altering a single design aspect, attribute the largest performance gain (+3.6% GLUE) to the switch of pre-training corpus to RefinedWeb. Other significant gains come from model size scaling (+2.9%), increased context window and batch size (+1.5%), and architectural modifications such as RoPE, SwiGLU, and RMSNorm (+0.8%). Replacing the tokenizer with LLaMA BPE and applying aggressive sequence packing both reduce GLUE scores. All ablation models were trained for $1$M steps with identical seeds.
4. Fine-Tuning Methodology and Benchmark Results
GLUE (General Language Understanding Evaluation)
The same task-specific fine-tuning procedure was applied to every model, with a hyperparameter search over batch size, learning rate, and weight decay. Transfer learning initializes four GLUE tasks from the best MNLI checkpoint.
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| BERT | 84.0 | 90.5 | 71.2 | 66.4 | 93.5 | 88.9 | 52.1 | 85.8 | 79.6 |
| RoBERTa | 87.6 | 92.8 | 91.9 | 78.7 | 94.8 | 90.2 | 63.6 | 91.2 | 86.4 |
| ModernBERT | 89.1 | 93.9 | 92.1 | 87.4 | 96.0 | 92.2 | 65.1 | 91.8 | 88.5 |
| NeoBERT | 89.0 | 93.7 | 90.7 | 91.3 | 95.6 | 93.4 | 66.2 | 91.8 | 89.0 |
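NeoBERT attains the highest average score in the table (89.0) and leads on RTE, MRPC, and CoLA, while ModernBERT is slightly ahead on MNLI, QNLI, QQP, and SST-2. Under identical fine-tuning protocols, NeoBERT outperforms BERT and RoBERTa on average and matches or surpasses the highest published GLUE scores for models in its size class.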
MTEB (Massive Text Embedding Benchmark, English Subset)
Fine-tuning applies mean pooling to the final hidden states and a contrastive loss based on cosine similarity, using in-batch negatives and hard negatives over 9M query–document pairs.
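A common form of such an objective is the InfoNCE loss, sketched below in plain Python. This is an assumed formulation, not necessarily the exact loss used for NeoBERT: the function names and the temperature of 0.05 are hypothetical, and vectors are plain lists rather than tensors. For each query, the matching document is the positive, and all other documents in the batch plus any hard negatives act as negatives.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [v / max(count, 1) for v in total]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def info_nce(queries, documents, hard_negatives=(), temperature=0.05):
    """Mean cross-entropy where documents[i] is the positive for queries[i].

    ASSUMPTION: temperature-scaled cosine similarity; the temperature value
    is illustrative only.
    """
    candidates = list(documents) + list(hard_negatives)
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, c) / temperature for c in candidates]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(queries)


# Toy batch: aligned pairs give a much lower loss than mismatched pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```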
| Model | Class. | Clust. | PairCl. | Rerank. | Retriev. | STS | Summ. | Avg. |
|---|---|---|---|---|---|---|---|---|
| NeoBERT | 61.6 | 40.8 | 76.2 | 51.2 | 31.6 | 74.8 | 30.7 | 51.3 |
NeoBERT achieves the highest average score (51.3), ranking first in five of seven MTEB task types and outperforming BERT, RoBERTa, NomicBERT, and ModernBERT. Pseudo-perplexity indicates NeoBERT maintains low perplexity for input sequence lengths up to $4,096$, a threshold beyond which prior versions degrade.
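For reference, pseudo-perplexity is usually computed for masked language models by masking each token in turn, recording the log-probability the model assigns to the true token, and exponentiating the negative mean. The helper below is a minimal sketch of that final step. It assumes the per-token log-probabilities have already been obtained from a model, and the function name is hypothetical.

```python
import math


def pseudo_perplexity(token_log_probs):
    """exp(-mean log P(token_i | all other tokens)) over one sequence.

    `token_log_probs` holds natural-log probabilities, one per position,
    each obtained by masking that position and scoring the true token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# If the model assigns probability 0.5 to every true token, the value is 2.
print(pseudo_perplexity([math.log(0.5)] * 4))
```

A value that stays low as the sequence length grows suggests the model still uses its context effectively at that length.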
5. Deployment and Resource Considerations
NeoBERT is designed as a plug-and-play base encoder with a BERT-sized hidden dimension of $768$, allowing direct substitution for BERT-compatible models without changes to downstream code. The model is released via Hugging Face and is compatible with Transformers v4.x. Training required approximately $6$k GPU-hours on 8×H100 accelerators for $1,050$k steps. Inference throughput benefits from FlashAttention. NeoBERT's total parameter count ($250$M) is substantially smaller than that of RoBERTa-large ($355$M) and ModernBERT-large ($395$M).
Complete code, data preprocessing, training, and fine-tuning recipes are available, supporting reproducibility and further research.
6. Architectural and Methodological Innovations
NeoBERT leverages a combination of innovations: optimal depth-to-width scaling (28 layers × 768 hidden), architectural upgrades (RoPE for flexibility in position encoding, Pre-LN+RMSNorm for training stability, SwiGLU activation for efficient feed-forward modeling), and a two-phase extended-context training protocol. The decision to adopt RefinedWeb as the primary pre-training corpus delivers the single largest empirical performance boost among ablation factors. NeoBERT does not employ sequence packing, which was found to reduce GLUE performance.
A uniform and rigorous fine-tuning framework allows for direct comparability with prior baselines on both GLUE and the extensive task types encompassed by the MTEB benchmark.
7. Significance and Accessibility
NeoBERT demonstrates that bidirectional transformer encoders can match or exceed large auto-regressive models and prior encoder variants on comprehensive language understanding and text embedding tasks, while using fewer parameters. The model's open-source status, systematic ablation, and detailed reproducibility documentation provide broad utility for academic and industrial adoption, and may serve as a modern baseline for both encoder-specific NLP tasks and research into scalable, context-extended representations (Breton et al., 26 Feb 2025).