NeoBERT: A Next-Generation BERT (2502.19587v2)

Published 26 Feb 2025 in cs.CL and cs.AI

Abstract: Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive LLMs such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.

Summary

  • The paper introduces NeoBERT, a next-generation bidirectional language model featuring strategic architectural refinements and a comprehensive two-stage pre-training strategy for improved efficiency and context modeling.
  • NeoBERT is pre-trained with a two-stage strategy that processes 2.1T tokens in total, extending the maximum sequence length it can handle from 1,024 to 4,096 tokens (a schematic sketch of such a schedule follows this list).
  • Evaluations show that despite its compact 250M parameter size, NeoBERT achieves state-of-the-art performance on GLUE and MTEB benchmarks, delivering significant relative accuracy gains and faster inference speeds at longer sequences.

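Concretely, the two-stage schedule can be pictured as two training phases that differ mainly in the maximum packed sequence length. The sketch below is one hypothetical way to express that setup: the dataclass, its field names, and the per-stage token split are illustrative assumptions; only the 1,024 to 4,096 context extension and the roughly 2.1T-token total come from the summary above.

```python
# Hypothetical two-stage pre-training schedule in the spirit of the summary.
# The per-stage token split is an assumption; the summary only gives the
# 2.1T total and the 1,024 -> 4,096 context extension.
from dataclasses import dataclass


@dataclass
class PretrainStage:
    max_seq_len: int   # longest sequence packed into a training example
    total_tokens: int  # tokens processed before moving to the next stage


STAGES = [
    PretrainStage(max_seq_len=1_024, total_tokens=2_000_000_000_000),  # assumed split
    PretrainStage(max_seq_len=4_096, total_tokens=100_000_000_000),    # assumed split
]

assert sum(s.total_tokens for s in STAGES) == 2_100_000_000_000
```
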
NeoBERT redefines the bidirectional encoder through precise architectural refinements and a comprehensive two-stage pre-training regime.

  • The design rebalances the depth-to-width ratio, increasing depth from 16 to 28 layers while fixing the hidden dimension at 768, and replaces standard components with modern ones: rotary position embeddings (RoPE) instead of absolute positional embeddings, RMSNorm instead of LayerNorm, and SwiGLU instead of the GELU feed-forward, for better parameter efficiency and longer-context modeling (see the block sketch after this list).
  • It is pre-trained from scratch on the 600B-token RefinedWeb dataset, with a two-stage training strategy that extends the maximum sequence length from 1,024 to 4,096 tokens over a total of 2.1T training tokens.
  • Evaluations on GLUE and MTEB show that, despite a compact 250M-parameter footprint, NeoBERT achieves state-of-the-art performance, with a +4.5% relative gain over leading models and a 46.7% inference speedup at 4,096-token sequences.

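The architectural changes listed above follow the now-common modernized-transformer recipe. Below is a minimal PyTorch sketch of such a pre-norm encoder block, with RMSNorm in place of LayerNorm, a SwiGLU feed-forward, and rotary position embeddings applied to queries and keys; the module names, the RoPE helper, and hyperparameters such as the feed-forward expansion are assumptions for illustration, not the released NeoBERT code.

```python
# Minimal sketch of a modernized bidirectional encoder block as described in
# the summary (RMSNorm, SwiGLU, RoPE). All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features (no mean-centering).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms


def rotary_embed(x, base: float = 10_000.0):
    # Apply rotary position embeddings to a (B, H, T, D) tensor,
    # rotating feature pairs (i, i + D/2) by position-dependent angles.
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device).float() / half)
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()            # shape (T, D/2)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Gated feed-forward: SiLU(gate) elementwise-scales the up projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=4 * dim)       # expansion factor assumed

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.head_dim)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        q, k = rotary_embed(q), rotary_embed(k)      # positions injected via RoPE
        attn = F.scaled_dot_product_attention(q, k, v)  # bidirectional: no causal mask
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.proj(attn)                      # pre-norm residual connections
        return x + self.mlp(self.norm2(x))
```

Stacking such blocks (the summary cites 28 layers at width 768) gives a model in the few-hundred-million-parameter range; exact counts depend on the feed-forward expansion, embeddings, and task head, none of which are pinned down here.
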
HackerNews

  1. NeoBERT: A Next-Generation Bert (2 points, 0 comments)