An Overview of "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference"
The paper "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference" introduces ModernBERT, an advanced encoder-only transformer model, which serves as a significant improvement over existing models such as BERT and RoBERTa. By integrating modern architectural enhancements and leveraging extensive pretraining datasets, ModernBERT excels in both performance efficiency and task versatility.
Key Contributions
ModernBERT incorporates a series of architectural and efficiency improvements:
- Architectural Enhancements:
  - The standard transformer architecture is refined with recent advances such as GeGLU feed-forward layers, rotary positional embeddings (RoPE), and alternating attention, in which layers switch between global and local attention. This design supports both short- and long-context processing (a minimal sketch of these pieces appears after this list).
- Efficiency Improvements:
  - The model adopts Flash Attention and unpadding, which removes padding tokens across sequences so compute is spent only on real tokens; the authors report that their unpadding implementation boosts throughput by 10-20% (a simplified unpadding example also follows the list).
- Model Design:
  - ModernBERT is designed to be hardware-aware: model dimensions are chosen so that matrix operations tile efficiently onto the tensor cores of common GPUs, maximizing utilization on the hardware the model is most likely to run on.
- Extensive Pretraining:
  - ModernBERT is trained on 2 trillion tokens spanning English text and code, unlike prior encoders that focused almost exclusively on natural-language text.
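To make the GeGLU and alternating-attention pieces more concrete, here is a minimal PyTorch sketch. The names (`GeGLUFeedForward`, `attention_schedule`), the dimensions, and the every-third-layer global / 128-token local window schedule are illustrative assumptions based on the description above, not ModernBERT's exact configuration or code.

```python
# A minimal sketch of two architectural pieces described above: a GeGLU
# feed-forward block and an alternating global/local attention schedule.
# Layer names, dimensions, and the local window size are illustrative
# assumptions, not ModernBERT's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Feed-forward block using a GeGLU (GELU-gated linear unit) activation."""

    def __init__(self, hidden_dim: int, intermediate_dim: int):
        super().__init__()
        # One projection produces both the gate and the value, which are
        # combined as GELU(gate) * value before projecting back down.
        self.up_proj = nn.Linear(hidden_dim, 2 * intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.gelu(gate) * value)

def attention_schedule(num_layers: int, global_every: int = 3):
    """Alternating attention: every `global_every`-th layer attends globally,
    the rest use a sliding local window (window size is an assumption)."""
    return ["global" if i % global_every == 0 else "local(128)"
            for i in range(num_layers)]

if __name__ == "__main__":
    ffn = GeGLUFeedForward(hidden_dim=768, intermediate_dim=2304)
    out = ffn(torch.randn(2, 16, 768))          # (batch, seq, hidden)
    print(out.shape)                             # torch.Size([2, 16, 768])
    print(attention_schedule(num_layers=12))     # layer-by-layer attention types
```

Local-window layers keep attention cost close to linear in sequence length, while the periodic global layers preserve long-range information flow, which is what lets the same stack serve both short and long contexts.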
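The unpadding idea can likewise be illustrated with a toy example. The helpers below (`unpad`, `repad`) are hypothetical and only show the core bookkeeping: non-padding tokens are packed into one flat sequence plus per-sequence offsets, the kind of layout that variable-length attention kernels can consume so that no compute is wasted on padding.

```python
# A simplified illustration of unpadding: instead of processing a (batch, max_len)
# tensor full of padding, the non-pad tokens are concatenated into one packed
# sequence plus bookkeeping offsets. This is a toy sketch, not ModernBERT's code.
import torch

def unpad(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Pack non-padding tokens into a single 1-D tensor.

    Returns the packed tokens, the flat indices of the kept positions, and the
    cumulative sequence lengths needed to recover per-sequence boundaries.
    """
    seq_lens = attention_mask.sum(dim=1)                      # tokens per sequence
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    packed = input_ids.flatten()[indices]                     # padding removed
    cu_seqlens = torch.nn.functional.pad(seq_lens.cumsum(0), (1, 0))
    return packed, indices, cu_seqlens

def repad(packed: torch.Tensor, indices: torch.Tensor, batch: int, max_len: int):
    """Scatter packed tokens back into a padded (batch, max_len) layout."""
    flat = packed.new_zeros(batch * max_len)
    flat[indices] = packed
    return flat.view(batch, max_len)

if __name__ == "__main__":
    ids = torch.tensor([[5, 6, 7, 0, 0],
                        [8, 9, 0, 0, 0]])
    mask = (ids != 0).long()
    packed, idx, cu = unpad(ids, mask)
    print(packed.tolist())        # [5, 6, 7, 8, 9] -- no padding tokens
    print(cu.tolist())            # [0, 3, 5] -- sequence boundaries
    print(repad(packed, idx, 2, 5).tolist())
```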
Empirical Results
ModernBERT outperforms its predecessors across a broad range of benchmarks:
- Natural Language Understanding (NLU): On the GLUE benchmark, ModernBERT-base surpasses all previous base-sized models, including DeBERTaV3-base, which had gone unchallenged since its release.
- Information Retrieval (IR): In both the single-vector (DPR-style) and multi-vector (ColBERT-style) settings, ModernBERT posts the strongest BEIR scores among comparable encoders, underscoring its retrieval ability across setups (a toy comparison of the two scoring schemes follows this list).
- Code Retrieval: Training on code gives the model an edge on programming-related tasks, as evidenced by top performance on the CodeSearchNet benchmark.
- Long-Context Retrieval: Unlike many of its predecessors, ModernBERT stays efficient and accurate in long-context scenarios, thanks to its native support for sequences of up to 8,192 tokens.
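To clarify what the single-vector and multi-vector settings mean in practice, the sketch below contrasts DPR-style dot-product scoring of one pooled embedding per text with ColBERT-style MaxSim scoring over per-token embeddings. The function names are assumptions for illustration, and the random tensors stand in for real encoder outputs.

```python
# Single-vector (DPR-style) scoring compares one pooled embedding per text,
# while multi-vector (ColBERT-style) scoring keeps per-token embeddings and
# sums each query token's best match over the document (MaxSim).
import torch
import torch.nn.functional as F

def single_vector_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """DPR-style: one vector per text, scored by dot product."""
    return query_emb @ doc_emb

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take its best cosine
    similarity over all document tokens, then sum over query tokens."""
    q = F.normalize(query_tokens, dim=-1)       # (num_q_tokens, dim)
    d = F.normalize(doc_tokens, dim=-1)         # (num_d_tokens, dim)
    sim = q @ d.T                               # token-level similarity matrix
    return sim.max(dim=1).values.sum()

if __name__ == "__main__":
    dim = 768
    # Pooled embeddings (e.g. mean-pooled encoder output) for the DPR setting.
    print(single_vector_score(torch.randn(dim), torch.randn(dim)).item())
    # Per-token embeddings for the ColBERT setting.
    print(maxsim_score(torch.randn(12, dim), torch.randn(200, dim)).item())
```

The multi-vector setting costs more at search time but preserves token-level matching signals, which is why the two settings are evaluated separately.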
Implications and Future Directions
The design and results from ModernBERT have several implications for both practical applications and future AI research:
- Scalability: ModernBERT's efficient architecture supports deployment in production environments with hardware constraints, which is particularly beneficial for applications relying on long-context reasoning or high-volume data processing.
- Further Optimization: Given ModernBERT's modular architecture, future work could explore alternative training objectives, such as replaced-token detection, which may further improve performance on classification tasks.
- Cross-Domain Applications: By incorporating datasets beyond natural language, such as programming data, ModernBERT sets the stage for developing models capable of handling increasingly diverse datasets.
- Adaptability: ModernBERT underscores the value of adaptable architectures capable of leveraging hardware-specific enhancements, pointing towards a trend in designing more hardware-aware models.
In conclusion, ModernBERT marks a clear step forward for encoder-only models, improving both performance and efficiency across a wide spectrum of NLP tasks. This makes it a strong candidate for real-world applications that demand computational efficiency alongside broad task coverage.