ModernBERT-large: Long-Context Encoder

Updated 27 November 2025
  • ModernBERT-large is a family of encoder-only Transformer models optimized for long-context natural language understanding, with context lengths of up to 16k tokens.
  • It integrates advanced techniques such as rotary positional embeddings, GeGLU nonlinearity, and alternating local/global attention to enhance efficiency.
  • Pretrained on diverse, large-scale corpora, ModernBERT-large achieves state-of-the-art results across NLU, biomedical, and code-related tasks.

ModernBERT-large denotes a family of contemporary, encoder-only Transformer architectures purpose-built for efficient, high-quality natural language understanding (NLU) and retrieval with native support for long-context sequences (up to 8 192–16 384 tokens). ModernBERT-large models are architecturally distinguished by incremental yet impactful enhancements over classic BERT: rotary positional embeddings (RoPE), GeGLU nonlinearity, alternating local/global efficient attention patterns, and systematic deployment of memory-saving kernels such as FlashAttention. These advances, combined with large, diverse corpora and BPE-based tokenization, establish ModernBERT-large as the state-of-the-art encoder backbone for a broad spectrum of languages and domains, including code, biomedical, clinical, Finnish, Japanese, and instruction-tuned English (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025, Reunamo et al., 12 Nov 2025, Clavié et al., 6 Feb 2025).

1. Architectural Foundations and Variants

ModernBERT-large architectures typically comprise 24–28 Transformer layers with a hidden size of 1 024 and 16 self-attention heads (the Japanese variant is a smaller, base-scale configuration). The principal variants and their key architectural parameters are summarized below:

| Model | #Layers | Hidden Size | FFN Dim | #Heads | Max Len | Params | Core Attn |
|---|---|---|---|---|---|---|---|
| ModernBERT-large | 28 | 1 024 | 2 624 | 16 | 8 192 | 395 M | Local/Global, RoPE |
| Clinical ModernBERT | 24 | 1 024 | 4 096 | 16 | 8 192 | 396 M | Local/Global, RoPE |
| Finnish ModernBERT-large | 28 | 1 024 | 2 624 | 16 | 16 384 | 401 M | Local/Global, RoPE |
| Japanese ModernBERT-large | 12 | 768 | 3 072 | 12 | 8 192 | 187 M | Local/Global, RoPE |

The architectural modifications directly address memory and compute bottlenecks in extended context, achieving efficient scaling on both consumer and datacenter hardware (Warner et al., 18 Dec 2024, Reunamo et al., 12 Nov 2025).
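
The dimensions in the table above can be inspected programmatically. A minimal sketch, assuming the public HuggingFace checkpoint answerdotai/ModernBERT-large and a transformers release with ModernBERT support; attribute names are the generic ones shared by most encoder configurations:

```python
# Inspect ModernBERT-large architecture hyperparameters via the standard
# HuggingFace configuration API; values should match the table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("answerdotai/ModernBERT-large")

print("layers:          ", config.num_hidden_layers)        # table: 28
print("hidden size:     ", config.hidden_size)              # table: 1 024
print("FFN dim:         ", config.intermediate_size)        # table: 2 624
print("attention heads: ", config.num_attention_heads)      # table: 16
print("max positions:   ", config.max_position_embeddings)  # table: 8 192
```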

2. Pretraining Data Regimes and Tokenization

ModernBERT-large models are pretrained on massive and domain-diverse corpora, leveraging byte-pair encoding (BPE) tokenization strategies with vocabulary sizes optimized for the target language and domain.

  • General-domain: Web, books, code repositories, and scientific articles, amounting to up to 2 trillion tokens for the English-centric model (Warner et al., 18 Dec 2024).
  • Domain-specific: Biomedical and clinical ModernBERT variants are further tuned on PubMed abstracts, MIMIC-IV notes, and structured medical ontologies (ICD, CPT, RxNorm) (Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025).
  • Multilingual and regional: Finnish ModernBERT-large utilizes 390 B tokens across Finnish, Swedish, English, code, and minor regional languages with data mixture and oversampling schedules (Reunamo et al., 12 Nov 2025). Japanese ModernBERT-large draws from 0.69 T Japanese tokens including web, Wikipedia, legal, and academic corpora (Sugiura et al., 22 Apr 2025).

Tokenization universally employs BPE or customized SentencePiece/BPE hybrids, with vocabulary sizes ranging from ∼50 000 to 99 574 subwords according to linguistic characteristics and deduplication strategies (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Reunamo et al., 12 Nov 2025).
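
A minimal sketch of how these tokenizer-level choices surface in practice, assuming the English checkpoint answerdotai/ModernBERT-large (the domain- and language-specific variants ship their own vocabularies):

```python
# Inspect BPE vocabulary size and subword segmentation for the English model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")

print("vocabulary size:", tokenizer.vocab_size)   # ~50k subwords for English
print(tokenizer.tokenize("Rotary positional embeddings scale to long contexts."))
```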

3. Long-Context Adaptations and Training Procedures

Expansion of native context length is core to ModernBERT-large's design. Models are trained (often in multiple stages) to maximize both short- and long-context coverage.

ModernBERT-large models routinely support up to 8k–16k context for both pretraining and inference, contrasting with the 512-token ceiling of classic BERT.
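
As a concrete illustration, the sketch below runs masked-token prediction over an input of several thousand tokens in a single pass; it assumes the public answerdotai/ModernBERT-large checkpoint, a synthetic filler document, and sufficient memory for the chosen sequence length.

```python
# Long-context masked-LM inference: encode up to 8 192 tokens at once and
# read off predictions for a [MASK] position.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

query = "ModernBERT supports long [MASK] windows. "
filler = " ".join(["Encoder models process entire documents in a single pass."] * 500)
inputs = tokenizer(query + filler, truncation=True, max_length=8192,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```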

4. Empirical Performance and Evaluation

ModernBERT-large sets or approaches state-of-the-art results in classification, retrieval, and code search across NLU, biomedical, and code-centric tasks.

| Task | ModernBERT-large | Baseline | Relative Position |
|---|---|---|---|
| GLUE (avg. accuracy) | 92.1 | BERT-large (85.6) | Highest (NLU) |
| BEIR retrieval (nDCG) | 44.0 | BERT-large (38.9) | SOTA (DPR) |
| StackQA (MRR@10) | 83.9 | RoBERTa-large (69) | Highest (code) |
| MLDR (nDCG@10, Finnish) | 0.343 | RoBERTa-base (0.226) | SOTA (Finnish) |
| JGLUE (Japanese, avg.) | 89.2 | Best baseline (91.6) | Competitive |
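
The classification results above come from standard fine-tuning with a task-specific head on the pretrained encoder. A hedged sketch of such a setup, using SST-2 as a stand-in GLUE task and illustrative hyperparameters rather than the values reported in the cited papers:

```python
# GLUE-style fine-tuning sketch: classification head on the pretrained encoder.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("glue", "sst2").map(
    lambda batch: tokenizer(batch["sentence"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-large-sst2",
                           learning_rate=2e-5,            # illustrative values
                           per_device_train_batch_size=16,
                           num_train_epochs=2),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```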

In biomedical and clinical settings, BioClinical ModernBERT-large achieved new state-of-the-art median F₁ on ChemProt (90.8) and Phenotype (60.8), with highest NER scores on DEID (83.8) (Sounack et al., 12 Jun 2025). On clinical NER, Clinical ModernBERT matched or surpassed prior bests on i2b2 2012/2014 datasets (Lee et al., 4 Apr 2025). Japanese ModernBERT-large exhibited strong fill-mask accuracy (top-1 ≥80%) on cloze tasks, though it did not surpass larger baselines on sentence-level evaluations (Sugiura et al., 22 Apr 2025).

When instruction-tuned and evaluated zero-shot as a generative (cloze-style) predictor, ModernBERT-large reaches 43.1% MMLU accuracy at 0.4 B parameters, substantially outperforming comparably sized decoder-only LLMs (Clavié et al., 6 Feb 2025).
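
Both the cloze results and this zero-shot usage rest on the same masked-token prediction interface. A minimal sketch with the HuggingFace fill-mask pipeline (model id assumed as above; the prompt is purely illustrative):

```python
# Cloze-style prediction via the fill-mask pipeline; answer-choice scoring for
# zero-shot multiple-choice evaluation builds on these token probabilities.
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-large")
for candidate in fill("The capital of Finland is [MASK]."):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}")
```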

5. Efficiency, Memory, and Inference Dynamics

ModernBERT-large is engineered for high-speed inference and memory efficiency on both commodity and datacenter GPUs: memory-saving FlashAttention kernels and the alternating local/global attention pattern keep throughput and memory use favorable as sequence length grows (Warner et al., 18 Dec 2024).
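
A rough, hedged way to probe these dynamics locally (timings depend on GPU, dtype, and attention kernel; the model id is the public English checkpoint):

```python
# Time one forward pass at increasing sequence lengths to observe how cost
# scales with context. Results are hardware-dependent and purely indicative.
import time
import torch
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
model = AutoModel.from_pretrained("answerdotai/ModernBERT-large",
                                  torch_dtype=dtype).to(device).eval()

for seq_len in (512, 2048, 8192):
    input_ids = torch.randint(1000, 2000, (1, seq_len), device=device)
    attention_mask = torch.ones_like(input_ids)
    start = time.perf_counter()
    with torch.no_grad():
        model(input_ids=input_ids, attention_mask=attention_mask)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{seq_len:>5} tokens: {time.perf_counter() - start:.3f}s")
```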

6. Sentence Embedding Behavior and Semantic Analysis

Mean-pooled sentence embeddings from ModernBERT-large align with expected trends for robust encoders (Sugiura et al., 22 Apr 2025):

  • Alignment (ℓ_{align}) measures how close the embeddings of positive (semantically paired) examples are, while uniformity (ℓ_{uniform}) measures how evenly the embeddings spread over the representation space (see the sketch after this list).
  • During pretraining, uniformity increases rapidly while alignment degrades, before both settle into a stable oscillation bounded by the "ModernBERT trajectory."
  • Cosine-similarity histograms show that extended pretraining eventually increases the overlap between example categories, mirroring other ModernBERT variants.
  • In Clinical ModernBERT, t-SNE projections of code embeddings exhibit semantically coherent clustering, attributable to ontology-aware masking (Lee et al., 4 Apr 2025).
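
A minimal sketch of these two quantities in the Wang–Isola sense, computed on mean-pooled, L2-normalized embeddings; the paired sentences below are toy stand-ins for the positive pairs used in the cited analysis:

```python
# Alignment (mean squared distance of positive pairs) and uniformity
# (log-mean Gaussian potential over all distinct pairs) on pooled embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state             # [B, T, H]
    mask = batch["attention_mask"].unsqueeze(-1)               # ignore padding
    pooled = (hidden * mask).sum(1) / mask.sum(1)              # mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

a = embed(["The cat sat on the mat.", "The model encodes a sentence."])
b = embed(["A cat was sitting on a mat.", "A sentence is encoded by the model."])

alignment = (a - b).norm(dim=1).pow(2).mean()                  # lower = tighter positives
sq_dists = torch.pdist(torch.cat([a, b])).pow(2)               # all distinct pairs
uniformity = torch.log(torch.exp(-2 * sq_dists).mean())        # lower = more uniform
print(float(alignment), float(uniformity))
```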

7. Reproducibility, Model Access, and Deployment Guidance

ModernBERT-large and its principal variants are fully open, with model weights and code repositories publicly accessible.

| Model/Variant | Weights (HuggingFace) | Training Code (GitHub) |
|---|---|---|
| English ModernBERT-large | answerdotai/ModernBERT-large | github.com/AnswerDotAI/ModernBERT |
| Japanese ModernBERT-large | LLM-jp/LLM-jp-modernbert-base | github.com/LLM-jp/LLM-jp-modernbert |
| Clinical/BioClinical ModernBERT | lindvalllab/BioClinical-ModernBERT | github.com/lindvalllab/BioClinical-ModernBERT |
| Finnish ModernBERT-large | see (Reunamo et al., 12 Nov 2025) | github.com/huggingface/transformers |

Recommended deployment specifies PyTorch ≥2.x, HuggingFace Transformers, FlashAttention, and GPU hardware appropriate for the desired sequence length (24/40/80 GB VRAM for 2k/4k/8k tokens, respectively) (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Reunamo et al., 12 Nov 2025). Fine-tuning batch sizes, learning rates, and context-length strategies are well-documented per variant and use case.
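
A minimal loading sketch consistent with this guidance: bf16 weights plus the FlashAttention 2 kernel when a compatible GPU and the flash-attn package are available. The attn_implementation flag is the generic transformers switch, assumed here to be supported for this architecture.

```python
# Deployment-oriented loading: bf16 weights and FlashAttention when available,
# falling back to the default attention implementation on CPU.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
device = "cuda" if torch.cuda.is_available() else "cpu"

kwargs = {"torch_dtype": torch.bfloat16 if device == "cuda" else torch.float32}
if device == "cuda":
    kwargs["attn_implementation"] = "flash_attention_2"   # requires flash-attn

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, **kwargs).to(device).eval()
```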


ModernBERT-large thus defines the prevailing paradigm in long-context encoder models across languages and domains, structurally characterized by RoPE, alternate attention, and memory-efficient design. It achieves state-of-the-art or near state-of-the-art results on evaluation suites spanning NLU, biomedical/clinical text, retrieval, and code, while maintaining hardware scalability and algorithmic transparency (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025, Reunamo et al., 12 Nov 2025, Clavié et al., 6 Feb 2025).
