ModernBERT-large: Long-Context Encoder
- ModernBERT-large is a family of encoder-only Transformer models optimized for long-context natural language understanding with support up to 16k tokens.
- It integrates advanced techniques such as rotary positional embeddings, GeGLU nonlinearity, and alternating local/global attention to enhance efficiency.
- Pretrained on diverse, large-scale corpora, ModernBERT-large achieves state-of-the-art results across NLU, biomedical, and code-related tasks.
ModernBERT-large denotes a family of contemporary, encoder-only Transformer architectures purpose-built for efficient, high-quality natural language understanding (NLU) and retrieval with native support for long-context sequences (up to 8 192–16 384 tokens). ModernBERT-large models are architecturally distinguished by incremental yet impactful enhancements over classic BERT: rotary positional embeddings (RoPE), GeGLU nonlinearity, alternating local/global efficient attention patterns, and systematic deployment of memory-saving kernels such as FlashAttention. These advances, combined with large, diverse corpora and BPE-based tokenization, establish ModernBERT-large as the state-of-the-art encoder backbone for a broad spectrum of languages and domains, including code, biomedical, clinical, Finnish, Japanese, and instruction-tuned English (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025, Reunamo et al., 12 Nov 2025, Clavié et al., 6 Feb 2025).
1. Architectural Foundations and Variants
ModernBERT-large architectures typically range from 24 to 28 Transformer layers with a hidden size of 1 024 and 16 self-attention heads. Key architectural features include:
- Rotary Positional Embeddings (RoPE): Position is encoded not by additive vectors but by rotating pairs of query/key dimensions in the complex plane: the pair (x_{2i}, x_{2i+1}) at position m is rotated by the angle mθ_i with θ_i = 10000^{−2i/d}, which supports extrapolation to longer contexts (Warner et al., 18 Dec 2024, Lee et al., 4 Apr 2025).
- GeGLU Feedforward Units: The feedforward network employs a Gated Linear Unit with GELU activation, GeGLU(x) = GELU(xW) ⊙ (xV), improving learning dynamics and parameter efficiency (Warner et al., 18 Dec 2024, Lee et al., 4 Apr 2025). Both RoPE and GeGLU are sketched in code after this list.
- Pre-LayerNorm and Bias-Free Linear Layers: LayerNorm precedes each sub-layer, and biases are removed from most linear layers, stabilizing training and gradient flow (Warner et al., 18 Dec 2024).
- Alternating Attention: Most layers utilize local sliding-window attention (e.g., a 128-token window), while every third or so layer applies global attention, reducing the per-layer cost from O(n²) to O(n·w) for sequence length n and window w in the local layers (Warner et al., 18 Dec 2024, Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025). A mask sketch appears at the end of this subsection.
- FlashAttention Kernels: Multi-head attention is computed in memory-efficient tiles, enabling exact, fused softmax–matmul with near-linear hardware memory scaling (Warner et al., 18 Dec 2024, Lee et al., 4 Apr 2025).
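A minimal PyTorch sketch of two of these components, RoPE applied to query/key tensors and a GeGLU feedforward block; the tensor shapes, fused projection layout, and dimension values are illustrative assumptions rather than the exact ModernBERT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by position-dependent angles (RoPE).
    x: (batch, seq_len, n_heads, head_dim) with an even head_dim."""
    _, seq_len, _, dim = x.shape
    # theta_i = base^(-2i/d), one frequency per dimension pair
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,f->sf", pos, inv_freq)            # (seq_len, dim/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                       # complex-plane rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class GeGLU(nn.Module):
    """GELU-gated feedforward block: GELU(xW) * (xV), then an output projection."""
    def __init__(self, hidden: int = 1024, ffn: int = 2624):
        super().__init__()
        self.wi = nn.Linear(hidden, 2 * ffn, bias=False)       # fused gate + value projection
        self.wo = nn.Linear(ffn, hidden, bias=False)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.wi(x).chunk(2, dim=-1)
        return self.wo(F.gelu(gate) * value)

# Example: rotate queries for a 16-head layer and run one GeGLU block
q = apply_rope(torch.randn(2, 128, 16, 64))
h = GeGLU()(torch.randn(2, 128, 1024))
```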
| Model | #Layers | Hidden Size | FFN Dim | #Heads | Max len | Params | Core Attn |
|---|---|---|---|---|---|---|---|
| ModernBERT-large | 28 | 1 024 | 2 624 | 16 | 8 192 | 395 M | Local/Global, RoPE |
| Clinical ModernBERT | 24 | 1 024 | 4 096 | 16 | 8 192 | 396 M | Local/Global, RoPE |
| Finnish ModernBERT-large | 28 | 1 024 | 2 624 | 16 | 16 384 | 401 M | Local/Global, RoPE |
| Japanese ModernBERT-large | 12 | 768 | 3 072 | 12 | 8 192 | 187 M | Local/Global, RoPE |
The architectural modifications directly address memory and compute bottlenecks in extended context, achieving efficient scaling on both consumer and datacenter hardware (Warner et al., 18 Dec 2024, Reunamo et al., 12 Nov 2025).
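The local/global alternation can be visualized as a banded attention mask on most layers and a dense mask on every third layer. A sketch, assuming the window size and layer ratio described above:

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    """Boolean mask where True marks allowed attention: each token sees at most
    `window` neighbors on either side."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def layer_mask(layer: int, seq_len: int, global_every: int = 3) -> torch.Tensor:
    """Every `global_every`-th layer attends globally; the rest use the local mask."""
    if layer % global_every == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    return sliding_window_mask(seq_len)

# Local layers touch O(n*w) query-key pairs instead of the O(n^2) of a global layer
n = 1024
print(layer_mask(1, n).sum().item(), "vs", layer_mask(0, n).sum().item())
```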
2. Pretraining Data Regimes and Tokenization
ModernBERT-large models are pretrained on massive and domain-diverse corpora, leveraging byte-pair encoding (BPE) tokenization strategies with vocabulary sizes optimized for the target language and domain.
- General-domain: Web, books, code repositories, and scientific articles, totaling up to 2 trillion tokens for the English-centric model (Warner et al., 18 Dec 2024).
- Domain-specific: Biomedical and clinical ModernBERT variants are further pretrained on PubMed abstracts, MIMIC-IV notes, and structured medical ontologies (ICD, CPT, RxNorm) (Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025).
- Multilingual and regional: Finnish ModernBERT-large utilizes 390 B tokens across Finnish, Swedish, English, code, and minor regional languages with data mixture and oversampling schedules (Reunamo et al., 12 Nov 2025). Japanese ModernBERT-large draws from 0.69 T Japanese tokens including web, Wikipedia, legal, and academic corpora (Sugiura et al., 22 Apr 2025).
Tokenization universally employs BPE or customized SentencePiece/BPE hybrids, with vocabulary sizes ranging from ∼50 000 to 99 574 subwords according to linguistic characteristics and deduplication strategies (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Reunamo et al., 12 Nov 2025).
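A hedged sketch of training a byte-level BPE tokenizer with the HuggingFace `tokenizers` library; the corpus path, vocabulary size, and special-token set are illustrative placeholders rather than any variant's exact recipe.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# Byte-level BPE in the spirit of the ModernBERT-style tokenizers
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=50_000,  # placeholder; published variants range from ~50k to ~100k subwords
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"],
)
tokenizer.train(["corpus.txt"], trainer)  # "corpus.txt" stands in for the pretraining text

# BERT-style post-processing: [CLS] ... [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", tokenizer.token_to_id("[CLS]")),
                    ("[SEP]", tokenizer.token_to_id("[SEP]"))],
)
tokenizer.save("modernbert-style-tokenizer.json")
```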
3. Long-Context Adaptations and Training Procedures
Expansion of native context length is core to ModernBERT-large's design. Models are trained (often in multiple stages) to maximize both short- and long-context coverage.
- Context Extension Schedules: Standard training is performed first at moderate sequence lengths (e.g., 1 024 tokens) followed by further training at long context (e.g., up to 16 384 tokens for Finnish, 8 192 for core and domain-specific variants) (Warner et al., 18 Dec 2024, Reunamo et al., 12 Nov 2025, Sugiura et al., 22 Apr 2025).
- Attention Optimization: Implementation combines alternating local/global attention, FlashAttention, and (optionally) unpadding to ensure memory and compute efficiency (Warner et al., 18 Dec 2024, Sounack et al., 12 Jun 2025).
- Masking Strategies: Masked Language Modeling (MLM) is universal, with typical 30% mask rate (decayed in domain-specialized settings), and no Next Sentence Prediction (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Sounack et al., 12 Jun 2025). Token-aware masking is applied to biomedical entities in domain-specific variants (Lee et al., 4 Apr 2025).
- Training Schedules: AdamW or StableAdamW optimizers with trapezoidal warmup-stable-decay schedulers are standard; a schedule sketch follows this list. Checkpointing, full reproducibility scripts, and open-source model releases are characteristic (Warner et al., 18 Dec 2024, Lee et al., 4 Apr 2025, Sugiura et al., 22 Apr 2025).
- Hardware: Large-scale parallelization is standard (8–64 GPUs or MI250x), with reliance on PyTorch ≥2.x, FlashAttention, and DDP/ZeRO optimizers for massive token throughput (Warner et al., 18 Dec 2024, Reunamo et al., 12 Nov 2025).
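A minimal sketch of the trapezoidal warmup-stable-decay multiplier mentioned above, implemented as a PyTorch `LambdaLR`; the step counts and learning rate are placeholders, and linear decay is used here for simplicity.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def trapezoidal_schedule(optimizer, warmup_steps: int, stable_steps: int, decay_steps: int):
    """Warmup-Stable-Decay multiplier: linear warmup, constant plateau, linear decay to zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        if step < warmup_steps + stable_steps:
            return 1.0
        progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
        return max(0.0, 1.0 - progress)
    return LambdaLR(optimizer, lr_lambda)

# Placeholder model and step counts; StableAdamW is a drop-in alternative from third-party packages
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-5)
sched = trapezoidal_schedule(opt, warmup_steps=3_000, stable_steps=900_000, decay_steps=100_000)
for _ in range(10):          # stand-in for the training loop
    opt.step()
    sched.step()
```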
ModernBERT-large models routinely support up to 8k–16k context for both pretraining and inference, contrasting with the 512-token ceiling of classic BERT.
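The 30% masked-language-modeling objective noted above maps directly onto the standard HuggingFace masking collator; a minimal sketch, with the English checkpoint id assumed for the tokenizer.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Checkpoint id assumed; any ModernBERT-family tokenizer works the same way
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")

# MLM only (no Next Sentence Prediction), with a 30% masking rate
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.30)

batch = collator([tokenizer("Long-context encoders mask tokens during pretraining.")])
print(batch["input_ids"].shape, (batch["labels"] != -100).sum().item(), "positions masked")
```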
4. Empirical Performance and Evaluation
ModernBERT-large sets or approaches state-of-the-art results in classification, retrieval, and code search across NLU, biomedical, and code-centric tasks.
| Task | ModernBERT-large | Baseline | Relative Standing |
|---|---|---|---|
| GLUE (avg. accuracy) | 92.1 | BERT-large (85.6) | Highest (NLU) |
| BEIR Retrieval (nDCG) | 44.0 | BERT-large (38.9) | SOTA (DPR) |
| StackQA (MRR@10) | 83.9 | RoBERTa-large (69) | Highest (code) |
| MLDR (nDCG@10, FI) | 0.343 | RoBERTa-base (0.226) | SOTA (Finnish) |
| JGLUE (JP, avg) | 89.2 | Best baseline (91.6) | Competitive |
In biomedical and clinical settings, BioClinical ModernBERT-large achieved new state-of-the-art median F₁ on ChemProt (90.8) and Phenotype (60.8), with highest NER scores on DEID (83.8) (Sounack et al., 12 Jun 2025). On clinical NER, Clinical ModernBERT matched or surpassed prior bests on i2b2 2012/2014 datasets (Lee et al., 4 Apr 2025). Japanese ModernBERT-large exhibited strong fill-mask accuracy (top-1 ≥80%) on cloze tasks, though it did not surpass larger baselines on sentence-level evaluations (Sugiura et al., 22 Apr 2025).
In zero-shot generative evaluation, instruction-tuned ModernBERT-large attains clearly superior MMLU accuracy (43.1% at 0.4B parameters), substantially outperforming comparably sized decoder LLMs (Clavié et al., 6 Feb 2025).
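Scores such as the GLUE row above are typically obtained by standard full fine-tuning of the encoder with a classification head; a hedged sketch with the HuggingFace `Trainer` on SST-2, where the dataset choice, hyperparameters, and checkpoint id are illustrative assumptions rather than the protocol of the cited papers.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-large"          # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# SST-2 as a representative GLUE task
raw = load_dataset("glue", "sst2")
enc = raw.map(lambda ex: tokenizer(ex["sentence"], truncation=True, max_length=512), batched=True)

args = TrainingArguments(
    output_dir="modernbert-large-sst2",
    per_device_train_batch_size=32,
    learning_rate=2e-5,                            # illustrative; per-task sweeps are reported
    num_train_epochs=3,
    bf16=True,
)
Trainer(model=model, args=args, train_dataset=enc["train"],
        eval_dataset=enc["validation"], tokenizer=tokenizer).train()
```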
5. Efficiency, Memory, and Inference Dynamics
ModernBERT-large is engineered for high-speed inference and memory efficiency, validated on commodity and datacenter GPUs:
- Inference Throughput: On an RTX 4090, ModernBERT-large achieves 52.3 kTok/s on long inputs (8k tokens), comparable to its short-context throughput and roughly 2× BERT-large at long context; the maximum supported long-context batch size reaches 48 (Warner et al., 18 Dec 2024).
- Memory Optimization: FlashAttention 2/3 and unpadding minimize memory consumption, and alternating local/global attention restricts full global attention to roughly one third of layers (Warner et al., 18 Dec 2024, Sounack et al., 12 Jun 2025).
- Fine-Tuning Recommendations: Batch sizes of 16–77 are suggested, with mixed precision and aggressive sequence packing strongly recommended for long-context deployment (Warner et al., 18 Dec 2024, Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025). A hedged inference sketch follows this list.
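A long-context inference sketch reflecting these recommendations; the checkpoint id, batch size, and single-GPU assumption are illustrative, and the installed attention backend determines whether FlashAttention kernels are actually used.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"          # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = (AutoModelForMaskedLM
         .from_pretrained(model_id, torch_dtype=torch.bfloat16)   # reduced precision
         .to("cuda").eval())

texts = ["A very long clinical note ..."] * 8       # placeholder long-context batch
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=8192, return_tensors="pt").to("cuda")

with torch.inference_mode():
    logits = model(**batch).logits                  # (batch, seq_len, vocab_size)
print(logits.shape)
```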
6. Sentence Embedding Behavior and Semantic Analysis
Mean-pooled sentence embeddings from ModernBERT-large align with expected trends for robust encoders (Sugiura et al., 22 Apr 2025):
- Alignment (ℓ_{align}) and Uniformity (ℓ_{uniform}) track, respectively, how close positive pairs remain in embedding space and how evenly embeddings spread over the unit hypersphere; both are sketched in code after this list.
- During pretraining, uniformity improves rapidly while alignment degrades; both then settle into a stable oscillation bounded by the "ModernBERT trajectory."
- Cosine-similarity histograms show that extended pretraining eventually produces greater overlap between example categories, mirroring behavior observed in other ModernBERT variants.
- In Clinical ModernBERT, t-SNE projections of code embeddings exhibit semantically coherent clustering, attributable to ontology-aware masking (Lee et al., 4 Apr 2025).
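For reference, the alignment and uniformity diagnostics follow the usual definitions over L2-normalized embeddings; a minimal sketch with mask-aware mean pooling, where the random tensors stand in for pooled ModernBERT outputs.

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling over token embeddings: one vector per sentence."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """l_align: mean distance between normalized positive pairs (lower is better)."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """l_uniform: log of the mean Gaussian potential over all pairs (lower is better)."""
    x = F.normalize(x, dim=-1)
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

emb_a, emb_b = torch.randn(64, 1024), torch.randn(64, 1024)   # stand-ins for pooled embeddings
print(alignment(emb_a, emb_b).item(), uniformity(emb_a).item())
```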
7. Reproducibility, Model Access, and Deployment Guidance
ModernBERT-large and its principal variants are fully open, with model weights and code repositories publicly accessible.
| Model/Variant | Weights (HuggingFace) | Training Code (GitHub) |
|---|---|---|
| English ModernBERT-large | answerdotai/ModernBERT-large | github.com/AnswerDotAI/ModernBERT |
| Japanese ModernBERT-large | LLM-jp/LLM-jp-modernbert-base | github.com/LLM-jp/LLM-jp-modernbert |
| Clinical/BioClinical | lindvalllab/BioClinical-ModernBERT | github.com/lindvalllab/BioClinical-ModernBERT |
| Finnish ModernBERT-large | (see (Reunamo et al., 12 Nov 2025)) | github.com/huggingface/transformers |
Recommended deployment specifies PyTorch ≥2.x, HuggingFace Transformers, FlashAttention, and GPU hardware appropriate for the desired sequence length (24/40/80 GB VRAM for 2k/4k/8k tokens, respectively) (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Reunamo et al., 12 Nov 2025). Fine-tuning batch sizes, learning rates, and context-length strategies are well-documented per variant and use case.
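For a quick deployment check, a minimal fill-mask sketch with HuggingFace Transformers; the checkpoint id follows the access table above and is assumed to be the published Hub name.

```python
from transformers import pipeline

# Checkpoint id taken from the access table above (assumed)
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-large")

for pred in fill("ModernBERT-large processes sequences of up to 8192 [MASK]."):
    print(f"{pred['token_str']!r:>12}  score={pred['score']:.3f}")
```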
ModernBERT-large thus defines the prevailing paradigm in long-context encoder models across languages and domains, structurally characterized by RoPE, alternating local/global attention, and memory-efficient design. It achieves state-of-the-art or near state-of-the-art results on evaluation suites spanning NLU, biomedical/clinical text, retrieval, and code, while maintaining hardware scalability and algorithmic transparency (Warner et al., 18 Dec 2024, Sugiura et al., 22 Apr 2025, Lee et al., 4 Apr 2025, Sounack et al., 12 Jun 2025, Reunamo et al., 12 Nov 2025, Clavié et al., 6 Feb 2025).