ESM Series in Protein Modeling
- ESM Series is a family of Transformer-based protein language models, scaling from 650M to 98B parameters, designed for unsupervised learning on protein sequences.
- These models incorporate innovations like folding heads and chain-of-thought generative design, enabling direct 3D structure prediction and enhanced mutation effect mapping.
- They utilize self-supervised masked language modeling on vast protein datasets, achieving state-of-the-art performance in perplexity, contact prediction, and structure accuracy.
"ESM Series" may refer to any of several advanced scientific, engineering, or statistical frameworks, each denoted by the initialism "ESM," spanning domains such as protein language modeling (Evolutionary-Scale Modeling), climate science (Earth System Models), advanced wireless communications (Enhanced Spatial Modulation), near-field acoustic imaging (Equivalent Source Method), text-to-SQL evaluation (Exact Set Matching), and observational astronomy (Enhanced Seeing Mode). This article concentrates on the ESM Series in protein modeling, as presented in survey literature and corroborated by original sources. Subsections treat key advances, architectures, training protocols, performance, and domain-specific impact of ESM in the protein LLM context.
1. Overview and Historical Development
The ESM (Evolutionary-Scale Modeling) Series is a family of Transformer-based protein LLMs that scale masked language modeling to hundreds of millions or billions of parameters over evolutionary-scale databases of protein sequences. Emerging with ESM-1b, the approach leveraged the Transformer encoder to learn unsupervised representations that encapsulate structural and functional information from UniParc-scale corpora. The generational progression from ESM-1b (650 M parameters) through ESM-2 (up to 15 B) to ESM-3 (98 B) marks continual expansion in model and training-corpus size, as well as increasing multimodal sophistication (incorporating structure and annotation) (Xiao et al., 21 Feb 2025). Parallel lines such as ESM-1v (variant prediction), ESM-IF (inverse folding), and ESMFold further extend the series to single-sequence structure prediction, mutation effect estimation, and protein design.
2. Model Architectures and Innovations
The canonical ESM-1b model adopts a vanilla Transformer encoder architecture, scaling to 33 blocks (hidden size 1,280; 20 heads; 5,120-dimensional MLP) with roughly 669 M parameters (Xiao et al., 21 Feb 2025). No architectural modifications to self-attention are introduced in the 1b/1v lines. ESM-2 serves as the backbone of ESMFold, which augments it with a "folding head": a lightweight structure prediction module that maps learned representations to Cα coordinates using invariant point attention, enabling end-to-end single-sequence folding. ESM-3 scales to ≈98 B parameters and operates as a multimodal, generative chain-of-thought Transformer trained on sequence, structure, and annotation. ESM-IF departs from the standard encoder by employing a Geometric Vector Perceptron (GVP) for structure encoding and a graph neural network component for inverse folding, coupled to a specialized autoregressive decoder.
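For orientation, the quoted encoder hyperparameters correspond to a standard pre-layer-norm Transformer stack. The minimal PyTorch sketch below mirrors those values (33 layers, hidden size 1,280, 20 heads, 5,120-dimensional MLP) but is only an approximation: positional handling, the tied language-modeling head, and ESM-specific tokenization are simplified, and the class name and vocabulary size are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the ESM-1b-scale encoder described above.
# Hyperparameters follow the quoted values; tokenizer, positional handling,
# and LM-head weight tying are simplified. Instantiation allocates the full
# ~650 M parameters (~2.6 GB in float32).
VOCAB_SIZE = 33          # ~20 amino acids plus special/ambiguity tokens (assumed)
D_MODEL, N_HEADS, D_FF, N_LAYERS = 1280, 20, 5120, 33

class ToyProteinEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=D_FF,
            batch_first=True, norm_first=True)  # pre-LN, as in most modern PLMs
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)  # MLM logits over residues

    def forward(self, tokens, padding_mask=None):
        h = self.encoder(self.embed(tokens), src_key_padding_mask=padding_mask)
        return self.lm_head(h)  # (batch, length, vocab)

model = ToyProteinEncoder()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```

Counting the parameters of this stack gives on the order of 650 M, in line with the figure quoted above once embeddings and biases are included.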
Distinct innovations include:
- Universal masked language modeling (MLM) loss over single-sequence data, without requiring explicit multiple-sequence alignments (MSAs).
- Direct prediction of 3D structures from a single sequence at exceptional throughput (ESMFold); a usage sketch follows this list.
- ESM-3's chain-of-thought design enabling semantically conditioned, multi-property protein design.
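As a concrete illustration of single-sequence folding, the snippet below follows the usage pattern documented in the fair-esm repository (facebookresearch/esm) for ESMFold. The package extras, entry point, and GPU requirement are taken from that documentation and may change between releases, so treat this as a hedged sketch rather than a guaranteed API.

```python
# Single-sequence structure prediction with ESMFold, following the usage
# documented in the fair-esm repository (pip install "fair-esm[esmfold]").
# Requires a CUDA GPU with sufficient memory; entry points may change
# between releases.
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # predicted structure as PDB text

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```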
3. Training Data, Pretraining Objectives, and Fine-tuning
All ESM Series models are trained using self-supervised masked language modeling. For a protein sequence $x = (x_1, \ldots, x_L)$ with mask set $M$, the objective is $\mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_M\big[\sum_{i \in M} -\log p(x_i \mid x_{\setminus M})\big]$, i.e., each masked residue is predicted from the unmasked context. ESM-1b and ESM-2 pretrain on 250 M and 86 M non-redundant UniProt/UniRef50 sequences respectively, masking 15% of residues. ESM-1v focuses on mutation prediction, with pretraining inherited from ESM-1b and downstream zero-shot evaluation on mutation effects. ESM-IF's objective is inverse folding: predicting sequence from 3D structure, using AlphaFold2-generated structural databases.
ESM-3 extends the MLM to handle joint inputs—sequence, structure, and Gene Ontology-style annotation—and targets generative design. Across versions, fine-tuning is dataset and task-specific, ranging from deep mutational scanning benchmarks to structural, functional, or evolutionary property prediction.
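A minimal sketch of the 15% masking objective described above, written against a generic PyTorch masked LM (for example, the toy encoder in Section 2). Real ESM training typically adds BERT-style token corruption, excludes special tokens from masking, and runs at much larger batch sizes; all of that is omitted here, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, tokens, mask_token_id, mask_prob=0.15):
    """Masked-language-modeling loss: mask ~15% of residues and predict them.

    `model` is any callable mapping token ids (batch, L) to logits
    (batch, L, vocab). Random/keep corruption and special-token handling
    used in real ESM pretraining are omitted for brevity.
    """
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob  # mask set M
    corrupted = tokens.clone()
    corrupted[mask] = mask_token_id                  # replace masked residues with <mask>
    logits = model(corrupted)                        # (batch, L, vocab)
    targets = tokens.clone()
    targets[~mask] = -100                            # ignore unmasked positions in the loss
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```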
4. Evaluation Protocols and Benchmark Results
ESM Series evaluation encompasses language modeling perplexity, structure prediction accuracy, contact precision, and biological function inference. Representative metrics from (Xiao et al., 21 Feb 2025):
- Masked language modeling perplexity: ESM-1b achieves PPL ≈ 6–7 on held-out UniParc splits.
- Masked-residue recovery: ~46% of masked positions correctly predicted at 15% masking.
- Zero-shot mutation effect: Spearman ρ ≈ 0.50 (ESM-1b), improving to ρ ≈ 0.60 (ESM-1v), outperforming LSTM and non-LLM baselines.
- Contact prediction: P@L/5 ≈ 0.61 (ESM-1b).
- ESMFold (ESM-2) achieves TM-score 0.74 in CASP14 single-sequence structure prediction, with ~1s runtime for 300-residue chains.
- ESM-3's generative design delivers >50% functionality (e.g., fluorescence) in distant sequence families and emission-peak accuracy within ~20 nm for 70% of novel designs.
Comparison with contemporaneous protein LLMs (e.g., ProtTrans, ProteinBERT, MSA Transformer) shows that ESM models systematically outperform them in perplexity, contact prediction, or zero-shot transfer.
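The zero-shot mutation-effect results above are commonly obtained with a masked-marginal score: the mutated position is masked and the model's log-probability of the mutant residue is compared with that of the wild type. The sketch below assumes a masked LM with the (batch, length, vocab) logits interface of the earlier sketches; the function name and index arguments are illustrative.

```python
import torch

def masked_marginal_score(model, tokens, pos, wt_idx, mut_idx, mask_token_id):
    """Zero-shot variant effect score: log p(mutant) - log p(wild type)
    at the mutated position, with that position masked.

    `tokens` is a (1, L) LongTensor; `wt_idx`/`mut_idx` are vocabulary indices
    of the wild-type and mutant residues. Higher scores predict more
    tolerated (fitter) substitutions.
    """
    masked = tokens.clone()
    masked[0, pos] = mask_token_id                        # hide the mutated site
    with torch.no_grad():
        log_probs = torch.log_softmax(model(masked)[0, pos], dim=-1)
    return (log_probs[mut_idx] - log_probs[wt_idx]).item()

# Spearman rho against measured fitness, as reported above (requires scipy):
# from scipy.stats import spearmanr
# rho, _ = spearmanr(predicted_scores, experimental_fitness)
```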
5. Domain Applications and Limitations
Notable applications of the ESM series include:
- Zero-shot assessment of variant fitness in human and viral proteins (ESM-1b, ESM-1v).
- Fast, high-throughput, single-sequence structure prediction (ESMFold).
- Inverse folding and de novo protein design (ESM-IF, ESM-3).
- Chain-of-thought driven conditional design leveraging sequence, structure, and function.
Identified limitations include:
- Computational scaling: the quadratic cost of self-attention limits tractable protein lengths, and the largest models (15–98 B parameters) require multi-GPU hardware.
- Generalization: Zero-shot variant effect predictions are inconsistent for highly divergent protein families; few-shot or MSA augmentation can be necessary.
- Structural reliance: Inverse-folding models are sensitive to structural input quality; errors propagate downstream.
- Accessibility: Parameter and hardware requirements limit deployment outside high-performance environments.
6. Recent Extensions: ESM All-Atom and Unified Molecular Modeling
ESM All-Atom (ESM-AA) extends the ESM lineage to atomistic and residue-resolution molecular tasks (Zheng et al., 5 Mar 2024). ESM-AA employs multi-scale code-switching—randomly "unzipping" residues into atom tokens—and fuses multi-scale positional encoding (discrete residue RoPE + atom-level Gaussian kernels) to enable unified protein–small molecule modeling. Pretrained on mixed protein structure and molecule data (~8 M proteins, ~19 M small molecules), ESM-AA demonstrates superior affinity prediction (KM, Davis, ESP datasets) and retains residue-level accuracy on TAPE benchmarks. Ablation experiments confirm that each architectural element—dual positional biases, code-switching, atom/molecule mixture—is essential for state-of-the-art transfer. ESM-AA also demonstrates zero-shot and cross-domain generalizability in virtual screening and molecule-only tasks, positioning it as a truly unified macromolecular LLM.
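To make the code-switching idea concrete, the toy sketch below randomly "unzips" residues into atom tokens while recording each token's parent residue index, which is the information the residue-level half of the multi-scale positional encoding needs. The residue-to-atom lookup, token naming, and unzipping probability are illustrative assumptions; this is not the authors' implementation, which additionally attaches 3D atom coordinates and the Gaussian-kernel pairwise biases mentioned above.

```python
import random

# Hypothetical residue -> heavy-atom-token lookup (toy subset; the real model
# works with full atom compositions and their 3D coordinates).
RESIDUE_ATOMS = {
    "G": ["N", "CA", "C", "O"],
    "A": ["N", "CA", "C", "O", "CB"],
    "S": ["N", "CA", "C", "O", "CB", "OG"],
}

def code_switch(sequence, unzip_prob=0.3, seed=0):
    """Randomly 'unzip' residues into atom tokens, keeping a map from every
    token back to its residue index for residue-level positional encoding."""
    rng = random.Random(seed)
    tokens, residue_index = [], []
    for i, res in enumerate(sequence):
        if res in RESIDUE_ATOMS and rng.random() < unzip_prob:
            atoms = RESIDUE_ATOMS[res]
            tokens.extend(f"<atom:{a}>" for a in atoms)
            residue_index.extend([i] * len(atoms))
        else:
            tokens.append(res)
            residue_index.append(i)
    return tokens, residue_index

print(code_switch("GASAG"))
```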
7. Significance, Impact, and Future Directions
The ESM Series establishes evolutionary-scale, self-supervised Transformer architectures as central tools for protein science, facilitating rapid sequence-based function assignment, mutation effect mapping, and structure-guided design. The expansion into atomistic modeling and multimodal, generative design (ESM-3/ESM-AA) indicates a consolidation of protein LLMs as a canonical foundation model class in biology.
Ongoing challenges include scaling to longer sequences, further integrating experimental structural data, robustly modeling post-translational modifications, and democratizing inference access. The trajectory of improvements—from ESM-1b's single-sequence models to ESM-3's multimodal generation and ESM-AA's molecular unification—offers a roadmap for evolutionary-scale foundation models in bioscience (Xiao et al., 21 Feb 2025, Zheng et al., 5 Mar 2024).