
ESM-AA: Unified Protein and Molecule Modeling

Updated 25 February 2026
  • ESM-AA is a unified multi-scale protein language model that integrates coarse residue and fine atom-level details through a code-switched, dual positional encoding approach.
  • The model achieves state-of-the-art performance on protein–molecule interaction benchmarks by effectively bridging residue-scale and atomic-scale tasks.
  • Its 12-layer Transformer with dual residue- and atom-scale encodings enables simultaneous, unified modeling of proteins and small molecules without dedicated modules.

ESM All-Atom (ESM-AA) is a unified, multi-scale protein language model that integrates coarse residue-sequence reasoning with fine atom-level detail within a single Transformer backbone. By leveraging code-switched input representations and a novel multi-scale positional encoding, ESM-AA models proteins and small molecules simultaneously, supporting unified molecular reasoning without separate encoders or dedicated modules for distinct molecule types. This approach overcomes the traditional limitation of residue-scale models, delivering state-of-the-art performance on protein–molecule interaction benchmarks while remaining competitive on standalone protein and molecule tasks (Zheng et al., 2024).

1. Model Architecture and Input Representation

ESM-AA operates on input sequences $X = (h_1, \ldots, h_n)$, where each token $h_i$ is either a residue or an atom. Residue tokens $r_i \in \{\mathrm{A}, \mathrm{C}, \ldots, \mathrm{V}\}$ are embedded as $E^{\mathrm{res}}(r_i)$, while atom tokens $a_i$ are associated with type embeddings $E^{\mathrm{atom}}(a_i)$ and 3D coordinates $c_i \in \mathbb{R}^3$.

For proteins, ESM-AA employs "unzipping": a random 1% of residues per protein sequence are expanded in situ into their constituent atoms in sequential order, yielding a code-switched sequence such as $(r_1, \ldots, r_i, a_{i,1}, \ldots, a_{i,N_i}, \ldots, r_L)$. Each atom token inherits the order index $i$ of its parent residue but carries its true 3D coordinate. The original residue token remains in the sequence, providing an implicit alignment between residue-level and atom-level representations. For small molecules, the input is simply an ordered list of atoms, with no residue tokens present.
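As a concrete illustration, the unzipping step can be sketched as follows. The per-residue atom table and the token format are hypothetical simplifications (real residues carry their full heavy-atom sets plus 3D coordinates):

```python
import random

# Hypothetical, heavily simplified atom tables: each residue type maps to
# an ordered list of heavy-atom names (the real model uses full atom sets
# and each atom's 3D coordinate).
RESIDUE_ATOMS = {
    "A": ["N", "CA", "C", "O", "CB"],
    "G": ["N", "CA", "C", "O"],
}

def unzip_sequence(residues, ratio=0.01, rng=None):
    """Code-switch a residue sequence: after each selected residue, splice
    in its constituent atom tokens in order. The residue token itself is
    kept, and every atom inherits its parent's residue index, which is
    what aligns the residue-scale and atom-scale views."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    tokens = []
    for i, r in enumerate(residues):
        tokens.append(("res", r, i))               # residue token, index i
        if r in RESIDUE_ATOMS and rng.random() < ratio:
            for atom in RESIDUE_ATOMS[r]:
                tokens.append(("atom", atom, i))   # atoms inherit index i
    return tokens
```

For example, `unzip_sequence("AG", ratio=1.0)` yields the residue tokens interleaved with their atoms: `("res", "A", 0)` followed by `("atom", "N", 0)`, `("atom", "CA", 0)`, and so on.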

The backbone is a 12-layer Transformer with 20 attention heads per layer, hidden dimension $d = 480$, and feedforward size 1920. The only architectural differences from ESM-2 are the multi-scale positional encoding and an atom-level additive bias on the attention logits.

2. Multi-Scale Positional Encoding

To resolve positional ambiguity arising from mixed residue and atom tokens, ESM-AA introduces two complementary encodings per token:

  • Residue-Scale Encoding ($E^R$): Implements Rotary Position Embedding (RoPE) as in ESM-2. Each token $h_i$ (residue or atom) derives a residue index $p_i$; atoms "unzipped" from a residue inherit $p_i$ from their parent residue, while molecule atoms receive $p_i = 0$. RoPE is applied to $p_i$: $E^R_i = \mathrm{RoPE}(p_i)$. This ensures all atoms from the same residue share the same RoPE encoding, preserving chain order and supporting residue-level tasks.
  • Atom-Scale Encoding ($E^A$): Injects 3D atomic relations into attention. For atom–atom pairs $(h_i, h_j)$, $E^A_{ij} = G(\|c_i - c_j\|)$, where $G$ is a multi-channel Gaussian kernel over Euclidean distances. For other token combinations, $E^A_{ij} = 0$. These scalars are added directly to the attention logits as an additive bias, encoding continuous spatial relationships.

This dual encoding framework maintains accurate residue positional relations while enabling fine-grained modeling of atomic geometry.
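The atom-scale bias can be sketched as follows; the Gaussian channel centers, widths, and mixture weights are illustrative stand-ins for learned parameters, and the residue-scale RoPE path is omitted:

```python
import numpy as np

def atom_scale_bias(coords, is_atom, centers=None, widths=None, weights=None):
    """Sketch of the atom-scale encoding E^A: a multi-channel Gaussian
    kernel over pairwise Euclidean distances, collapsed to one scalar
    bias per token pair. Channel parameters here are hypothetical
    stand-ins for learned values."""
    n_channels = 16
    centers = np.linspace(0.0, 10.0, n_channels) if centers is None else centers
    widths = np.full(n_channels, 1.0) if widths is None else widths
    weights = np.full(n_channels, 1.0 / n_channels) if weights is None else weights

    # Pairwise Euclidean distances, shape (n, n).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Multi-channel Gaussian features, shape (n, n, n_channels).
    phi = np.exp(-((dist[..., None] - centers) ** 2) / (2.0 * widths ** 2))

    # Scalar bias per pair; zero whenever either token is not an atom.
    bias = phi @ weights
    pair_is_atom = np.outer(is_atom, is_atom)
    return bias * pair_is_atom
```

In the full model, this $(n, n)$ matrix would be broadcast across attention heads and added to the pre-softmax attention logits.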

3. Pre-Training Objectives and Protocol

ESM-AA is pre-trained on a mixture of 8 million high-confidence protein structures (from the AlphaFold DB, $\mathrm{pLDDT} > 90$) and 19 million small molecules (209 million conformers) in a code-switched fashion. Pre-training optimizes two losses simultaneously:

  • Multi-Scale Masked Language Modeling (MLM): 15% of tokens (residues and atoms) are randomly masked and must be recovered using standard cross-entropy:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{h \in \mathrm{MASK}(X)} \log p\left(h \mid X_{\smallsetminus \mathrm{MASK}(X)}\right)$$

  • Pairwise Distance Recovery (PDR): For atoms within each residue, distances $\hat d_{ij}$ predicted from corrupted (noise-added) coordinates must recover the true distances $d_{ij} = \|c_i - c_j\|$, minimizing squared error:

$$\mathcal{L}_{\mathrm{PDR}} = \sum_{i,j \,\in\, \text{same residue}} (\hat d_{ij} - d_{ij})^2$$

The total loss is $\mathcal{L} = w_{\mathrm{MLM}}\mathcal{L}_{\mathrm{MLM}} + w_{\mathrm{PDR}}\mathcal{L}_{\mathrm{PDR}}$, with $w_{\mathrm{MLM}} = 4.0$ and $w_{\mathrm{PDR}} = 10.0$.
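A minimal sketch of the combined objective, assuming pre-extracted same-residue distance pairs and mean cross-entropy over the masked positions (names and shapes are illustrative, not the authors' code):

```python
import numpy as np

W_MLM, W_PDR = 4.0, 10.0  # loss weights from the pre-training recipe

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def esm_aa_loss(logits, targets, mask, pred_dist, true_dist):
    """Weighted sum of the two pre-training objectives.
    logits: (T, V) token predictions; targets: (T,) true token ids;
    mask: (T,) bool, True at the 15% masked positions (MLM is scored
    only there). pred_dist/true_dist: distances for the same-residue
    atom pairs, predicted from corrupted coordinates vs. ground truth."""
    logp = log_softmax(logits)
    mlm = -logp[mask, targets[mask]].mean()      # cross-entropy on masked tokens
    pdr = ((pred_dist - true_dist) ** 2).sum()   # squared-error distance recovery
    return W_MLM * mlm + W_PDR * pdr
```

Note that increasing a single distance error by 1 raises the total loss by exactly $w_{\mathrm{PDR}} \cdot 1^2 = 10$, reflecting the heavier weighting of the geometric objective.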

Training details include: 12-layer Transformer, 2048-token sequence cap, up to 256k tokens per batch, 300k update steps on 16 A100 GPUs over 3 days, Adam optimizer, and initialization from ESM-2 checkpoints.

4. Unified Molecular Modeling Capabilities

By training on both code-switched protein sequences and all-atom molecule sequences, ESM-AA serves as a single model for any molecular class of interest. It preserves residue-level co-evolution knowledge by reuse of ESM-2’s RoPE and checkpoint, while gaining atomic-scale geometric reasoning through explicit atom inputs and pairwise distance learning. No task-specific modules or external molecular graph networks are required.

The model’s code-switching strategy enables implicit alignment of residue and atomic representations, accommodating cross-scale learning and transfer. The result is a Transformer encoder that is equally applicable to proteins, small molecules, or their complexes.

5. Evaluation, Benchmarks, and Ablations

ESM-AA attains strong or state-of-the-art performance on a spectrum of molecular modeling benchmarks:

| Task | Metric | Baseline | ESM-AA |
| --- | --- | --- | --- |
| Enzyme–Substrate Affinity (KM) | MSE / R² / Pearson | 0.642 / 0.536 / 0.733 | 0.599 / 0.566 / 0.753 |
| Drug–Target Affinity (Davis) | MSE / CI / rm² | ≈0.202 / ≈0.907 / ≈0.685 | 0.191 / 0.906 / 0.759 |
| Virtual Screening (DUD-E, zero-shot) | AUROC / BEDROC / EF0.5 | ≈76.7% / — / — | 80.02% / 39.23% / 28.91 |
| Protein Function (EC, DeepFRI, 95% id.) | AUPR (Fₘₐₓ) | 0.803 (0.786) | 0.82 (0.797) |

Performance is consistently competitive on protein-only tasks (secondary structure, contact prediction) and molecule-only tasks (MoleculeNet property benchmarks, e.g., QM9 MAE=0.00590 vs. Uni-Mol’s 0.00540). ESM-AA also approaches, though does not always surpass, best-in-class methods relying on massive dedicated pocket pre-training (e.g., DrugCLIP).

Ablation studies demonstrate that all novel model components—RoPE, atom bias, MLM, PDR objective, code-switching, and mixed-data training—are instrumental for optimal performance. Removing any module, or excluding molecule/protein data or the unzipping operation, results in measurable degradation on protein–molecule tasks.

6. Strengths, Limitations, and Future Directions

Strengths of ESM-AA include:

  • Unified modeling of proteins and small molecules in a single 35M-parameter Transformer.
  • Preservation of ESM-2’s excellence on residue-level tasks while adding explicit atomic geometry and chemistry.
  • Superior results relative to model fusions (e.g., ESM-2+Uni-Mol) for protein–molecule interactions.
  • Competitive or state-of-the-art performance across classic protein, molecular, and interaction prediction datasets.

Limitations:

  • Only ≈1% of residues are ever "unzipped" per protein; the model never observes fully atomized proteins end-to-end.
  • Pairwise distance recovery (PDR) only considers intra-residue atom pairs and ignores inter-residue atomic contacts.
  • 3D geometry is incorporated solely via non-equivariant attention biases, not equivariant layers.

Anticipated directions include dynamic or functionally guided unzipping for better exposure of active-site residues, incorporation of E(3)- or SE(3)-equivariant primitives, extension to additional macromolecular interfaces (protein–protein, protein–RNA), and joint structure-plus-sequence generative modeling for de novo enzyme and ligand design.

7. Significance in the Context of Protein and Molecular Modeling

ESM-AA demonstrates that multi-scale pre-training using code-switched, mixed-resolution representations, combined with dual-scale positional encodings, allows Transformer models to bridge residue-level and atom-level molecular representations. It establishes that a single architecture can simultaneously excel at traditional protein LLM tasks and tasks demanding atomistic detail, including protein–molecule affinity and virtual screening—without bespoke pocket modules or external graph encoders. This unification supports future directions in end-to-end, cross-modal molecular design spanning the chemical and biological macromolecule space (Zheng et al., 2024).

References

  • Zheng, K., et al. (2024). ESM All-Atom: Multi-Scale Protein Language Model for Unified Molecular Modeling.
