ESM-AA: Unified Protein and Molecule Modeling
- ESM-AA is a unified multi-scale protein language model that integrates coarse residue and fine atom-level details through a code-switched, dual positional encoding approach.
- The model achieves state-of-the-art performance on protein–molecule interaction benchmarks by effectively bridging residue-scale and atomic-scale tasks.
- Its design pairs a 12-layer Transformer with dual (residue-scale and atom-scale) positional encodings, enabling simultaneous, unified modeling of proteins and small molecules without dedicated modules.
ESM All-Atom (ESM-AA) is a unified, multi-scale protein language model that integrates coarse residue-sequence reasoning with fine atom-level detail within a single Transformer backbone. By leveraging code-switched input representations and a novel multi-scale positional encoding, ESM-AA enables simultaneous modeling of proteins and small molecules, supporting unified molecular reasoning without separate encoders or dedicated modules for distinct molecule types. This approach overcomes the traditional limitation of residue-scale models, delivering state-of-the-art performance on protein–molecule interaction benchmarks while retaining or exceeding competitive results on standalone protein and molecule tasks (Zheng et al., 2024).
1. Model Architecture and Input Representation
ESM-AA operates on input sequences $X = (x_1, \dots, x_n)$, where each token $x_i$ is either a residue or an atom. Residue tokens are embedded through a learned token embedding, while atom tokens are associated with element-type embeddings and 3D coordinates $\mathbf{r}_i \in \mathbb{R}^3$.
For proteins, ESM-AA employs "unzipping": a random ~1% of residues per protein sequence are expanded in place into their constituent atoms in sequential order, yielding a code-switched sequence such as $(r_1, r_2, r_3, a_{3,1}, \dots, a_{3,m}, r_4, \dots)$, where $a_{3,k}$ denotes the $k$-th atom of residue $r_3$. Each atom token inherits its parent residue's order index but carries its true 3D coordinate. The original residue token remains in the sequence, providing an implicit alignment between residue-level and atom-level representations. For small molecules, the input is a simple ordered list of atoms, with no residue tokens present.
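The unzipping step can be sketched in a few lines of Python. This is a minimal illustration: the residue-to-atom table, function name, and token format below are illustrative assumptions, not the paper's implementation.

```python
import random

# Hypothetical residue -> heavy-atom decomposition (illustrative only;
# the real model uses a full atom vocabulary with 3D coordinates).
RESIDUE_ATOMS = {
    "A": ["N", "CA", "C", "O", "CB"],  # alanine
    "G": ["N", "CA", "C", "O"],        # glycine
}

def unzip(sequence, ratio=0.01, rng=None):
    """Code-switch a residue sequence: for a random subset of residues,
    insert the residue's constituent atom tokens directly after it.
    Every token carries the order index of its (parent) residue."""
    rng = rng or random.Random(0)
    tokens = []
    for idx, res in enumerate(sequence):
        tokens.append((res, idx, "residue"))
        if res in RESIDUE_ATOMS and rng.random() < ratio:
            for atom in RESIDUE_ATOMS[res]:
                # Atom tokens inherit the parent residue's order index.
                tokens.append((atom, idx, "atom"))
    return tokens

tokens = unzip("AGAG" * 50, ratio=0.25)
```

Because every atom token reuses its parent's order index, residue-level position information survives code-switching unchanged.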
The backbone is a 12-layer Transformer with 20 attention heads per layer, hidden dimension 480, and feed-forward size 1920. The only architectural differences from ESM-2 are the use of multi-scale positional encoding and the addition of an atom-level additive bias to the attention logits.
2. Multi-Scale Positional Encoding
To resolve positional ambiguity arising from mixed residue and atom tokens, ESM-AA introduces two complementary encodings per token:
- Residue-Scale Encoding: implements Rotary Position Embedding (RoPE) as in ESM-2. Each token (residue or atom) derives a residue index $r_i$; atoms "unzipped" from a residue inherit $r_i$ from their parent residue, while all atoms of a standalone molecule receive the same shared index. RoPE rotates query and key vectors by angles proportional to $r_i$, so all atoms from the same residue share the same rotary encoding, preserving chain order and supporting residue-level tasks.
- Atom-Scale Encoding: injects 3D atomic relations into attention. For an atom–atom pair $(i, j)$, the encoding is $\psi(\lVert \mathbf{r}_i - \mathbf{r}_j \rVert)$, where $\psi$ is a multi-channel Gaussian kernel over the Euclidean distance; for all other token pairs, the encoding is zero. These scalars are added directly to the attention logits as an attention bias, encoding continuous spatial relationships.
This dual encoding framework maintains accurate residue positional relations while enabling fine-grained modeling of atomic geometry.
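The atom-scale bias can be sketched as follows. For simplicity this sketch collapses the multi-channel Gaussian features into a single scalar per pair; the kernel centers, widths, and function names are illustrative assumptions, whereas the real model learns the Gaussian parameters and projects the channels per attention head.

```python
import math

def gaussian_kernel(dist, centers=(0.0, 2.0, 4.0, 8.0), width=1.0):
    """Multi-channel Gaussian features of an inter-atomic distance,
    summed into one scalar here for illustration."""
    return sum(math.exp(-((dist - c) ** 2) / (2 * width ** 2)) for c in centers)

def atom_scale_bias(coords):
    """Pairwise attention bias: Gaussian-kernel features of Euclidean
    distance for atom-atom pairs, and 0 whenever either token is a
    residue. coords[i] is an (x, y, z) tuple for atoms, None for residues."""
    n = len(coords)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if coords[i] is not None and coords[j] is not None:
                bias[i][j] = gaussian_kernel(math.dist(coords[i], coords[j]))
    return bias
```

The resulting matrix would be added to the attention logits before the softmax, so spatially close atoms attend to each other more strongly.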
3. Pre-Training Objectives and Protocol
ESM-AA is pre-trained on a mixture of 8 million high-confidence protein structures from the AlphaFold Database and 19 million small molecules (209 million conformers) in a code-switched fashion. Pre-training optimizes two losses simultaneously:
- Multi-Scale Masked Language Modeling (MLM): 15% of tokens (residues and atoms) are randomly masked and must be recovered with a standard cross-entropy loss, $\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid X_{\setminus \mathcal{M}})$, where $\mathcal{M}$ is the set of masked positions.
- Pairwise Distance Recovery (PDR): for atom pairs $(i, j)$ within each unzipped residue, the true distance $d_{ij}$ must be predicted from corrupted (noise-added) coordinates, minimizing the squared error $\mathcal{L}_{\mathrm{PDR}} = \sum_{(i,j)} \bigl(\hat{d}_{ij} - d_{ij}\bigr)^2$.
The total loss is a weighted sum, $\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \lambda\,\mathcal{L}_{\mathrm{PDR}}$, with a scalar weight $\lambda$ balancing the two objectives.
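The two objectives combine as a weighted sum, which can be sketched as below. The averaging over pairs and the weighting scalar are illustrative assumptions, not the paper's exact settings.

```python
import math

def pdr_loss(true_coords, pred_dists):
    """Pairwise Distance Recovery: mean squared error between distances
    predicted from noised coordinates and the true intra-residue
    distances. pred_dists maps index pairs (i, j) to predicted values."""
    loss, pairs = 0.0, 0
    n = len(true_coords)
    for i in range(n):
        for j in range(i + 1, n):
            d_true = math.dist(true_coords[i], true_coords[j])
            loss += (pred_dists[(i, j)] - d_true) ** 2
            pairs += 1
    return loss / max(pairs, 1)

def total_loss(mlm_loss, pdr, lam=1.0):
    # Weighted sum of the two pre-training objectives; the weight
    # lam is illustrative, not the paper's reported setting.
    return mlm_loss + lam * pdr
```

A perfect distance prediction drives the PDR term to zero, leaving only the MLM cross-entropy.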
Training details include: 12-layer Transformer, 2048-token sequence cap, up to 256k tokens per batch, 300k update steps on 16 A100 GPUs over 3 days, Adam optimizer, and initialization from ESM-2 checkpoints.
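The training setup above can be collected into a compact configuration sketch; the values are taken from the description here, while the field names themselves are illustrative.

```python
# Pre-training hyperparameters as described above (field names illustrative).
pretrain_config = {
    "num_layers": 12,
    "attention_heads": 20,
    "ffn_dim": 1920,
    "max_seq_len": 2048,              # per-sequence token cap
    "max_tokens_per_batch": 256_000,
    "update_steps": 300_000,
    "optimizer": "adam",
    "init_from": "esm2_checkpoint",   # weights initialized from ESM-2
    "mask_ratio": 0.15,               # MLM masking rate
}
```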
4. Unified Molecular Modeling Capabilities
By training on both code-switched protein sequences and all-atom molecule sequences, ESM-AA serves as a single model covering proteins, small molecules, and their complexes. It preserves residue-level co-evolutionary knowledge by reusing ESM-2's RoPE and checkpoint, while gaining atomic-scale geometric reasoning through explicit atom inputs and pairwise distance learning. No task-specific modules or external molecular graph networks are required.
The model’s code-switching strategy enables implicit alignment of residue and atomic representations, accommodating cross-scale learning and transfer. The result is a Transformer encoder that is equally applicable to proteins, small molecules, or their complexes.
5. Evaluation, Benchmarks, and Ablations
ESM-AA attains strong or state-of-the-art performance on a spectrum of molecular modeling benchmarks:
| Task | Metric | Baseline | ESM-AA |
|---|---|---|---|
| Enzyme–Substrate Affinity (KM) | MSE / R² / Pearson | 0.642 / 0.536 / 0.733 | 0.599 / 0.566 / 0.753 |
| Drug–Target Affinity (Davis) | MSE / CI / $r_m^2$ | ≈0.202 / ≈0.907 / ≈0.685 | 0.191 / 0.906 / 0.759 |
| Virtual Screening (DUD-E, zero-shot) | AUROC / BEDROC / EF0.5 | ≈76.7% / — / — | 80.02% / 39.23% / 28.91 |
| Protein Function (EC, DeepFRI, 95% id.) | AUPR (Fₘₐₓ) | 0.803 (0.786) | 0.82 (0.797) |
Performance is consistently competitive on protein-only tasks (secondary structure, contact prediction) and molecule-only tasks (MoleculeNet property benchmarks, e.g., QM9 MAE=0.00590 vs. Uni-Mol’s 0.00540). ESM-AA also approaches, though does not always surpass, best-in-class methods relying on massive dedicated pocket pre-training (e.g., DrugCLIP).
Ablation studies demonstrate that all novel model components—RoPE, atom bias, MLM, PDR objective, code-switching, and mixed-data training—are instrumental for optimal performance. Removing any module, or excluding molecule/protein data or the unzipping operation, results in measurable degradation on protein–molecule tasks.
6. Strengths, Limitations, and Future Directions
Strengths of ESM-AA include:
- Unified modeling of proteins and small molecules in a single 35M-parameter Transformer.
- Preservation of ESM-2’s excellence on residue-level tasks while adding explicit atomic geometry and chemistry.
- Superior results relative to model fusions (e.g., ESM-2+Uni-Mol) for protein–molecule interactions.
- Competitive or state-of-the-art performance across classic protein, molecular, and interaction prediction datasets.
Limitations:
- Only ≈1% of residues are ever "unzipped" per protein; the model never observes fully atomized proteins end-to-end.
- Pairwise distance recovery (PDR) only considers intra-residue atom pairs and ignores inter-residue atomic contacts.
- 3D geometry is incorporated solely via non-equivariant attention biases, not equivariant layers.
Anticipated directions include dynamic or functionally guided unzipping for better exposure of active-site residues, incorporation of E(3)- or SE(3)-equivariant primitives, extension to additional macromolecular interfaces (protein–protein, protein–RNA), and joint structure-plus-sequence generative modeling for de novo enzyme and ligand design.
7. Significance in the Context of Protein and Molecular Modeling
ESM-AA demonstrates that multi-scale pre-training using code-switched, mixed-resolution representations, combined with dual-scale positional encodings, allows Transformer models to bridge residue-level and atom-level molecular representations. It establishes that a single architecture can simultaneously excel at traditional protein LLM tasks and tasks demanding atomistic detail, including protein–molecule affinity and virtual screening—without bespoke pocket modules or external graph encoders. This unification supports future directions in end-to-end, cross-modal molecular design spanning the chemical and biological macromolecule space (Zheng et al., 2024).