LaMPSite Pipeline Overview
- LaMPSite is a computational framework that predicts 3D ligand binding sites using protein language models and graph neural networks from sequence and ligand graph inputs.
- It integrates distinct protein and ligand encoding modules to extract residue-level embeddings and graph representations without requiring experimentally determined 3D structures.
- The pipeline employs geometric-aware interaction updates, ensemble merging, and spatial clustering to achieve competitive performance on benchmark datasets and support drug discovery.
LaMPSite is an end-to-end computational pipeline for predicting 3D ligand binding sites from protein sequence, without requiring experimentally determined protein structures. The architecture combines a protein language model with graph neural networks (GNNs) to infer protein-ligand interactions, using residue-level embeddings, predicted contact maps, ligand graph representations, and geometry-aware neural modules. Using only protein sequences and ligand molecular graphs as input, LaMPSite achieves binding site predictions competitive with structure-based baselines, which addresses a key limitation for novel or poorly characterized proteins (Zhang et al., 2023).
1. Input Encoding and Feature Extraction
LaMPSite processes protein sequences and ligand molecular graphs through distinct but integrated encoding modules:
- Protein Encoder (ESM-2-650M): Given a protein of $L$ residues, the ESM-2-650M language model computes residue-level embeddings, one 1280-dimensional vector per residue.
Simultaneously, ESM-2 produces a residue-residue contact probability matrix of shape $L \times L$.
The embeddings are projected to a reduced dimension of 128 via a learned linear transformation.
- Ligand Encoder (GNN): The ligand, modeled as an undirected graph with $N$ atoms, is represented using one-hot atom-type encodings. A four-layer message-passing GNN generates a 128-dimensional hidden vector for each atom.
Independently, a 3D conformer is generated using RDKit, producing an $N \times N$ ligand atom distance matrix.
This approach allows LaMPSite to exploit both sequence-derived and graph-based molecular features without requiring 3D protein coordinates.
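As a small illustration of the last ligand-encoding step, the sketch below builds a pairwise atom distance matrix from conformer coordinates. It is a minimal example under stated assumptions: the coordinates are hand-written placeholders standing in for an RDKit-generated conformer (the RDKit call itself is not reproduced), and the function name `distance_matrix` is hypothetical.

```python
import math

def distance_matrix(coords):
    """Return the symmetric matrix of Euclidean distances between atoms.

    coords: list of (x, y, z) tuples, one per ligand atom.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

# Placeholder coordinates in angstroms (assumption: not taken from a real conformer).
ligand_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0)]
D = distance_matrix(ligand_coords)
```

The result is a square matrix with zeros on the diagonal, which is the form the interaction module would consume alongside the protein contact map.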
2. Protein–Ligand Interaction Embedding
LaMPSite constructs and iteratively updates protein-ligand interaction embeddings:
- Initialization: The interaction embedding of each residue–atom pair is the elementwise product of the projected residue embedding and the atom hidden vector, yielding one embedding per pair.
- Trigonometry-Aware Update (2 Blocks): Two stacked blocks perform geometric-aware updates using a variant of Evoformer pairwise updates:
- Protein-Side Triangle Multiplication:
- Ligand-Side Triangle Multiplication:
- Residual and Feed-Forward:
All projections and feed-forward networks use a hidden size of $32$, ReLU activations, and a dropout rate of $0.25$.
This mechanism enables contextualization of protein–ligand interactions with inferred geometric and topological signals, without requiring 3D coordinates.
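The data flow of this section can be sketched in plain Python. This is a deliberately simplified illustration, not the model's actual update: it assumes the initialization is an elementwise product per residue–atom pair, and it replaces the learned, gated triangle multiplications with plain weighted sums over residues (weighted by a contact matrix) and over atoms (weighted by a hypothetical atom affinity matrix). Learned projections, gating, normalization, dropout, and the feed-forward layer are all omitted, and every function name is hypothetical.

```python
def init_interaction(residue_emb, atom_emb):
    """z[i][j][c] = residue_emb[i][c] * atom_emb[j][c] for every residue-atom pair."""
    return [[[r * a for r, a in zip(res, atom)] for atom in atom_emb]
            for res in residue_emb]

def protein_side_update(z, contact):
    """Mix over residues: out[i][j] = sum_k contact[i][k] * z[k][j]."""
    L, N, d = len(z), len(z[0]), len(z[0][0])
    return [[[sum(contact[i][k] * z[k][j][c] for k in range(L)) for c in range(d)]
             for j in range(N)] for i in range(L)]

def ligand_side_update(z, atom_affinity):
    """Mix over atoms: out[i][j] = sum_k z[i][k] * atom_affinity[k][j]."""
    L, N, d = len(z), len(z[0]), len(z[0][0])
    return [[[sum(z[i][k][c] * atom_affinity[k][j] for k in range(N)) for c in range(d)]
             for j in range(N)] for i in range(L)]

def residual_block(z, contact, atom_affinity):
    """Add both geometry-weighted messages back onto z (residual connection)."""
    p = protein_side_update(z, contact)
    q = ligand_side_update(z, atom_affinity)
    return [[[z[i][j][c] + p[i][j][c] + q[i][j][c] for c in range(len(z[0][0]))]
             for j in range(len(z[0]))] for i in range(len(z))]

# Toy inputs: 2 residues, 2 atoms, embedding size 2 (all values are placeholders).
residue_emb = [[1.0, 2.0], [3.0, 4.0]]
atom_emb = [[1.0, 0.5], [2.0, 1.0]]
contact = [[1.0, 0.5], [0.5, 1.0]]          # residue-residue weights
atom_affinity = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical atom-atom weights

z = init_interaction(residue_emb, atom_emb)
z = residual_block(z, contact, atom_affinity)
```

The point of the sketch is the shape bookkeeping: the interaction tensor keeps one vector per residue–atom pair throughout, and geometric information enters only through residue-residue and atom-atom matrices, never through protein coordinates.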
3. Binding Site Prediction and Loss Function
Predictions are aggregated and evaluated through a multi-stage pooling and scoring procedure:
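- Merging and Scoring: The updated interaction embeddings are pooled into a site score for each residue, and a sigmoid transformation converts each score into a binding probability.
- Loss Function: Binary cross-entropy between the predicted residue probabilities and the ground-truth residue labels.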
This approach allows the model to learn residue-level binding site assignment from weakly or ambiguously labeled data.
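The scoring and loss steps can be illustrated with a short, self-contained sketch. It assumes the pooled per-residue scores are already available (the pooling itself is not reproduced here) and uses the standard mean binary cross-entropy; the function names and the clipping constant `eps` are illustrative choices rather than details from the source.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bce_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between sigmoid(scores) and 0/1 residue labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        # Clip probabilities away from 0 and 1 so the logarithms stay finite.
        p = min(max(sigmoid(s), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)

# Placeholder residue scores and labels (1 = binding residue).
scores = [2.0, -1.0, 0.0]
labels = [1, 0, 0]
probs = [sigmoid(s) for s in scores]
loss = bce_loss(scores, labels)
```

Scores that agree with the labels give a lower loss than scores that contradict them, which is the signal used to train the residue-level classifier.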
4. Inference, Clustering, and Ranking
Site prediction at inference employs probabilistic thresholding and spatial clustering to identify and rank binding pockets:
- Thresholding: Residues whose predicted binding probability exceeds a threshold are retained.
- Clustering: Single-linkage clustering is performed on the selected residues, using the predicted residue-residue contact probabilities as the affinity metric (a proxy for inverse distance).
- Cluster Scoring: Each cluster is assigned a score computed from the probabilities of its member residues.
- Cluster Selection: The top-$n$ clusters are selected, where $n$ matches the known number of binding pockets, to compute DCA (distance from the predicted site center to the closest ligand atom) success rates.
This inference strategy integrates probabilistic prediction, geometric affinity, and cluster ranking.
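The four inference steps can be sketched as one small function. Several details here are assumptions made for illustration only: the cutoffs `prob_cutoff` and `link_cutoff` are hypothetical parameters with placeholder values, single linkage is realized as connected components of the graph whose edges are residue pairs with affinity at or above `link_cutoff` (equivalent to cutting a single-linkage dendrogram at that level), and each cluster is scored by the sum of squared residue probabilities, which stands in for the scoring formula not given in this text.

```python
def predict_sites(probs, affinity, prob_cutoff, link_cutoff, n_sites):
    """Threshold residues, cluster them by single linkage, return the top clusters."""
    kept = [i for i, p in enumerate(probs) if p > prob_cutoff]
    parent = {i: i for i in kept}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every retained pair whose affinity reaches the cutoff.
    for pos, a in enumerate(kept):
        for b in kept[pos + 1:]:
            if affinity[a][b] >= link_cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for i in kept:
        clusters.setdefault(find(i), []).append(i)

    # Assumed cluster score: sum of squared member probabilities.
    ranked = sorted(clusters.values(),
                    key=lambda members: sum(probs[i] ** 2 for i in members),
                    reverse=True)
    return ranked[:n_sites]

# Toy example: five residues, two well-separated groups (placeholder values).
probs = [0.9, 0.8, 0.1, 0.7, 0.6]
affinity = [[0.05] * 5 for _ in range(5)]
for a, b, p in [(0, 1, 0.9), (3, 4, 0.8)]:
    affinity[a][b] = affinity[b][a] = p

top_sites = predict_sites(probs, affinity, prob_cutoff=0.5, link_cutoff=0.5, n_sites=2)
```

In this toy case residue 2 is dropped by the probability threshold, and the remaining residues form two clusters, with the cluster containing residues 0 and 1 ranked first because its members have higher probabilities.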
5. Architectural Hyperparameters and Training Procedure
Key network design and optimization parameters:
| Module | Dimension / Setting | Details |
|---|---|---|
| Protein encoder | 1280-dim embeddings, projected to 128 | ESM-2-650M |
| Ligand encoder | 4-layer GNN, hidden 128 | ReLU |
| Interaction module | 2 trigonometry blocks | Hidden 32, dropout 0.25 |
| Linear head | | |
| Optimizer | Adam | Learning rate |
| Batch size | 8 | |
| Max epochs | 30 | Early stopping patience 4 |
This configuration realizes a lightweight yet expressive model, balancing computational efficiency with predictive accuracy.
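For reference, the settings in the table can be collected into a configuration object, together with a patience-based early stopping check. This is a sketch: the class name, field names, and the `should_stop` helper are hypothetical, the learning rate is left unset because its value is not given in this text, and the stopping rule shown is one common interpretation of a patience of 4.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LaMPSiteConfig:
    protein_encoder: str = "ESM-2-650M"
    protein_proj_dim: int = 128
    ligand_gnn_layers: int = 4
    ligand_hidden_dim: int = 128
    interaction_blocks: int = 2
    interaction_hidden_dim: int = 32
    dropout: float = 0.25
    optimizer: str = "Adam"
    learning_rate: Optional[float] = None  # not stated in this text; must be supplied
    batch_size: int = 8
    max_epochs: int = 30
    early_stopping_patience: int = 4

def should_stop(val_losses, patience):
    """True once the best validation loss is at least `patience` epochs old."""
    if not val_losses:
        return False
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience

config = LaMPSiteConfig()
history = [0.90, 0.70, 0.71, 0.72, 0.73, 0.74]  # placeholder validation losses
stop_now = should_stop(history, config.early_stopping_patience)
```

With the placeholder history above, the best loss occurs at the second epoch and four epochs pass without improvement, so training would stop before reaching the 30-epoch limit.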
6. Empirical Performance and Ablations
On the COACH420 benchmark dataset (DCA threshold of 4 Å), LaMPSite demonstrates competitive site prediction performance, as shown below:
| Method | Top-$n$ Success Rate (%) |
|---|---|
| LaMPSite (sequence only) | 66.02 |
| LaMPSite$^{\text{holo}}$ (contacts) | 67.96 |
| Fpocket | ~54 |
| DeepSite | ~57 |
| Kalasanty | ~59 |
| DeepPocket | 67.96 |
| P2Rank | 68.24 |
Inference time is approximately $0.2$ seconds per protein on an NVIDIA V100.
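The success rates in the table above rest on a distance criterion that can be sketched as follows. The sketch takes DCA to be the distance from a predicted site center to the closest ligand atom and counts a ligand as recovered when any of the top-$n$ centers lies within the cutoff; this definition is an assumption here, how site centers are obtained from predicted residues is not covered, and the function names and coordinates are hypothetical.

```python
import math

def ligand_recovered(site_centers, ligand_atoms, cutoff=4.0):
    """True if any predicted center is within `cutoff` angstroms of any ligand atom."""
    return any(math.dist(center, atom) <= cutoff
               for center in site_centers
               for atom in ligand_atoms)

def success_rate(outcomes):
    """Percentage of ligands recovered, given a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Placeholder predictions: two top-ranked site centers for one protein.
centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand_a = [(3.0, 0.0, 0.0), (5.0, 0.0, 0.0)]   # closest atom is 3 angstroms away
ligand_b = [(20.0, 0.0, 0.0)]                   # closest atom is 10 angstroms away

outcomes = [ligand_recovered(centers, ligand_a), ligand_recovered(centers, ligand_b)]
rate = success_rate(outcomes)
```

Here the first ligand counts as a success and the second as a failure, giving a success rate of 50% for this toy pair.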
Ablation results indicate contributions of core modules (Top-$n$ DCA at 4 Å):
- Without interaction module:
- Without merging and :
- Without clustering/ranking:
This suggests interaction modeling and ensemble merging are essential for optimal predictive performance, while clustering and ranking further improve site localization.
7. Applications and Implications
LaMPSite circumvents the need for experimental protein 3D structures, requiring only sequence and ligand graph inputs. Because only a small fraction of proteins have reliable structure information, the framework extends binding site prediction to novel or poorly structurally characterized proteins. The method achieves results competitive with structure-based approaches (Zhang et al., 2023), supporting applications in protein function elucidation and drug discovery where resolved structures are not available. A plausible implication is faster in silico screening, since the structural bottleneck in target identification is reduced.