LaMPSite Pipeline Overview

Updated 22 February 2026

LaMPSite is a computational framework that predicts 3D ligand binding sites from protein sequence and ligand graph inputs, using protein language models and graph neural networks.
It integrates distinct protein and ligand encoding modules to extract residue-level embeddings and graph representations without requiring experimentally determined 3D structures.
The pipeline employs geometric-aware interaction updates, ensemble merging, and spatial clustering to achieve competitive performance on benchmark datasets and support drug discovery.

LaMPSite is an end-to-end computational pipeline for predicting 3D ligand binding sites from protein sequence, without requiring experimentally determined protein structures. The architecture combines protein language models and graph neural networks (GNNs) to infer protein-ligand interactions, using residue-level embeddings, predicted contact maps, ligand graph representations, and geometry-aware neural modules. With inputs limited to protein sequences and ligand molecular graphs, LaMPSite achieves binding site predictions competitive with structure-based baselines, which makes it applicable to novel or poorly characterized proteins (Zhang et al., 2023).

1. Input Encoding and Feature Extraction

LaMPSite processes protein sequences and ligand molecular graphs through distinct but integrated encoding modules:

Protein Encoder (ESM-2-650M): Given a protein of length $n_p$, the ESM-2-650M protein language model computes residue-level embeddings

$h^p = \begin{bmatrix} h^p_1,\dots,h^p_{n_p} \end{bmatrix}^\top \in \mathbb{R}^{n_p \times 1280}.$

Simultaneously, ESM-2 produces a residue-residue contact probability matrix

$C^p = \left[C^p_{ik}\right]_{i,k=1}^{n_p} \in [0,1]^{n_p \times n_p}.$

The embeddings are projected to a reduced dimension $d = 128$ via a learned linear transformation:

$\widetilde{h}^p_i = W^p h^p_i + b^p \in \mathbb{R}^{128}.$
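The projection above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random arrays stand in for real ESM-2 embeddings and learned weights, and the function name `project_embeddings` is hypothetical.

```python
import numpy as np

def project_embeddings(h_p, W_p, b_p):
    """Apply the affine map h -> W h + b to every residue embedding (row) of h_p."""
    # h_p: (n_p, 1280), W_p: (128, 1280), b_p: (128,)  ->  result: (n_p, 128)
    return h_p @ W_p.T + b_p

rng = np.random.default_rng(0)
n_p = 5                                      # toy protein length (assumption)
h_p = rng.normal(size=(n_p, 1280))           # stand-in for ESM-2 residue embeddings
W_p = rng.normal(size=(128, 1280)) * 0.01    # stand-in for the learned weight matrix
b_p = np.zeros(128)                          # stand-in for the learned bias

h_p_tilde = project_embeddings(h_p, W_p, b_p)
print(h_p_tilde.shape)
```

Writing the map as a single matrix product applies the same $W^p$ and $b^p$ to all residues at once, which matches the per-residue formula row by row.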

Ligand Encoder (GNN): The ligand, modeled as an undirected graph $(V,E)$ with $n_l$ atoms, is represented using one-hot atom-type encodings $x_j$. A four-layer message-passing GNN generates hidden vectors:

$h^l = \begin{bmatrix} h^l_1, \dots, h^l_{n_l} \end{bmatrix}^\top \in \mathbb{R}^{n_l \times 128}.$

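Independently, a 3D conformer is generated using RDKit. Writing $r_j \in \mathbb{R}^3$ for the conformer coordinates of atom $j$ (distinct from the one-hot encodings $x_j$ above), this gives a ligand atom distance matrix:

$D^l = \left[\|r_j-r_k\|_2\right]_{j,k=1}^{n_l} \in \mathbb{R}^{n_l \times n_l}.$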
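As a small illustration of the distance matrix, the sketch below computes all pairwise Euclidean distances from a coordinate array. The three coordinates are made up for the example; in the pipeline they would come from the RDKit conformer, which is not called here. The function name `pairwise_distance_matrix` is hypothetical.

```python
import numpy as np

def pairwise_distance_matrix(coords):
    """Return the (n, n) matrix of Euclidean distances between atom coordinates."""
    coords = np.asarray(coords, dtype=float)          # shape (n, 3)
    diff = coords[:, None, :] - coords[None, :, :]    # shape (n, n, 3) via broadcasting
    return np.linalg.norm(diff, axis=-1)              # shape (n, n)

# Toy coordinates in angstroms (assumption, not real conformer output).
coords = [[0.0, 0.0, 0.0],
          [3.0, 0.0, 0.0],
          [0.0, 4.0, 0.0]]
D_l = pairwise_distance_matrix(coords)
print(D_l)
```

The result is symmetric with a zero diagonal, so it depends only on the internal geometry of the conformer and not on its position or orientation.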

This approach allows LaMPSite to exploit both sequence-derived and graph-based molecular features without requiring 3D protein coordinates.

2. Protein–Ligand Interaction Embedding

LaMPSite constructs and iteratively updates protein-ligand interaction embeddings:

Initialization:

$z_{ij} = \widetilde{h}^p_i \odot h^l_j,$

where elementwise multiplication $\odot$ is performed for all residue–atom pairs, yielding $z \in \mathbb{R}^{n_p \times n_l \times 128}$.
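The initialization can be illustrated with NumPy broadcasting. In this sketch the inputs are random stand-ins for the projected residue embeddings and the ligand atom embeddings, both of width 128, and `init_interaction` is a hypothetical name.

```python
import numpy as np

def init_interaction(h_p_proj, h_l):
    """Elementwise product of every residue embedding with every atom embedding."""
    # (n_p, 1, d) * (1, n_l, d) broadcasts to (n_p, n_l, d)
    return h_p_proj[:, None, :] * h_l[None, :, :]

rng = np.random.default_rng(1)
h_p_proj = rng.normal(size=(4, 128))   # 4 residues (toy size, assumption)
h_l = rng.normal(size=(3, 128))        # 3 ligand atoms (toy size, assumption)

z = init_interaction(h_p_proj, h_l)
print(z.shape)
```

Each slice `z[i, j]` holds one 128-dimensional vector per residue–atom pair, so memory grows with the product of protein length and ligand size.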

Trigonometry-Aware Update (2 Blocks): Two stacked blocks perform geometric-aware updates using a variant of Evoformer pairwise updates. Each block consists of:

1. Protein-Side Triangle Multiplication: [equation missing in source]
2. Ligand-Side Triangle Multiplication: [equation missing in source]
3. Residual and Feed-Forward: [equations missing in source]

All projections and feed-forward networks use hidden size 32, ReLU activations, and dropout rate 0.25.

This mechanism enables contextualization of protein–ligand interactions with inferred geometric and topological signals, without requiring 3D coordinates.

3. Binding Site Prediction and Loss Function

Predictions are aggregated and evaluated through a multi-stage pooling and scoring procedure:

Merging and Scoring:

[merging equation missing in source]

Residue-level site scores $s_i$ are given by

[scoring equation missing in source]

A sigmoid transformation yields residue probabilities

$p_i = \sigma(s_i) = \frac{1}{1 + e^{-s_i}}.$

Loss Function: Binary cross-entropy between predicted $p_i$ and ground-truth labels $y_i \in \{0, 1\}$, with per-residue term

$\ell_i = -\left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right],$

aggregated over the $n_p$ residues.

This approach allows the model to learn residue-level binding site assignment from weakly or ambiguously labeled data.
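The sigmoid and binary cross-entropy steps can be sketched in plain Python. The scores and labels below are toy values, and averaging the loss over residues is an assumption made for the example, since the text does not state the reduction.

```python
import math

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over residues (mean reduction is an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

scores = [2.0, -1.0, 0.0]   # toy residue-level site scores
labels = [1, 0, 1]          # toy ground-truth binding labels
probs = [sigmoid(s) for s in scores]
loss = binary_cross_entropy(probs, labels)
print(round(loss, 4))  # 0.3778
```

A confidently correct prediction (score 2.0 with label 1) contributes little to the loss, while the uncertain score of 0.0 contributes the most.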

4. Inference, Clustering, and Ranking

Site prediction at inference employs probabilistic thresholding and spatial clustering to identify and rank binding pockets:

Thresholding: Residues whose predicted binding probability exceeds a threshold are retained [threshold value missing in source].
Clustering: Single-linkage clustering is performed on selected residues, using the predicted contact probabilities $C^p_{ik}$ as the affinity metric (inverse distance).
Cluster Scoring: Each cluster is assigned a score [scoring formula missing in source].
Cluster Selection: The top-$n$ clusters are selected, where $n$ matches the known number of binding pockets, to compute DCA (distance cutoff accuracy) success rates.

This inference strategy integrates probabilistic prediction, geometric affinity, and cluster ranking.
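The thresholding, clustering, and ranking steps can be sketched as follows. Several details are assumptions made only for illustration: the probability threshold and contact cutoff of 0.5, the use of summed residue probabilities as the cluster score, and the function name `predict_sites`. Single-linkage clustering cut at a fixed affinity level is equivalent to taking connected components of the graph whose edges are residue pairs above that level, which is what the union-find structure computes.

```python
def predict_sites(probs, contact, prob_threshold=0.5, contact_cutoff=0.5, top_n=1):
    """Threshold residues, cluster them by contact affinity, and rank the clusters."""
    # 1. Thresholding: keep residues with high predicted binding probability.
    kept = [i for i, p in enumerate(probs) if p > prob_threshold]

    # 2. Single-linkage clustering at a fixed cutoff via union-find.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for idx, a in enumerate(kept):
        for b in kept[idx + 1:]:
            if contact[a][b] >= contact_cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for i in kept:
        clusters.setdefault(find(i), []).append(i)

    # 3. Scoring and selection: summed probability is an assumed score.
    ranked = sorted(clusters.values(),
                    key=lambda c: sum(probs[i] for i in c),
                    reverse=True)
    return ranked[:top_n]

# Toy inputs: 6 residues with two contacting pairs among the retained ones.
probs = [0.9, 0.8, 0.1, 0.7, 0.6, 0.2]
n = len(probs)
contact = [[0.0] * n for _ in range(n)]
for i, k, c in [(0, 1, 0.9), (3, 4, 0.8), (1, 3, 0.1)]:
    contact[i][k] = contact[k][i] = c

print(predict_sites(probs, contact, top_n=2))  # [[0, 1], [3, 4]]
```

Because only the contact map is used as affinity, no 3D coordinates are needed at this stage.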

5. Architectural Hyperparameters and Training Procedure

Key network design and optimization parameters:

Module	Dimension / Setting	Details
Protein encoder	1280, project to 128	ESM-2-650M
Ligand encoder	4-layer GNN, hidden 128	ReLU
Interaction module	2 trigonometry blocks	Hidden 32, dropout 0.25
Linear head	[setting missing in source]	
Optimizer	Adam	Learning rate [value missing in source]
Batch size	8	
Max epochs	30	Early stopping patience 4

This configuration realizes a lightweight yet expressive model, balancing computational efficiency with predictive accuracy.
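For readers who want to carry these settings into code, one option is a small frozen dataclass. The class and field names are hypothetical; the values are the ones stated in this article, and the learning rate is left unset because it is not given in recoverable form.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LaMPSiteConfig:
    """Hypothetical container for the hyperparameters listed above."""
    protein_embed_dim: int = 1280        # ESM-2-650M embedding width
    projected_dim: int = 128             # width after the linear projection
    ligand_gnn_layers: int = 4
    ligand_hidden_dim: int = 128
    interaction_blocks: int = 2          # trigonometry-aware blocks
    interaction_hidden_dim: int = 32
    dropout: float = 0.25
    optimizer: str = "adam"
    learning_rate: Optional[float] = None  # not stated in the text above
    batch_size: int = 8
    max_epochs: int = 30
    early_stopping_patience: int = 4

cfg = LaMPSiteConfig()
print(cfg.projected_dim, cfg.batch_size, cfg.max_epochs)
```

Freezing the dataclass prevents accidental mutation of settings during a training run.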

6. Empirical Performance and Ablations

On the COACH420 benchmark dataset (DCA threshold of 4 Å), LaMPSite demonstrates competitive site prediction performance as shown below:

Method	Top-$n$ Success Rate (%)
LaMPSite (sequence only)	66.02
LaMPSite$^{\text{holo}}$ (contacts)	67.96
Fpocket	~54
DeepSite	~57
Kalasanty	~59
DeepPocket	67.96
P2Rank	68.24
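The success-rate metric in the table can be sketched as below. The article does not define DCA precisely, so this example assumes the convention common in pocket-prediction benchmarks: a case counts as a success if any of the selected predicted site centers lies within the cutoff of any ligand atom. How a center is derived from a residue cluster is not specified here, and the coordinates are toy values.

```python
import math

def dca_success(pred_center, ligand_atoms, cutoff=4.0):
    """True if the predicted center is within `cutoff` angstroms of any ligand atom."""
    return min(math.dist(pred_center, atom) for atom in ligand_atoms) <= cutoff

def success_rate(cases, cutoff=4.0):
    """Percentage of cases with at least one successful predicted center."""
    hits = sum(
        any(dca_success(center, ligand, cutoff) for center in centers)
        for centers, ligand in cases
    )
    return 100.0 * hits / len(cases)

# Each case: (list of predicted site centers, list of ligand atom coordinates).
cases = [
    ([(0.0, 0.0, 0.0)], [(1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]),   # about 1.73 angstroms: hit
    ([(10.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]),                   # 10 angstroms: miss
]
print(success_rate(cases))  # 50.0
```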

Inference time is approximately [value missing in source] seconds per protein on an NVIDIA V100.

Ablation results indicate contributions of core modules (Top-$n$ DCA at 4 Å):

Without interaction module: [value missing in source]
Without merging [symbols missing in source]: [value missing in source]
Without clustering/ranking: [value missing in source]

This suggests interaction modeling and ensemble merging are essential for optimal predictive performance, while clustering and ranking further improve site localization.

7. Applications and Implications

LaMPSite circumvents the need for experimental protein 3D structures, requiring only sequence and ligand graph inputs. With less than [value missing in source] of proteins possessing reliable structure information, this framework expands the applicability of binding site prediction to novel or poorly structurally characterized proteins. The method generates competitive results versus structure-based approaches (Zhang et al., 2023), supporting applications in protein function elucidation and drug discovery where resolved structures are not available. A plausible implication is the acceleration of in silico screening pipelines by reducing structural bottlenecks in target identification.


References

Zhang et al. (2023). Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence.
