LaMPSite Pipeline Overview

Updated 22 February 2026
  • LaMPSite is a computational framework that predicts 3D ligand binding sites using protein language models and graph neural networks from sequence and ligand graph inputs.
  • It integrates distinct protein and ligand encoding modules to extract residue-level embeddings and graph representations without requiring experimentally determined 3D structures.
  • The pipeline employs geometric-aware interaction updates, ensemble merging, and spatial clustering to achieve competitive performance on benchmark datasets and support drug discovery.

LaMPSite is an end-to-end computational pipeline for predicting 3D ligand binding sites from protein sequence, without the need for experimentally determined protein structures. The architecture combines protein language models and graph neural networks (GNNs) to infer protein-ligand interactions, using residue-level embeddings, predicted contact maps, ligand graph representations, and geometric-aware neural modules. With inputs limited to protein sequences and ligand molecular graphs, LaMPSite achieves binding site predictions competitive with structure-based baselines, which addresses a critical limitation for novel or poorly characterized proteins (Zhang et al., 2023).

1. Input Encoding and Feature Extraction

LaMPSite processes protein sequences and ligand molecular graphs through distinct but integrated encoding modules:

  • Protein Encoder (ESM-2-650M): Given a protein of length $n_p$, the ESM-2-650M protein language model computes residue-level embeddings

$$h^p = \begin{bmatrix} h^p_1, \dots, h^p_{n_p} \end{bmatrix}^\top \in \mathbb{R}^{n_p \times 1280}.$$

Simultaneously, ESM-2 produces a residue-residue contact probability matrix

$$C^p = \left[C^p_{ik}\right]_{i,k=1}^{n_p} \in [0,1]^{n_p \times n_p}.$$

Embeddings can be projected to a reduced dimension $d = 128$ via a learned linear transformation:

$$\widetilde{h}^p_i = W^p h^p_i + b^p \in \mathbb{R}^{128}.$$

  • Ligand Encoder (GNN): The ligand, modeled as an undirected graph $(V,E)$ with $n_l$ atoms, is represented using one-hot atom-type encodings $x_j$. A four-layer message-passing GNN generates hidden vectors:

$$h^l = \begin{bmatrix} h^l_1, \dots, h^l_{n_l} \end{bmatrix}^\top \in \mathbb{R}^{n_l \times 128}.$$

Independently, a 3D conformer is generated using RDKit, producing a ligand atom distance matrix (in this formula $x_j$ denotes the 3D coordinates of atom $j$ in the conformer, not its one-hot encoding):

$$D^l = \left[\|x_j - x_k\|_2\right]_{j,k=1}^{n_l} \in \mathbb{R}^{n_l \times n_l}.$$

This approach allows LaMPSite to exploit both sequence-derived and graph-based molecular features without requiring 3D protein coordinates.
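
The shapes involved can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: random arrays stand in for the ESM-2 embeddings, the learned projection weights, and the RDKit conformer coordinates, and the helper names `project_embeddings` and `pairwise_distances` are hypothetical rather than part of LaMPSite.

```python
import numpy as np

def project_embeddings(h_p, W, b):
    """Apply the linear map h~_i = W h_i + b to every residue embedding."""
    # h_p: (n_p, 1280), W: (128, 1280), b: (128,)
    return h_p @ W.T + b

def pairwise_distances(coords):
    """Distance matrix D[j, k] = ||x_j - x_k||_2 for 3D atom coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
n_p, n_l = 5, 4                            # toy sizes (assumption)
h_p = rng.normal(size=(n_p, 1280))         # stand-in for ESM-2 embeddings
W = rng.normal(size=(128, 1280)) * 0.01    # stand-in for learned weights
b = np.zeros(128)                          # stand-in for learned bias
h_p_tilde = project_embeddings(h_p, W, b)  # (n_p, 128)

coords = rng.normal(size=(n_l, 3))         # stand-in for conformer coordinates
D_l = pairwise_distances(coords)           # (n_l, n_l), symmetric, zero diagonal
```

In a real pipeline the embeddings and contact map would come from the protein language model and the coordinates from a generated conformer; only the array shapes and the two formulas are meant to carry over.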

2. Protein–Ligand Interaction Embedding

LaMPSite constructs and iteratively updates protein-ligand interaction embeddings:

  • Initialization:

$$z^{(0)}_{ij} = \widetilde{h}^p_i \odot h^l_j \in \mathbb{R}^{128},$$

where elementwise multiplication $\odot$ is performed for all residue–atom pairs, yielding $z^{(0)} \in \mathbb{R}^{n_p \times n_l \times 128}$.

  • Trigonometry-Aware Update (2 Blocks): Two stacked blocks perform geometric-aware updates using a variant of Evoformer pairwise updates:

    1. Protein-Side Triangle Multiplication:

    $$A^{(t)}_{ij} = \sum_{k=1}^{n_p} \mathrm{softmax}_k\left( Q^{(t)}_q z^{(t)}_{ij} + Q^{(t)}_k C^p_{ik} \right) z^{(t)}_{kj}$$

    2. Ligand-Side Triangle Multiplication:

    $$B^{(t)}_{ij} = \sum_{k=1}^{n_l} \mathrm{softmax}_k\left( R^{(t)}_q z^{(t)}_{ij} + R^{(t)}_k D^l_{jk} \right) z^{(t)}_{ik}$$

    3. Residual and Feed-Forward:

    $$\widehat{z}^{(t)}_{ij} = z^{(t)}_{ij} + \mathrm{LayerNorm}\left(A^{(t)}_{ij} + B^{(t)}_{ij}\right)$$

    $$z^{(t+1)}_{ij} = \widehat{z}^{(t)}_{ij} + \mathrm{Dropout}\left(\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\widehat{z}^{(t)}_{ij}\right)\right)\right)$$

    All projections and feed-forward networks use hidden size $32$, ReLU activations, dropout rate $0.25$.

This mechanism enables contextualization of protein–ligand interactions with inferred geometric and topological signals, without requiring 3D coordinates.
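
The initialization and the two triangle updates can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes the query projections ($Q_q$, $R_q$) are $d \times d$ matrices and the key projections ($Q_k$, $R_k$) are length-$d$ vectors that scale the scalar contact or distance entry per channel, it uses a toy width $d = 8$ with random weights, and it omits the feed-forward network, dropout, and learned LayerNorm parameters. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_interaction(h_p_tilde, h_l):
    """z0[i, j] = h~p_i * h^l_j (elementwise), shape (n_p, n_l, d)."""
    return h_p_tilde[:, None, :] * h_l[None, :, :]

def protein_side(z, C_p, W_q, w_k):
    """A[i, j] = sum_k softmax_k(W_q z[i, j] + w_k * C_p[i, k]) * z[k, j]."""
    q = z @ W_q.T                                     # (n_p, n_l, d)
    bias = C_p[:, :, None] * w_k                      # (n_p, n_p, d), indexed [i, k, c]
    logits = q[:, :, None, :] + bias[:, None, :, :]   # (n_p, n_l, n_p, d)
    weights = softmax(logits, axis=2)                 # normalize over residues k
    return np.einsum("ijkc,kjc->ijc", weights, z)

def ligand_side(z, D_l, W_q, w_k):
    """B[i, j] = sum_k softmax_k(W_q z[i, j] + w_k * D_l[j, k]) * z[i, k]."""
    q = z @ W_q.T                                     # (n_p, n_l, d)
    bias = D_l[:, :, None] * w_k                      # (n_l, n_l, d), indexed [j, k, c]
    logits = q[:, :, None, :] + bias[None, :, :, :]   # (n_p, n_l, n_l, d)
    weights = softmax(logits, axis=2)                 # normalize over atoms k
    return np.einsum("ijkc,ikc->ijc", weights, z)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n_p, n_l, d = 6, 4, 8                                 # toy sizes (assumption)
h_p_tilde = rng.normal(size=(n_p, d))
h_l = rng.normal(size=(n_l, d))
C_p = rng.uniform(size=(n_p, n_p))                    # stand-in contact probabilities
D_l = np.abs(rng.normal(size=(n_l, n_l)))             # stand-in ligand distances
W_q, w_k = rng.normal(size=(d, d)), rng.normal(size=d)
R_q, r_k = rng.normal(size=(d, d)), rng.normal(size=d)

z0 = init_interaction(h_p_tilde, h_l)
A = protein_side(z0, C_p, W_q, w_k)
B = ligand_side(z0, D_l, R_q, r_k)
z_hat = z0 + layer_norm(A + B)                        # residual step only
```

Because the softmax weights sum to one over $k$, each entry of `A` is a convex combination of the entries of `z0` along the residue axis, and each entry of `B` is one along the atom axis.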

3. Binding Site Prediction and Loss Function

Predictions are aggregated and evaluated through a multi-stage pooling and scoring procedure:

  • Merging and Scoring:

$$Z_{ij} = W_o\left(z^{(0)}_{ij} + z^{(T)}_{ij}\right) \in \mathbb{R}^{n_p \times n_l \times 1}$$

Residue-level site scores are given by

$$s_i = \sum_{j=1}^{n_l} Z_{ij}$$

A sigmoid transformation yields residue probabilities

$$p_i = \sigma(s_i) = \frac{1}{1+e^{-s_i}}$$

  • Loss Function: Binary cross-entropy between predicted $p_i$ and ground-truth labels $y_i \in \{0,1\}$:

$$\mathcal{L} = -\frac{1}{n_p} \sum_{i=1}^{n_p} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right]$$

This approach allows the model to learn residue-level binding site assignment from weakly or ambiguously labeled data.
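
A short NumPy sketch of the merging, pooling, and loss steps follows. It assumes the linear head $W_o$ is a bias-free weight vector `w_o` and clips probabilities before taking logarithms for numerical safety; both choices, and the function names, are illustrative assumptions rather than details from the source.

```python
import numpy as np

def site_probabilities(z0, zT, w_o):
    """Merge first and last interaction embeddings, score pairs, pool over atoms."""
    Z = (z0 + zT) @ w_o               # (n_p, n_l): one scalar per residue-atom pair
    s = Z.sum(axis=1)                 # (n_p,): residue-level scores s_i
    p = 1.0 / (1.0 + np.exp(-s))      # sigmoid gives residue probabilities p_i
    return s, p

def bce_loss(p, y, eps=1e-12):
    """Mean binary cross-entropy between probabilities p and 0/1 labels y."""
    p = np.clip(p, eps, 1.0 - eps)    # avoid log(0); an added safeguard
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

rng = np.random.default_rng(0)
n_p, n_l, d = 5, 3, 8                 # toy sizes (assumption)
z0 = rng.normal(size=(n_p, n_l, d))
zT = rng.normal(size=(n_p, n_l, d))
w_o = rng.normal(size=d)
y = np.array([1.0, 0.0, 0.0, 1.0, 0.0])

s, p = site_probabilities(z0, zT, w_o)
loss = bce_loss(p, y)
```

As a sanity check on the formulas, a score of $s_i = 0$ for every residue gives $p_i = 0.5$ and a loss of $\log 2$ regardless of the labels.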

4. Inference, Clustering, and Ranking

Site prediction at inference employs probabilistic thresholding and spatial clustering to identify and rank binding pockets:

  • Thresholding: Residues with $p_i \geq 0.63$ are retained.
  • Clustering: Single-linkage clustering is performed on selected residues, using $C^p_{ik}$ as the affinity metric (inverse distance).
  • Cluster Scoring: Each cluster $\mathcal{C}$ is scored as $\sum_{i \in \mathcal{C}} s_i^2$.
  • Cluster Selection: The top-$n$ clusters are selected, where $n$ matches the known number of binding pockets, to compute DCA (distance cutoff accuracy) success rates.

This inference strategy integrates probabilistic prediction, geometric affinity, and cluster ranking.
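
The four steps above can be sketched in plain Python with NumPy. One detail is an assumption of this sketch and not stated in the source: single-linkage clusters are cut at a fixed affinity level, the hypothetical `contact_cutoff` parameter, so that two retained residues end up in the same cluster whenever a chain of contacts with probability at least that cutoff connects them. The function name `rank_sites` is likewise hypothetical.

```python
import numpy as np

def rank_sites(p, s, C_p, n_sites, p_min=0.63, contact_cutoff=0.5):
    """Threshold residues, cluster them by contact affinity, rank clusters."""
    keep = [int(i) for i in np.flatnonzero(p >= p_min)]
    parent = {i: i for i in keep}      # union-find forest over kept residues

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Single linkage at a fixed cutoff equals connected components of the
    # graph whose edges are residue pairs with affinity >= contact_cutoff.
    for a in keep:
        for b in keep:
            if a < b and C_p[a, b] >= contact_cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for i in keep:
        clusters.setdefault(find(i), []).append(i)

    # Score each cluster by the sum of squared residue scores, highest first.
    scored = [(float(np.sum(s[m] ** 2)), sorted(m)) for m in clusters.values()]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:n_sites]

s = np.array([2.0, 1.5, -1.4, 0.9, 3.0, -2.2])   # toy residue scores
p = 1.0 / (1.0 + np.exp(-s))                     # residues 0, 1, 3, 4 pass 0.63
C_p = np.zeros((6, 6))
C_p[0, 1] = C_p[1, 0] = 0.9                      # residues 0 and 1 in contact
C_p[3, 4] = C_p[4, 3] = 0.8                      # residues 3 and 4 in contact

top = rank_sites(p, s, C_p, n_sites=2)
```

In this toy case the cluster of residues 3 and 4 ranks first because its squared scores sum to roughly 9.81, ahead of 6.25 for residues 0 and 1.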

5. Architectural Hyperparameters and Training Procedure

Key network design and optimization parameters:

| Module | Dimension / Setting | Details |
|---|---|---|
| Protein encoder | $d_p = 1280$, projected to 128 | ESM-2-650M |
| Ligand encoder | 4-layer GNN, hidden 128 | ReLU |
| Interaction module | 2 trigonometry blocks | Hidden 32, dropout 0.25 |
| Linear head | $W_o$ | $\mathbb{R}^{32} \to \mathbb{R}^1$ |
| Optimizer | Adam | Learning rate $5\times10^{-4}$ |
| Batch size | 8 | |
| Max epochs | 30 | Early stopping patience 4 |

This configuration realizes a lightweight yet expressive model, balancing computational efficiency with predictive accuracy.
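
For reference, the listed settings could be collected in a single configuration object, as in the sketch below. The class name and field names are hypothetical; only the values are taken from the settings listed in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LaMPSiteConfig:
    """Hypothetical container for the hyperparameters listed in this section."""
    protein_encoder: str = "ESM-2-650M"
    protein_embed_dim: int = 1280       # d_p, before projection
    projected_dim: int = 128            # after the learned linear projection
    ligand_gnn_layers: int = 4
    ligand_hidden_dim: int = 128
    interaction_blocks: int = 2         # trigonometry-aware blocks
    interaction_hidden_dim: int = 32
    dropout: float = 0.25
    optimizer: str = "Adam"
    learning_rate: float = 5e-4
    batch_size: int = 8
    max_epochs: int = 30
    early_stopping_patience: int = 4

config = LaMPSiteConfig()
```

Freezing the dataclass keeps the settings immutable once a run is configured, which makes accidental changes during training harder.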

6. Empirical Performance and Ablations

On the COACH420 benchmark dataset (DCA $\leq 4$ Å), LaMPSite demonstrates competitive site prediction performance as shown below:

| Method | Top-$n$ Success Rate (%) |
|---|---|
| LaMPSite (sequence only) | 66.02 |
| LaMPSite$^{\text{holo}}$ (contacts) | 67.96 |
| Fpocket | ~54 |
| DeepSite | ~57 |
| Kalasanty | ~59 |
| DeepPocket | 67.96 |
| P2Rank | 68.24 |

Inference time is approximately $0.2$ seconds per protein on an NVIDIA V100.

Ablation results indicate contributions of core modules (Top-$n$ DCA at 4 Å):

  • Without interaction module: 62.40%
  • Without merging $z^{(0)}$ and $z^{(T)}$: 63.50%
  • Without clustering/ranking: 65.18%

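Measured against the 66.02% of the sequence-only model, removing the interaction module costs 3.62 points and removing the merge of the initial and final interaction embeddings costs 2.52 points, while dropping clustering and ranking costs 0.84 points. This suggests that interaction modeling and merging contribute most to predictive performance, while clustering and ranking further improve site localization.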

7. Applications and Implications

LaMPSite circumvents the need for experimental protein 3D structures, requiring only sequence and ligand graph inputs. With less than 50% of proteins possessing reliable structure information, this framework expands the applicability of binding site prediction to novel or poorly structurally characterized proteins. The method generates competitive results versus structure-based approaches (Zhang et al., 2023), supporting applications in protein function elucidation and drug discovery where resolved structures are not available. A plausible implication is the acceleration of in silico screening pipelines by reducing structural bottlenecks in target identification.
