LaMPSite Pipeline Overview

Updated 22 February 2026
  • LaMPSite is a computational framework that predicts 3D ligand binding sites using protein language models and graph neural networks from sequence and ligand graph inputs.
  • It integrates distinct protein and ligand encoding modules to extract residue-level embeddings and graph representations without requiring experimentally determined 3D structures.
  • The pipeline employs geometric-aware interaction updates, ensemble merging, and spatial clustering to achieve competitive performance on benchmark datasets and support drug discovery.

LaMPSite is an end-to-end computational pipeline for predicting 3D ligand binding sites from protein sequence, without the need for experimentally determined protein structures. The architecture combines protein language models and graph neural networks (GNNs) to infer protein-ligand interactions, leveraging residue-level embeddings, predicted contact maps, ligand graph representations, and geometry-aware neural modules. Using only protein sequences and ligand molecular graphs as input, LaMPSite achieves binding site predictions competitive with structure-based baselines, which addresses a critical limitation for novel or poorly characterized proteins (Zhang et al., 2023).

1. Input Encoding and Feature Extraction

LaMPSite processes protein sequences and ligand molecular graphs through distinct but integrated encoding modules:

  • Protein Encoder (ESM-2-650M): Given a protein of length $n_p$, the ESM-2-650M protein language model computes residue-level embeddings

$$h^p = \begin{bmatrix} h^p_1, \dots, h^p_{n_p} \end{bmatrix}^\top \in \mathbb{R}^{n_p \times 1280}.$$

Simultaneously, ESM-2 produces a residue-residue contact probability matrix

$$C^p = \left[C^p_{ik}\right]_{i,k=1}^{n_p} \in [0,1]^{n_p \times n_p}.$$

Embeddings can be projected to a reduced dimension $d = 128$ via a learned linear transformation:

$$\widetilde{h}^p_i = W^p h^p_i + b^p \in \mathbb{R}^{128}.$$

  • Ligand Encoder (GNN): The ligand, modeled as an undirected graph $(V,E)$ with $n_l$ atoms, is represented using one-hot atom-type encodings $x_j$. A four-layer message-passing GNN generates hidden vectors:

$$h^l = \begin{bmatrix} h^l_1, \dots, h^l_{n_l} \end{bmatrix}^\top \in \mathbb{R}^{n_l \times 128}.$$

Independently, a 3D conformer is generated using RDKit, producing a ligand atom distance matrix:

$$D^l = \left[\|x_j - x_k\|_2\right]_{j,k=1}^{n_l} \in \mathbb{R}^{n_l \times n_l}.$$

This approach allows LaMPSite to exploit both sequence-derived and graph-based molecular features without requiring 3D protein coordinates.
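The shapes involved in this encoding stage can be illustrated with a small NumPy sketch. It is not the LaMPSite implementation: random arrays stand in for the ESM-2 embeddings and for the RDKit conformer coordinates, and the names `W_p`, `b_p`, and `coords` are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_l = 5, 3  # toy residue and ligand-atom counts

# Stand-in for ESM-2-650M output: one 1280-dimensional vector per residue.
h_p = rng.normal(size=(n_p, 1280))

# Learned projection to 128 dimensions (random placeholder weights here).
W_p = rng.normal(size=(128, 1280)) * 0.01
b_p = np.zeros(128)
h_p_tilde = h_p @ W_p.T + b_p  # shape (n_p, 128)

# Stand-in for conformer coordinates: one 3D position per ligand atom.
coords = rng.normal(size=(n_l, 3))

# Pairwise Euclidean distances between ligand atoms via broadcasting.
diff = coords[:, None, :] - coords[None, :, :]
D_l = np.linalg.norm(diff, axis=-1)  # shape (n_l, n_l)
```

The distance matrix is symmetric with a zero diagonal, which is a quick sanity check when swapping in real conformer coordinates.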

2. Protein–Ligand Interaction Embedding

LaMPSite constructs and iteratively updates protein-ligand interaction embeddings:

  • Initialization:

$$z_{ij} = \widetilde{h}^p_i \odot h^l_j,$$

where elementwise multiplication $\odot$ is performed for all residue–atom pairs, yielding $z \in \mathbb{R}^{n_p \times n_l \times 128}$.

  • Trigonometry-Aware Update (2 Blocks): Two stacked blocks perform geometry-aware updates using a variant of Evoformer pairwise updates. Each block applies three steps in order:

    1. Protein-Side Triangle Multiplication, informed by the predicted protein contact map $C^p$.

    2. Ligand-Side Triangle Multiplication, informed by the ligand distance matrix $D^l$.

    3. Residual and Feed-Forward.

    All projections and feed-forward networks use hidden size 32, ReLU activations, and dropout rate 0.25.

This mechanism enables contextualization of protein–ligand interactions with inferred geometric and topological signals, without requiring 3D coordinates.
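A minimal sketch of the initialization step, writing `z` for the interaction tensor (a name chosen here). The second function is only a hypothetical illustration of how a contact map could inject geometric context into the pair embeddings; it is a simplified stand-in and should not be read as the published triangle-multiplication update.

```python
import numpy as np

def init_interaction(h_p_tilde, h_l):
    """Elementwise product for every residue-atom pair.

    h_p_tilde: (n_p, d) projected residue embeddings
    h_l:       (n_l, d) ligand atom embeddings
    returns:   (n_p, n_l, d) interaction tensor
    """
    return h_p_tilde[:, None, :] * h_l[None, :, :]

def contact_weighted_update(z, C_p):
    """Hypothetical simplification (assumption, not the paper's formula):
    z'_ij = z_ij + sum_k C_p[i, k] * z_kj, i.e. each residue-atom pair
    receives information from residues predicted to be in contact."""
    return z + np.einsum("ik,kjd->ijd", C_p, z)
```

Broadcasting avoids an explicit double loop over residues and atoms, which matters because the tensor grows as $n_p \times n_l \times d$.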

3. Binding Site Prediction and Loss Function

Predictions are aggregated and evaluated through a multi-stage pooling and scoring procedure:

  • Merging and Scoring: The interaction embeddings are merged and pooled to give one site score per residue. A sigmoid transformation then converts each score into a residue-level binding probability.

  • Loss Function: Binary cross-entropy between the predicted residue probabilities and the ground-truth binding labels.

This approach allows the model to learn residue-level binding site assignment from weakly or ambiguously labeled data.
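The scoring and loss steps can be sketched as follows. The mean-pooling over ligand atoms and the mean reduction over residues are assumptions made for illustration; the text above does not specify either, and the function names are hypothetical.

```python
import numpy as np

def residue_probabilities(pair_scores):
    """pair_scores: (n_p, n_l) scalar score per residue-atom pair.

    Mean-pooling over ligand atoms is an assumption, not the
    documented pooling. A sigmoid maps each score to (0, 1).
    """
    s = pair_scores.mean(axis=1)
    return 1.0 / (1.0 + np.exp(-s))

def bce(p, y, eps=1e-7):
    """Binary cross-entropy averaged over residues (reduction assumed)."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

For example, a model that outputs probability 0.5 for every residue incurs a loss of $\ln 2 \approx 0.693$ regardless of the labels, which is a useful baseline when monitoring training.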

4. Inference, Clustering, and Ranking

Site prediction at inference employs probabilistic thresholding and spatial clustering to identify and rank binding pockets:

  • Thresholding: Residues whose predicted probability is above a fixed threshold are retained.
  • Clustering: Single-linkage clustering is performed on the selected residues, using the predicted contact map $C^p$ as the affinity metric (a proxy for inverse distance).
  • Cluster Scoring: Each cluster is assigned a score used for ranking.
  • Cluster Selection: The top-$n$ clusters are selected, where $n$ matches the known number of binding pockets, to compute DCA (distance cutoff accuracy) success rates.

This inference strategy integrates probabilistic prediction, geometric affinity, and cluster ranking.
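The inference steps can be sketched in plain Python. Cutting a single-linkage hierarchy at a fixed affinity level is equivalent to taking connected components of the graph whose edges join pairs with affinity at or above that level, so a union-find structure suffices. The thresholds `p_min` and `a_min` and the sum-of-probabilities cluster score are hypothetical choices for illustration, not values from the source.

```python
def cluster_and_rank(probs, affinity, p_min=0.5, a_min=0.5, top_n=1):
    """Threshold residues, cluster them by single linkage, rank clusters.

    probs:    per-residue binding probabilities
    affinity: square matrix of residue-residue affinities
    Thresholds and the scoring rule are assumptions for this sketch.
    """
    keep = [i for i, p in enumerate(probs) if p > p_min]
    parent = {i: i for i in keep}

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join any two retained residues whose affinity passes the cutoff.
    for a in keep:
        for b in keep:
            if a < b and affinity[a][b] >= a_min:
                parent[find(a)] = find(b)

    clusters = {}
    for i in keep:
        clusters.setdefault(find(i), []).append(i)

    # Assumed score: sum of member probabilities, highest first.
    ranked = sorted(clusters.values(),
                    key=lambda c: sum(probs[i] for i in c),
                    reverse=True)
    return ranked[:top_n]
```

With five residues where 0 and 1 are in contact, 3 and 4 are in contact, and residue 2 falls below the probability threshold, the function returns the two clusters `[0, 1]` and `[3, 4]` in that order when `top_n=2`.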

5. Architectural Hyperparameters and Training Procedure

Key network design and optimization parameters:

| Module | Dimension / Setting | Details |
|---|---|---|
| Protein encoder | 1280, projected to 128 | ESM-2-650M |
| Ligand encoder | 4-layer GNN, hidden 128 | ReLU |
| Interaction module | 2 trigonometry blocks | Hidden 32, dropout 0.25 |
| Linear head | | |
| Optimizer | Adam | |
| Batch size | 8 | |
| Max epochs | 30 | Early stopping patience 4 |

This configuration realizes a lightweight yet expressive model, balancing computational efficiency with predictive accuracy.
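As a sketch, the settings listed above can be collected in a configuration object together with a small early-stopping helper. The field names are hypothetical, the learning rate is left unset because no value is given here, and monitoring validation loss is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    esm_dim: int = 1280            # ESM-2-650M embedding size
    embed_dim: int = 128           # projected protein / ligand hidden size
    gnn_layers: int = 4
    interaction_blocks: int = 2
    interaction_hidden: int = 32
    dropout: float = 0.25
    batch_size: int = 8
    max_epochs: int = 30
    patience: int = 4
    learning_rate: Optional[float] = None  # not specified here; supply explicitly

def epochs_run(val_losses, cfg):
    """Number of epochs executed under early stopping.

    Assumes the monitored quantity is a validation loss (lower is better).
    """
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:cfg.max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= cfg.patience:
                return epoch
    return min(len(val_losses), cfg.max_epochs)
```

With a patience of 4, training stops at the fourth consecutive epoch that fails to improve on the best validation loss, or at epoch 30, whichever comes first.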

6. Empirical Performance and Ablations

On the COACH420 benchmark dataset (DCA threshold 4 Å), LaMPSite demonstrates competitive site prediction performance, as shown below:

| Method | Top-$n$ Success Rate (%) |
|---|---|
| LaMPSite (sequence only) | 66.02 |
| LaMPSiteᵌʰᵒˡᵒ (contacts) | 67.96 |
| Fpocket | ~54 |
| DeepSite | ~57 |
| Kalasanty | ~59 |
| DeepPocket | 67.96 |
| P2Rank | 68.24 |

Inference time is reported in seconds per protein on an NVIDIA V100.

Ablation results indicate the contributions of the core modules (Top-$n$ DCA at 4 Å). Three variants are compared against the full model:

  • Without the interaction module
  • Without the merging step
  • Without clustering/ranking

This suggests interaction modeling and ensemble merging are essential for optimal predictive performance, while clustering and ranking further improve site localization.
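For reference, the DCA success criterion used in these comparisons can be sketched as below. It follows the common definition in which a predicted site counts as a hit when its center lies within the cutoff of the nearest ligand atom. How a 3D center is derived from predicted residues is not covered here, so `center` is assumed to be given, and the function names are hypothetical.

```python
import numpy as np

def dca_success(center, ligand_coords, cutoff=4.0):
    """True if the predicted site center is within `cutoff` angstroms
    of at least one ligand atom."""
    ligand = np.asarray(ligand_coords, dtype=float)
    d = np.linalg.norm(ligand - np.asarray(center, dtype=float), axis=1)
    return bool(d.min() <= cutoff)

def success_rate(results):
    """Percentage of successful predictions over a benchmark."""
    return 100.0 * sum(results) / len(results)
```

A success rate of 66.02% therefore means that roughly two out of three evaluated binding sites had a predicted center within 4 Å of the ligand.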

7. Applications and Implications

LaMPSite circumvents the need for experimental protein 3D structures, requiring only sequence and ligand graph inputs. Because only a small fraction of proteins have reliable structure information, this framework extends binding site prediction to novel or structurally poorly characterized proteins. The method produces results competitive with structure-based approaches (Zhang et al., 2023), supporting applications in protein function elucidation and drug discovery where resolved structures are not available. A plausible implication is faster in silico screening, since target identification no longer has to wait on structure determination.

References (1)
