Learnable Protein-Ligand Scoring Functions
- Learnable protein-ligand scoring functions are deep learning models that predict binding affinities from 3D complex geometries using CNNs, GNNs, and equivariant transformers.
- They integrate diverse features such as atom types, distance binning, volumetric densities, and graph representations, and rely on curated datasets and rigorous benchmarks for robust training.
- These models excel in affinity prediction, pose selection, and virtual screening while facing challenges in generalization and transferability to novel targets.
Learnable protein-ligand scoring functions are machine-learned models that assign quantitative scores to specific protein–ligand complex geometries, aiming to predict binding affinity, identify near-native poses, or rank ligands for virtual screening. Unlike classical empirical scoring functions constructed from hand-tuned physical terms, learnable SFs leverage high-capacity models—especially deep neural networks—to extract statistical patterns from 3D complex structures and reproduce the outcome of experimental or computational binding assays. Modern research has shown that these models can outperform traditional approaches on standard benchmarks for affinity prediction, pose selection, and virtual screening, but has also highlighted challenges in transferability and generalization to novel targets.
1. Representations and Model Architectures
3D Grid and Convolutional Approaches
Canonical learnable scoring functions such as those in "Protein-Ligand Scoring with Convolutional Neural Networks" (Ragoza et al., 2016) and "Visualizing Convolutional Neural Network Protein-Ligand Scoring" (Hochuli et al., 2018) encode complexes as multi-channel regular 3D voxel grids. Atom types are mapped to discrete channels, each atom contributes a continuous density to the grid, and a deep stack of 3D convolutional layers learns to summarize the spatial patterns decisive for binding. Affinity regression and/or pose classification heads are trained with appropriate loss functions (MSE for affinities, cross-entropy for classification). Such models automatically learn physical interaction motifs but are not natively rotationally equivariant, so training typically relies on data augmentation with on-the-fly rotations and translations. Attention mechanisms (e.g., the channel and spatial attention in ResAtom-Score (Wang et al., 2021)) can further focus learning on key interaction regions.
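The density-embedding step can be made concrete with a short sketch. Below is a minimal NumPy illustration of multi-channel Gaussian voxelization; the function name, kernel width, and box parameters are illustrative assumptions rather than the exact scheme of any cited model (gnina-style implementations, for instance, truncate densities at atom-type-dependent radii):

```python
import numpy as np

def voxelize(coords, channels, n_channels, box_size=24.0, resolution=0.5, sigma=1.0):
    """Embed atoms as Gaussian densities on a multi-channel 3D grid.

    coords:   (N, 3) atom positions, pre-centered on the binding site
    channels: (N,) integer atom-type channel index per atom
    Returns a (n_channels, D, D, D) grid with D = box_size / resolution.
    """
    dim = int(box_size / resolution)
    grid = np.zeros((n_channels, dim, dim, dim), dtype=np.float32)
    axis = (np.arange(dim) + 0.5) * resolution - box_size / 2  # voxel centers
    for (x, y, z), c in zip(coords, channels):
        # A 3D Gaussian factorizes into per-axis 1D Gaussians.
        gx = np.exp(-((axis - x) ** 2) / (2 * sigma ** 2))
        gy = np.exp(-((axis - y) ** 2) / (2 * sigma ** 2))
        gz = np.exp(-((axis - z) ** 2) / (2 * sigma ** 2))
        grid[c] += gx[:, None, None] * gy[None, :, None] * gz[None, None, :]
    return grid
```

The resulting tensor feeds directly into a stack of 3D convolutions; applying a random rotation to `coords` before voxelization implements the augmentation described above.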
Rotation-Invariant Shell and Contact-Based Methods
An alternative encoding aggregates atomic contacts into rotationally invariant descriptors, such as in the "OnionNet" CNNs (Zheng et al., 2019, Wang et al., 2021). In these models, the protein–ligand interaction is featurized by counts of atom–atom or residue–atom contacts within concentric spherical shells, stratified by element types. 2D convolutions over the shell-by-type matrices learn higher-order spatial correlations without requiring explicit 3D convolution and at low computational cost. These representations guarantee strict invariance to rigid-body motions, enabling robust training and inference.
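A minimal sketch of this featurization, assuming a simplified element alphabet and uniform shell widths (OnionNet's actual shell boundaries, starting radius, and type scheme differ):

```python
import numpy as np
from itertools import product

ELEMENTS = ["C", "N", "O", "S", "P", "other"]  # illustrative type alphabet

def shell_contact_features(prot_xyz, prot_elem, lig_xyz, lig_elem,
                           n_shells=60, shell_width=0.5, r0=1.0):
    """Count protein-ligand atom pairs per (element pair, distance shell).

    The output (n_pairs, n_shells) matrix depends only on interatomic
    distances, so it is strictly invariant to rigid-body motion.
    """
    pair_index = {p: i for i, p in enumerate(product(ELEMENTS, ELEMENTS))}
    feats = np.zeros((len(pair_index), n_shells), dtype=np.float32)
    dists = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    for i, pe in enumerate(prot_elem):
        pe = pe if pe in ELEMENTS else "other"
        for j, le in enumerate(lig_elem):
            le = le if le in ELEMENTS else "other"
            shell = int(np.floor((dists[i, j] - r0) / shell_width))
            if 0 <= shell < n_shells:
                feats[pair_index[(pe, le)], shell] += 1
    return feats
```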
Graph Neural Networks and Equivariant Transformers
Recent models encode complexes as fully connected atomistic graphs with rich node and edge features (element, hybridization, bond order, distance encoding) and deploy message-passing GNNs or graph transformers as the core model (e.g., DeepRLI (Lin et al., 19 Jan 2024), IGN (Li et al., 2023)). Modern implementations incorporate physically motivated radial cutoffs, attention modulated by learnable envelopes, and edge-wise physics-inspired blocks. Equivariance to SE(3) can be ensured using spherical harmonics (as in "Equivariant Scalar Fields for Molecular Docking" (Jing et al., 2023)) or E(3)-equivariant message-passing (e.g., force-matching networks in (Brocidiacono et al., 31 May 2025)). Such architectures allow direct modeling of geometric features central to molecular recognition and interaction energetics.
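As a concrete illustration of distance-modulated message passing, the layer below uses a cosine cutoff envelope and a GRU node update; these are common design choices in this family, not the specific blocks of DeepRLI or IGN:

```python
import torch
import torch.nn as nn

class DistanceMPNNLayer(nn.Module):
    """One message-passing step on a protein-ligand atom graph.

    Messages are scaled by a smooth radial envelope so that pairwise
    interactions decay to zero at the cutoff radius.
    """
    def __init__(self, dim, cutoff=6.0):
        super().__init__()
        self.cutoff = cutoff
        self.msg = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU(),
                                 nn.Linear(dim, dim))
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, edge_index, dist):
        # h: (N, dim) node features; edge_index: (2, E); dist: (E,) edge lengths
        src, dst = edge_index
        envelope = 0.5 * (torch.cos(torch.pi * dist / self.cutoff) + 1.0)
        m = self.msg(torch.cat([h[src], h[dst], dist.unsqueeze(-1)], dim=-1))
        m = m * envelope.unsqueeze(-1)
        agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum incoming messages
        return self.upd(agg, h)
```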
Surface-, Manifold-, and Energy-Based Designs
Other approaches project the protein–ligand interface onto multiscale surface representations (e.g., EISA-score (Rana et al., 2022)) or train architectures that compute per-atom or per-pair energy terms, in analogy to classical force fields but parametrized by neural networks. For example, "Atomic Convolutional Networks" (Gomes et al., 2017) construct radial distance-based features, predict per-atom energy contributions, and sum them for the complex, the protein, and the ligand, taking the difference to yield a binding free energy prediction and thereby embedding a thermodynamic cycle directly in the model.
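A compact sketch of the per-atom energy summation and thermodynamic cycle, assuming per-atom radial features have already been computed (the two-layer MLP stands in for the atomic convolutions of Gomes et al.):

```python
import torch.nn as nn

class AtomicEnergyModel(nn.Module):
    """Predict per-atom energies and sum them into a system energy.

    Binding free energy is taken as the thermodynamic-cycle difference
    dG_bind ≈ E(complex) - E(protein) - E(ligand).
    """
    def __init__(self, n_feat, hidden=128):
        super().__init__()
        self.atom_net = nn.Sequential(nn.Linear(n_feat, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def energy(self, atom_feats):  # (N_atoms, n_feat) -> scalar
        return self.atom_net(atom_feats).sum()

    def forward(self, complex_f, protein_f, ligand_f):
        return self.energy(complex_f) - self.energy(protein_f) - self.energy(ligand_f)
```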
Differentiable Optimization and End-to-End Docking
Some learnable SFs (e.g., "Ligand Pose Optimization with Atomic Grid-Based CNNs" (Ragoza et al., 2017), DeepRMSD+Vina (2206.13345)) are constructed to be smoothly differentiable with respect to atomic coordinates, enabling direct gradient-based pose optimization. This property is essential for docking engines that refine ligand orientation and conformation in silico.
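Given a scoring model that is differentiable in ligand coordinates, pose refinement reduces to gradient descent on the score. A minimal PyTorch sketch; production docking engines typically optimize rigid-body and torsional degrees of freedom rather than unconstrained Cartesian coordinates as done here:

```python
import torch

def refine_pose(score_fn, lig_coords, prot_coords, steps=100, lr=0.05):
    """Gradient-based refinement of a ligand pose.

    score_fn(pose, prot_coords) must return a scalar that is
    differentiable w.r.t. pose, with lower values meaning better poses.
    """
    pose = lig_coords.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = score_fn(pose, prot_coords)
        score.backward()  # gradients flow through the network into the coordinates
        opt.step()
    return pose.detach()
```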
2. Input Features and Data Preparation
Standard feature spaces for learnable SFs encompass:
- Atom-type counts: Pairwise contact counts by element or chemical type.
- Distance binning: Fine-grained shells or bins to capture the spatial arrangement of contacts.
- Surface areas: Computed over element-specific interaction manifolds (e.g., EISA (Rana et al., 2022)).
- Physical terms: Partial charges, local secondary structure, hydrogen-bond donor/acceptor status, hydrophobicity.
- Volumetric densities: Continuous embedding of atom positions on uniform grids.
- Graph representations: Nodes for all heavy atoms and edges for geometric or chemical relationships.
Training data quality is crucial; semi-automated workflows (HiQBind (Wang et al., 2 Nov 2024)) and curated, "leak-proof" splits (LP-PDBBind (Li et al., 2023)) enforce strict data integrity by removing covalent complexes, supporting robust splitting by sequence/ligand/interaction similarity, and cross-matching structures with experimental binding assays (Kd, Ki, etc.).
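A leak-proof split can be sketched with scikit-learn's group-aware splitting, assuming each complex carries a precomputed cluster label (for instance from MMseqs2 sequence clustering; pocket- or ligand-similarity clusters plug in identically):

```python
from sklearn.model_selection import GroupShuffleSplit

def leak_proof_split(df, group_col="protein_cluster", test_frac=0.2, seed=0):
    """Split complexes so that no protein cluster spans train and test.

    df holds one row per complex; df[group_col] is a precomputed
    similarity-cluster label (hypothetical column name).
    """
    gss = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    train_idx, test_idx = next(gss.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]
```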
3. Training Protocols, Losses, and Optimization
Most learnable SFs are trained with supervised regression to experimental binding affinities or with classification/contrastive objectives for pose selection (illustrative loss sketches follow the list):
- Regression losses: MSE between predicted and experimental binding affinities (ΔG or pK); sometimes hybrid losses that combine correlation and RMSE terms.
- Classification/contrastive losses: Cross-entropy for pose discrimination (native vs. decoy), hinge-style (ReLU) contrastive losses for docked vs. cross-docked or active vs. decoy pairs, and triplet losses for ranking.
- Force-matching: For physics-informed SFs trained on simulation data (e.g., (Brocidiacono et al., 31 May 2025)), the network is optimized to reproduce MM/MD mean forces, approximating the free energy gradient directly.
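Illustrative sketches of the first two objectives (margins, reductions, and sign conventions are assumptions):

```python
import torch.nn.functional as F

def affinity_loss(pred_pk, true_pk):
    """Supervised regression to experimental pK values."""
    return F.mse_loss(pred_pk, true_pk)

def pose_contrastive_loss(score_native, score_decoy, margin=1.0):
    """Hinge-style contrastive loss: native poses should outscore
    decoys by at least `margin` (higher score = better pose)."""
    return F.relu(margin - (score_native - score_decoy)).mean()
```

A force-matching sketch appears under Section 5, where physics-informed training is discussed.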
Optimization typically uses modern variants of stochastic gradient descent (Adam, momentum), regularized with L2 weight decay, early stopping, dropout, and data augmentation (atom position randomization, protein–ligand shuffling). Model ensembling is a standard practical measure for variance reduction.
4. Performance Benchmarks and Generalization
Key metrics for learnable SF evaluation include the following (a small metric-computation sketch follows the list):
- Scoring power: Pearson correlation (R), root-mean-square error (RMSE) between predicted and experimental affinities on benchmark sets such as CASF-2013/2016 (Zheng et al., 2019, Wang et al., 2021, Wang et al., 2021, Rana et al., 2022).
- Ranking power: Cluster-based accuracy and Spearman's ρ for ligand ranking on individual targets.
- Docking power: Top-N success rate, defined as the fraction of cases where a pose scored in the top N by the SF lies within a small RMSD threshold (commonly 2 Å) of the crystal pose; typically measured on the CASF-2016 docking power benchmark or on redocking/cross-docking sets (e.g., a 95.4% Top-1 success rate for DeepRMSD+Vina (2206.13345)).
- Virtual screening enrichment: Early enrichment factors (EF_x%, i.e., the proportion of actives recovered in the top x% of the ranked library relative to random selection) (Khamis et al., 2016, Brocidiacono et al., 31 May 2025).
- Robustness to input pose: Consistency when scoring docked vs. crystal structures (Zheng et al., 2019, Wang et al., 2021).
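The headline metrics are simple to compute; a small NumPy sketch (the convention that higher scores indicate likely actives is an assumption; energy-style SFs rank in the opposite direction):

```python
import numpy as np

def pearson_r(pred, true):
    return float(np.corrcoef(pred, true)[0, 1])

def rmse(pred, true):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2)))

def enrichment_factor(scores, is_active, top_frac=0.01):
    """EF_x%: hit rate among the top-scoring fraction divided by the
    overall hit rate (assumes at least one active in the library)."""
    scores = np.asarray(scores)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(top_frac * len(scores))))
    top = np.argsort(-scores)[:n_top]
    return float(is_active[top].mean() / is_active.mean())
```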
Generalization remains the central challenge. On standard benchmarks, contemporary SFs routinely achieve R ≳ 0.80 (CASF-2016), but performance drops sharply (average R ≈ 0.47, min R ≈ 0.05) when evaluated on truly novel pockets or out-of-distribution targets under pocket-similarity-based splits (Kopko et al., 5 Dec 2025). Overly optimistic horizontal splits obscure the generalization gap, whereas vertical (leave-protein-out or pocket-cluster-based) splits reveal real-world transferability limits (Pellicani et al., 2022, Li et al., 2023).
Recent work demonstrates that self-supervised pretraining (e.g., ATOMICA) and test-target adaptation methods (early stopping, fine-tuning on small amounts of new data) can partially bridge the generalization gap (Kopko et al., 5 Dec 2025); wider adoption of rigorous, leak-proof benchmarks remains widely advocated.
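Test-target adaptation in this sense amounts to brief fine-tuning on the few labeled complexes available for the new target. A hypothetical sketch (learning rate, epoch count, and the omission of early stopping are simplifications):

```python
import torch

def adapt_to_target(model, support_loader, lr=1e-4, epochs=5):
    """Fine-tune a pretrained scoring model on a small support set of
    (features, affinity) pairs from the new target."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for feats, y in support_loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(feats).squeeze(-1), y)
            loss.backward()
            opt.step()
    return model
```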
5. Extensions: Flexibility, Physics, and Multi-objective Learning
Accounting for Protein Flexibility
Rigid receptor assumptions induce severe errors in classical docking. FlexVDW (Suriana et al., 2023) explicitly learns a van der Waals energy function that is tolerant of induced fit. By supervising on the minimum VDW energy over multiple holo conformers for a given ligand pose, an equivariant deep network implicitly recognizes which receptor motifs (loops, side-chains) can accommodate the ligand, improving near-native pose recovery (e.g., Top-1 hit rate increase from ~12% to ~28% in flexible cases).
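The supervision idea can be sketched in a few lines, assuming a differentiable VDW energy term and an ensemble of holo receptor conformers (names are hypothetical; FlexVDW's actual architecture and training protocol differ):

```python
import torch

def flexible_vdw_target(vdw_energy_fn, lig_pose, holo_receptors):
    """Min-over-conformers VDW target: a pose is penalized only if it
    clashes with every holo conformer, so contacts resolvable by
    induced fit are tolerated."""
    energies = torch.stack([vdw_energy_fn(lig_pose, rec) for rec in holo_receptors])
    return energies.min()
```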
Incorporating Physics and Simulations
Hybrid models directly trained on simulation data (LFM (Brocidiacono et al., 31 May 2025)) leverage force-matching to learn potentials of mean force for each target without requiring experimental binding data. This avenue permits the transfer of high-fidelity MD-derived information into ML-based virtual screening on novel targets.
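Force matching trains the network so that the negative gradient of its predicted energy reproduces reference mean forces from simulation. A minimal PyTorch sketch (the energy_fn interface is an assumption):

```python
import torch
import torch.nn.functional as F

def force_matching_loss(energy_fn, coords, ref_forces):
    """Match -dE/dx of a learned energy to reference mean forces.

    coords:     (N, 3) atomic coordinates
    ref_forces: (N, 3) mean forces estimated from MM/MD sampling
    """
    coords = coords.clone().requires_grad_(True)
    energy = energy_fn(coords).sum()
    pred_forces = -torch.autograd.grad(energy, coords, create_graph=True)[0]
    return F.mse_loss(pred_forces, ref_forces)
```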
Multi-objective Architectures
Frameworks such as DeepRLI (Lin et al., 19 Jan 2024) synthesize multiple objectives—affinity prediction, docking (pose selection), and screening (active–decoy discrimination)—via parallel loss terms and physical inductive priors (e.g., Vinardo-style pairwise terms). This results in “universal” SFs with competitive performance across scoring, ranking, docking, and screening tasks, particularly relevant for structure-based drug design pipelines with diverse requirements.
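Schematically, such a framework combines per-task losses computed from heads on a shared trunk; the head names and weights below are illustrative, not DeepRLI's actual interface:

```python
import torch.nn.functional as F

def multi_objective_loss(out, batch, w=(1.0, 1.0, 1.0)):
    """Weighted sum of scoring, docking, and screening objectives.

    out:   dict of head outputs from a shared trunk (hypothetical keys)
    batch: dict with experimental 'pk' labels
    """
    l_score = F.mse_loss(out["affinity"], batch["pk"])
    l_dock = F.relu(1.0 - (out["pose_native"] - out["pose_decoy"])).mean()
    l_screen = F.relu(1.0 - (out["active_score"] - out["decoy_score"])).mean()
    return w[0] * l_score + w[1] * l_dock + w[2] * l_screen
```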
6. Limitations, Best Practices, and Future Directions
Current Limitations
- Benchmark overfitting: Standard test sets often fail to measure real-world transfer. Data leakage inflates reported accuracy (Li et al., 2023).
- Lack of explicit solvent/entropy: Most SFs ignore entropic/solvent terms, relying primarily on spatial and physicochemical encoding.
- Sensitivity to pose and chemotype: Many models degrade if the input pose deviates from the crystallographic geometry or for rare scaffolds.
Best Practices
- Adopting carefully curated and split datasets (LP-PDBBind (Li et al., 2023), HiQBind (Wang et al., 2 Nov 2024)).
- Reporting multiple metrics (RMSE, R, EF_x%) and both average and worst-case performance.
- Including rigorous OOD evaluation using pocket- or cluster-based splits, and leveraging per-target fine-tuning when possible (Kopko et al., 5 Dec 2025, Pellicani et al., 2022).
- Explicitly benchmarking robustness across both redocked and cross-docked scenarios.
Promising Directions
- Integration of flexible receptor models (internalized or explicit protein motions) (Suriana et al., 2023).
- Coupling ML scores directly to physics-based gradient information or simulation data (Brocidiacono et al., 31 May 2025).
- Development of equivariant and graph-based architectures to encode 3D geometric priors (Jing et al., 2023, Lin et al., 19 Jan 2024).
- Exploration of joint or multi-task learning frameworks that combine affinity, pose, and screening objectives.
In summary, learnable protein-ligand scoring functions now comprise a rich landscape of methods tailored to affinity prediction, pose selection, and screening. The field is moving toward unified architectures capable of fast, robust, and generalizable predictions, but persistent challenges in generalization and physical realism motivate ongoing development (Suriana et al., 2023, Zheng et al., 2019, Kopko et al., 5 Dec 2025, Jing et al., 2023).