CASF-2016: Comparative Scoring Function Benchmark

Updated 4 February 2026

The paper establishes a systematic benchmarking framework that rigorously evaluates scoring, ranking, docking, and screening powers using standardized protein–ligand complexes from PDBbind.
It compares classical, ML-enhanced, and hybrid scoring functions by quantifying performance with metrics such as Pearson’s R, Top-1 docking success, and enrichment factors.
The study provides actionable insights for optimizing virtual screening protocols in drug discovery while highlighting challenges like training–test similarity and generalization on novel chemistries.

The Comparative Assessment of Scoring Functions 2016 (CASF-2016) is a systematic benchmarking framework designed to provide standardized, multi-dimensional evaluation of protein–ligand scoring functions for structure-based drug design and virtual screening. It emphasizes rigorous comparison across multiple facets of performance—scoring, ranking, docking, and screening power—using unified data, pose sets, and protocols. CASF-2016 has become the community reference for reporting and analyzing the efficacy of both classical and machine learning–based scoring functions, enabling direct, fair assessment of methodological advances in the field.

1. Benchmark Structure and Evaluation Metrics

CASF-2016 encompasses four principal “powers,” each quantified by specific metrics and standardized datasets. The test platform consists of 285 nonredundant protein–ligand complexes and associated decoy poses, derived from the experimentally curated PDBbind database. The evaluation metrics are as follows:

Scoring Power: Measures the Pearson correlation coefficient ($R$) between predicted and experimental binding affinities ($\Delta G_\mathrm{bind}$) using the crystal (native) pose. Supplementary metrics include mean absolute error (MAE) and root-mean-square error (RMSE).
Ranking Power: Assesses the ability to order congeneric ligands by predicted affinity for the same target, typically using Spearman’s rank correlation coefficient ($\rho$), averaged over 57 protein targets.
Docking Power: Quantifies the success rate ($S(n)$) at identifying near-native poses (RMSD $<$ 2 Å) as the top-$n$ ranked predictions among up to 100 decoys per complex. Top-1, Top-2, and Top-3 rates are standard.
Screening Power: Calculates early enrichment of true binders in large decoy libraries, reporting enrichment factor at 1% (EF$_{1\%}$) and screening success rate @1%.
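
As a rough illustration of the first two metrics above, the following Python sketch computes Pearson's $R$, RMSE, and MAE for scoring power, and a per-target Spearman $\rho$ averaged over targets for ranking power. The function names and the toy affinities are illustrative assumptions; they are not taken from the CASF-2016 distribution or its evaluation scripts.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(pred, exp):
    """Root-mean-square error between predictions and experiment."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(pred))


def mae(pred, exp):
    """Mean absolute error between predictions and experiment."""
    return sum(abs(p - e) for p, e in zip(pred, exp)) / len(pred)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation, computed as Pearson's R on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def ranking_power(per_target):
    """Mean Spearman rho over targets; per_target maps name -> (pred, exp)."""
    rhos = [spearman_rho(pred, exp) for pred, exp in per_target.values()]
    return sum(rhos) / len(rhos)


# Toy data, made up for illustration: predicted vs. experimental affinities.
toy = {
    "target_A": ([6.1, 7.3, 5.0, 8.2, 6.8], [5.9, 7.0, 5.5, 8.5, 6.2]),
    "target_B": ([4.2, 6.6, 5.1, 7.9, 6.0], [4.8, 6.1, 6.3, 7.5, 5.2]),
}
all_pred = [p for pred, _ in toy.values() for p in pred]
all_exp = [e for _, exp in toy.values() for e in exp]
print("R:", pearson_r(all_pred, all_exp))
print("RMSE:", rmse(all_pred, all_exp), "MAE:", mae(all_pred, all_exp))
print("mean rho:", ranking_power(toy))
```

Scoring power pools all complexes into a single correlation, whereas ranking power is computed within each target and then averaged, so a method can do well on one and poorly on the other.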

Rigorous test–train splits, standardized pose generation (including RMSD computation via a symmetry-corrected Hungarian algorithm), and fair statistical protocols ensure performance comparability across methods (2206.13345, Moon et al., 2020, Gao et al., 12 Jan 2026, Wang et al., 2021).
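
The docking and screening metrics defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: lower scores are treated as better (energy-like), each pose is a hypothetical `(score, rmsd_to_native)` tuple, each library entry is a hypothetical `(score, is_binder)` tuple, and the enrichment factor is taken as the binder rate in the top fraction divided by the binder rate in the whole library.

```python
def docking_success_rate(complexes, top_n=1, rmsd_cutoff=2.0):
    """Fraction of complexes with a near-native pose among the top_n scores.

    complexes: list of pose lists, each pose a (score, rmsd_to_native) tuple.
    Assumes lower score = better pose.
    """
    hits = 0
    for poses in complexes:
        top = sorted(poses, key=lambda p: p[0])[:top_n]
        if any(rmsd < rmsd_cutoff for _, rmsd in top):
            hits += 1
    return hits / len(complexes)


def enrichment_factor(scored, fraction=0.01):
    """Enrichment of binders in the top-scoring fraction of a library.

    scored: list of (score, is_binder) tuples; assumes lower score = better
    and at least one binder in the library.
    """
    ranked = sorted(scored, key=lambda s: s[0])
    n_top = max(1, round(len(ranked) * fraction))
    binders_top = sum(1 for _, is_binder in ranked[:n_top] if is_binder)
    binders_all = sum(1 for _, is_binder in ranked if is_binder)
    return (binders_top / n_top) / (binders_all / len(ranked))


# Toy data, made up for illustration.
toy_complexes = [
    [(-9.0, 1.2), (-8.0, 5.0), (-7.5, 3.1)],  # best-scored pose is near-native
    [(-7.0, 4.0), (-6.5, 1.0), (-6.0, 6.2)],  # near-native pose is ranked second
]
# 200 compounds, 4 binders, 2 of which are the two best-scored entries.
toy_library = [(-10.0, True), (-9.5, True)] + [
    (-5.0 + 0.01 * i, i in (50, 100)) for i in range(198)
]

print("Top-1:", docking_success_rate(toy_complexes, top_n=1))
print("Top-2:", docking_success_rate(toy_complexes, top_n=2))
print("EF1%:", enrichment_factor(toy_library, fraction=0.01))
```

For a scoring function whose output is affinity-like (higher is better), the sort order in both functions would need to be reversed.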

2. Major Classes of Scoring Functions and Methodological Advances

CASF-2016 has underpinned the comparative evolution of scoring functions across multiple paradigms:

Classical Empirical/Physics-based Scores: Examples include AutoDock Vina and X-Score, combining analytic van der Waals, hydrogen bonding, and hydrophobic surface terms (Wang et al., 2021).
Random Forest and ML-enhanced Empirical Scores: ΔVinaRF20 augments Vina-style atom-pair features with a random forest regressor to improve both docking and scoring power (Wang et al., 2021).
3D Convolutional Neural Networks: AK-Score ensembles and ResAtom-Score deploy deep residual or convolutional architectures on 3D voxelized representations, enabling direct spatial feature extraction of interaction patterns (Wang et al., 2021).
Graph Neural Networks with Physics-informed Modules: PIGNet hierarchically decomposes the binding affinity using physics-inspired equations parameterized by graph nets, introducing inductive bias for enhanced generalization and interpretability (Moon et al., 2020).
Mixture Density and Statistical Potentials: PLANET v2.0 leverages Mixture Density Networks (MDNs) to model distance and energy distributions of residue–atom pairs, enabling differentiable statistical potentials for pose evaluation and affinity prediction (Gao et al., 12 Jan 2026).
Differentiable Hybrid Optimization Frameworks: The DeepRMSD+Vina hybrid score fuses a neural predicted RMSD proxy and the differentiable Vina score, integrated directly into end-to-end gradient-based pose optimization (2206.13345).

Each class is evaluated on the same CASF-2016 standard, facilitating transparent performance tracking.
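
To make the idea of a differentiable hybrid score more concrete, the sketch below minimizes a weighted sum of two smooth terms by gradient descent. Everything in it is an assumption for illustration: the weighted-sum form and the weight, the two toy functions standing in for the predicted-RMSD proxy and the Vina energy, the translation-only three-coordinate "pose", and the use of central finite differences in place of automatic differentiation. It is not the DeepRMSD+Vina implementation.

```python
import math


def rmsd_proxy(pose):
    """Hypothetical stand-in for a network-predicted RMSD (smooth, >= 0.5)."""
    x, y, z = pose
    return math.sqrt((x - 1.0) ** 2 + (y - 2.0) ** 2 + (z + 0.5) ** 2 + 0.25)


def energy_term(pose):
    """Hypothetical stand-in for a differentiable Vina-like energy."""
    x, y, z = pose
    return 0.3 * ((x - 1.2) ** 2 + (y - 1.8) ** 2 + (z + 0.4) ** 2) - 8.0


def hybrid_score(pose, weight=0.5):
    """Assumed combination: weighted sum of the two terms (lower is better)."""
    return weight * rmsd_proxy(pose) + (1.0 - weight) * energy_term(pose)


def numerical_gradient(f, pose, h=1e-5):
    """Central finite differences, used here instead of autodiff."""
    grad = []
    for i in range(len(pose)):
        up = list(pose)
        down = list(pose)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def optimize_pose(f, pose, learning_rate=0.1, steps=500):
    """Plain gradient descent on the pose coordinates."""
    pose = list(pose)
    for _ in range(steps):
        grad = numerical_gradient(f, pose)
        pose = [p - learning_rate * g for p, g in zip(pose, grad)]
    return pose


start = [4.0, -1.0, 3.0]
final = optimize_pose(hybrid_score, start)
print("score before:", hybrid_score(start))
print("score after: ", hybrid_score(final))
```

In a real framework the pose would also include rotational and torsional degrees of freedom, and gradients would come from the autodiff engine rather than from finite differences.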

3. Quantitative Performance and Comparative Analysis

Direct comparison of leading methods on CASF-2016 reveals distinct trends. The following table summarizes principal metrics:

Method	Scoring Power (R)	Docking Power (Top 1)	Screening EF $_{1\%}$
ResAtom-Score (ens.)	0.833	N/A (not for pose selection)	N/A (MAE: 1.09 reported instead)
ΔVinaRF20	0.816	89.1%	1.13
PLANET v2.0	0.848	85.2%	N/A on CASF-2016
PIGNet (ensemble)	0.761	87.0%	19.6
DeepRMSD+Vina	N/A	95.4%	N/A
AutoDock Vina	0.565–0.604	84.6–90.2%	7.7
X-Score	0.631	63.5%	1.46

DeepRMSD+Vina reports the highest docking power in this comparison (Top 1: 95.4%), outperforming both classical and ML-based methods, including 3D-CNN pose predictors and ensemble random-forest approaches (2206.13345). ResAtom-Score achieves strong scoring power (R = 0.833) on native poses, and the ΔVinaRF20/ResAtom-Score pipeline comes close to that native-pose performance when no crystal pose is available (Wang et al., 2021). PIGNet performs well in both docking and screening power, yielding EF$_{1\%}$ $\sim$ 19.6, substantially above classical methods (Moon et al., 2020). PLANET v2.0 reports the highest scoring power in the table (R = 0.848) together with competitive docking power (Top 1: 85.2%) on CASF-2016, under stringent exclusion of training overlap and with rapid evaluation (Gao et al., 12 Jan 2026).

4. Architectural Innovations and Regularization Strategies

Significant progress in scoring function performance on CASF-2016 is driven by methodological advances in architectural design and optimization:

3D-CNN and Attention: ResAtom-Score’s ResNet backbone with channel–spatial attention and voxelized electron-density channels focuses on salient interaction features and shape complementarity (Wang et al., 2021).
Physics-Informed Parameterization: PIGNet encodes four key interaction terms (van der Waals, hydrogen bond, metal, hydrophobic) via neural network–parameterized functions, combined with an entropic rotor penalty. Differentiated loss penalties and derivative regularization ensure sharper binding-pose wells and improved generalization under pose and ligand augmentations (Moon et al., 2020).
Mixture Density and Multi-Task Supervision: PLANET v2.0’s MDN outputs residue–atom contact likelihoods and energy mixtures, linking microstate potentials to macroscopic affinity predictions via mathematical expectation. Auxiliary tasks (decoy discrimination, atom/residue recovery) improve feature learning beyond affinity regression (Gao et al., 12 Jan 2026).
End-to-End Differentiable Scoring: In the DeepRMSD+Vina framework, an MLP takes 1,470-dimensional structural encodings as input, and its output is combined with the Vina energy. Because the combined score is differentiable, ligand poses can be optimized by gradient descent using automatic differentiation (2206.13345).
Uncertainty Quantification: PIGNet employs Monte Carlo dropout to estimate epistemic uncertainty in predictions; this filtering improves screening reliability (Moon et al., 2020).
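
The Monte Carlo dropout idea in the last item can be illustrated with a toy model: dropout is left switched on at prediction time, the forward pass is repeated many times, and the spread of the outputs serves as an uncertainty estimate used to filter candidates. The linear model, its weights, the dropout rate, and the standard-deviation cut-off below are all hypothetical and bear no relation to PIGNet's actual architecture.

```python
import random
import statistics


def stochastic_prediction(features, weights, bias, p_drop, rng):
    """One forward pass of a toy linear model with dropout left on."""
    scale = 1.0 / (1.0 - p_drop)          # inverted-dropout rescaling
    total = bias
    for f, w in zip(features, weights):
        if rng.random() >= p_drop:        # keep input with probability 1 - p_drop
            total += w * f * scale
    return total


def mc_dropout(features, weights, bias, p_drop=0.2, n_samples=200, seed=0):
    """Mean and standard deviation over repeated stochastic forward passes."""
    rng = random.Random(seed)
    samples = [
        stochastic_prediction(features, weights, bias, p_drop, rng)
        for _ in range(n_samples)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


def keep_confident(candidates, weights, bias, max_std):
    """Keep candidates whose predictive spread is at most max_std."""
    kept = []
    for name, features in candidates:
        mean, std = mc_dropout(features, weights, bias)
        if std <= max_std:
            kept.append((name, mean, std))
    return kept


# Toy inputs, made up for illustration.
weights, bias = [1.0, 1.0, 1.0, 1.0], 5.0
candidates = [
    ("ligand_stable", [0.0, 0.0, 0.0, 0.0]),      # output never changes
    ("ligand_noisy", [10.0, 10.0, 10.0, 10.0]),   # output varies strongly
]
for name, mean, std in keep_confident(candidates, weights, bias, max_std=0.5):
    print(name, mean, std)
```

The filter trades coverage for reliability: a tighter `max_std` discards more candidates but leaves predictions that the model reproduces consistently across dropout masks.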

Data augmentation, strict test/train separation (e.g., exclusion of “soft overlap” in PLANET v2.0), and bootstrapping are frequently used for robust generalization and uncertainty estimation.
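
Bootstrapping, mentioned above, is commonly used to attach a confidence interval to a benchmark statistic such as Pearson's $R$. The sketch below is a generic percentile bootstrap, not the exact CASF-2016 protocol; the number of resamples, the 90% level, and the toy data are assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return None
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def bootstrap_interval(pred, exp, n_boot=1000, alpha=0.10, seed=0):
    """Percentile bootstrap interval for Pearson's R.

    Complexes are resampled with replacement; degenerate resamples
    (zero variance) are skipped. Inputs must not be constant.
    """
    rng = random.Random(seed)
    n = len(pred)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([pred[i] for i in idx], [exp[i] for i in idx])
        if r is not None:
            estimates.append(r)
    estimates.sort()
    lower = estimates[int((alpha / 2.0) * n_boot)]
    upper = estimates[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lower, upper


# Toy data, made up for illustration.
pred = [5.1, 6.3, 7.8, 4.9, 6.6, 8.1, 5.7, 7.2, 6.0, 4.4]
exp = [5.5, 6.0, 7.1, 5.2, 7.0, 8.4, 5.1, 6.8, 6.5, 4.9]
low, high = bootstrap_interval(pred, exp)
print("R:", pearson_r(pred, exp), "90% interval:", (low, high))
```

Overlapping intervals for two scoring functions are a hint that an apparent difference in $R$ may not be meaningful on a test set of this size.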

5. Practical Applications and Limitations

CASF-2016 outcomes have direct implications for virtual screening and drug design workflows. Protocols such as Vina pose generation followed by ΔVinaRF20 ranking and ResAtom-Score rescoring yield highly accurate affinity predictions even without experimentally determined conformations (Wang et al., 2021). Hybrid optimization frameworks such as DeepRMSD+Vina deliver substantial gains (+15% Top 1) in redocking and cross-docking, directly translating to improved hit rates in real-world campaigns (2206.13345).
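
The three-stage protocol described above (generate poses, select one with a pose-ranking score, then rescore it for affinity) has a simple control flow, sketched here. The three callables are hypothetical placeholders supplied by the caller; they do not correspond to the real interfaces of Vina, ΔVinaRF20, or ResAtom-Score, and the sketch assumes a higher pose score means a better pose.

```python
def dock_rank_rescore(ligand, generate_poses, pose_score, affinity_score):
    """Pick the best-scored pose, then predict affinity on that pose only.

    generate_poses(ligand) -> list of poses       (hypothetical docking step)
    pose_score(pose)       -> float, higher=better (hypothetical pose ranker)
    affinity_score(pose)   -> float               (hypothetical rescoring model)
    """
    poses = generate_poses(ligand)
    if not poses:
        raise ValueError("pose generator returned no poses")
    best_pose = max(poses, key=pose_score)
    return best_pose, affinity_score(best_pose)


# Stub implementations, made up so the sketch runs end to end.
def fake_generate_poses(ligand):
    return [{"ligand": ligand, "pose_id": i, "offset": 0.5 * i} for i in range(4)]


def fake_pose_score(pose):
    return -abs(pose["offset"] - 1.0)   # peaks at pose_id 2


def fake_affinity_score(pose):
    return 6.0 + 0.1 * pose["pose_id"]


pose, affinity = dock_rank_rescore(
    "ligand_X", fake_generate_poses, fake_pose_score, fake_affinity_score
)
print(pose["pose_id"], affinity)
```

Separating pose selection from affinity prediction lets each stage use the model that is strongest for that task, which is the rationale for combining a high docking-power selector with a high scoring-power rescorer.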

A key limitation is that methods differ in their sensitivity to training–test similarity and in how well they generalize to novel chemistries or protein folds. PLANET v2.0 shows reduced scoring power on new entries from PDBbind v2024 (Pearson $R \approx 0.70$), and its screening enrichment is modest compared with methods like PIGNet, partly due to molecular size bias (Gao et al., 12 Jan 2026, Moon et al., 2020). Some ML-based functions (ResAtom-Dock, PIGNet) on their own provide lower docking power than the best empirical or hybrid scores and are therefore often deployed in tandem with high-performing pose selectors (Wang et al., 2021, Moon et al., 2020).
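
One common way to probe training–test similarity is to remove training ligands that are too similar to any test ligand before retraining. The sketch below uses Tanimoto similarity on set-valued fingerprints with a 0.8 cut-off. Both choices are assumptions for illustration; the source does not specify how any of the cited methods measure similarity or overlap.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_similar(training, test, threshold=0.8):
    """Keep training entries whose similarity to every test entry is below threshold.

    training, test: dicts mapping a name to a set of 'on' fingerprint bits
    (a hypothetical representation chosen for this sketch).
    """
    kept = {}
    for name, fingerprint in training.items():
        if all(tanimoto(fingerprint, t) < threshold for t in test.values()):
            kept[name] = fingerprint
    return kept


# Toy fingerprints, made up for illustration.
test_set = {"t1": {1, 2, 3, 4}}
training_set = {
    "a": {1, 2, 3, 4},   # identical to t1, similarity 1.0
    "b": {1, 2, 3, 5},   # similarity 3/5
    "c": {7, 8},         # similarity 0.0
}
print(sorted(drop_similar(training_set, test_set)))
```

Rerunning a benchmark at several thresholds shows how much of a method's reported performance depends on near-duplicates of test ligands in its training data.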

Screening power is often reported using proxy datasets (e.g., LIT-PCBA), since the CASF-2016 compound libraries remain small relative to real-world screens (Gao et al., 12 Jan 2026).

6. Interpretability and Future Perspectives

Physics-informed and interpretable scoring functions are increasingly viable. PIGNet, for instance, enables attribution of predicted affinity changes down to ligand substructures, leveraging learnable distance corrections that recapitulate subtle chemical effects observed experimentally (Moon et al., 2020). PLANET v2.0’s statistical energy module models both favorable and unfavorable interactions, improving interpretation of structure–activity relationships (Gao et al., 12 Jan 2026).

Results on CASF-2016 point to a field moving toward:

Greater physical grounding in deep models.
Explicit uncertainty quantification.
Multi-objective training for robustness to pose and chemical perturbations.
Hybrid differentiable frameworks for direct pose optimization.

This suggests future scoring function development will continue to integrate data-driven predictive accuracy, physical interpretability, and decision-theoretic reliability, with CASF-2016 (and its successors) remaining central as community benchmarks.