CASF-2016 Benchmark Overview
- CASF-2016 Benchmark is a standardized platform that quantifies scoring, ranking, docking, and screening capabilities in protein-ligand binding models.
- It uses a rigorously curated core set from PDBbind with strict protocols and no-leakage measures to ensure unbiased performance comparisons.
- The benchmark has catalyzed innovative approaches integrating graph-based descriptors, deep learning, and dynamic features in structure-based drug design.
The Comparative Assessment of Scoring Functions 2016 (CASF-2016) Benchmark is a rigorous community standard designed to evaluate the predictive accuracy of scoring functions for protein–ligand binding affinity, ranking, docking, and screening power within the structure-based drug design landscape. Built upon the PDBbind dataset's "core set," CASF-2016 unifies dataset composition, evaluation protocols, and statistical metrics, enabling reproducible comparison of machine learning, deep learning, and empirical scoring methodologies at multiple levels of structural and chemical granularity. Since its establishment, CASF-2016 has driven rapid advances in both model architecture and feature engineering, catalyzing the development of state-of-the-art algorithms for in silico lead optimization and virtual screening.
1. Dataset Construction and Evaluation Protocols
CASF-2016 is structured around a core test set derived from PDBbind v2016, comprising 285 high-quality protein–ligand complexes; some studies additionally report results on an expanded 2,850-complex variant (Rana et al., 2023). Selection criteria for inclusion require crystallographic resolution better than 2.5 Å, experimentally measured affinities (pK_d or pK_i), chemical diversity among ligands, and target diversity (57 distinct protein targets). Each target protein is represented by five ligands spanning a roughly 4–5 log-unit affinity range (Min et al., 2022).
Standard practice holds out this core set for external validation, while model selection, early stopping, and hyperparameter tuning are performed using the “refined” or “general” subsets of PDBbind after removing all core complexes (Wang et al., 2021, Nguyen et al., 2018). Validation and training splits are either random or stratified according to chemical/structural features. Strict “no-leakage” protocols remove from training any complex that exceeds similarity thresholds (e.g., protein sequence identity >90% and ligand ECFP4 Tanimoto >0.9) with any member of the core set (Gao et al., 12 Jan 2026).
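The ligand side of such a leakage filter can be sketched in a few lines of pure Python. This is a minimal illustration, not any cited pipeline's code: fingerprints are represented as sets of on-bit indices (in practice these would come from a cheminformatics toolkit such as RDKit's ECFP4 implementation), and the data layout `(complex_id, fingerprint)` is a hypothetical simplification.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def remove_leaky_complexes(train, core, threshold=0.9):
    """Drop training complexes whose ligand fingerprint exceeds the
    Tanimoto threshold against ANY core-set ligand.
    Hypothetical layout: each complex is (complex_id, fingerprint_bitset)."""
    kept = []
    for cid, fp in train:
        if all(tanimoto(fp, core_fp) <= threshold for _, core_fp in core):
            kept.append((cid, fp))
    return kept

# Toy data: t1 shares 3 of 5 union bits with c1 (Tanimoto = 0.6)
train = [("t1", {1, 2, 3, 4}), ("t2", {10, 11})]
core = [("c1", {1, 2, 3, 5})]
print(remove_leaky_complexes(train, core, threshold=0.5))  # only t2 survives
```

A full CASF-style filter would apply the analogous test on the protein side (sequence identity) and drop a training complex if either criterion is violated.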
2. Benchmark Tasks: The Four "Powers" and Their Metrics
CASF-2016 establishes four principal axes for scoring function evaluation:
- Scoring Power: Measures correlation between predicted and experimental binding affinities across unrelated targets. Formal metrics are:
  - Pearson’s correlation coefficient: $R_p = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, where $x_i$ and $y_i$ are predicted and experimental affinities.
  - Root-mean-square error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - y_i)^2}$.
  - Mean absolute error (MAE) as an auxiliary metric (Rana et al., 15 Sep 2025, Gao et al., 12 Jan 2026).
- Ranking Power: Assesses the ability to recover the correct ordinal ranking of ligands per target protein, via Spearman’s rank correlation ($\rho$) and, less commonly, Kendall’s $\tau$ and the Predictive Index (PI): $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between experimental and predicted ranks of ligand $i$ (Min et al., 2022, Gao et al., 12 Jan 2026).
- Docking Power: Evaluates how often a scoring function ranks the crystallographic pose (or a near-native pose with RMSD < 2 Å) highest among up to 100 docked decoys. Metrics are the top-1, top-2, and top-3 success rates (Gao et al., 12 Jan 2026).
- Screening Power: Quantifies enrichment of true binders over decoys in virtual screening, commonly via the enrichment factor (EF) and the area under the ROC curve (AUC); many ML-focused studies, however, report only scoring or ranking power (Rana et al., 2023).
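The scoring-, ranking-, and docking-power metrics above can be sketched in pure Python. This is an illustrative toy computation on invented data, not the official CASF-2016 evaluation scripts:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between predicted and experimental affinities."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(pred, exp):
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(pred))

def spearman_rho(pred, exp):
    """Spearman rank correlation via the rank-difference formula
    (assumes no ties, as in the per-target 5-ligand CASF clusters)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rp, rx = ranks(pred), ranks(exp)
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rx))
    return 1 - 6 * d2 / (n * (n * n - 1))

def docking_success_rate(per_complex, top_n=1):
    """Fraction of complexes where a near-native pose (RMSD < 2 A) appears
    among the top_n best-scored poses. Each entry is a list of
    (score, rmsd) tuples; lower score = better pose by convention here."""
    hits = 0
    for poses in per_complex:
        best = sorted(poses, key=lambda p: p[0])[:top_n]
        if any(r < 2.0 for _, r in best):
            hits += 1
    return hits / len(per_complex)

# Toy example: the 5 ligands of one hypothetical CASF target cluster
exp_aff = [4.2, 5.1, 6.3, 7.0, 8.4]
pred_aff = [4.5, 5.0, 6.0, 7.5, 8.1]
print(round(pearson_r(pred_aff, exp_aff), 3))
print(round(rmse(pred_aff, exp_aff), 3))
print(spearman_rho(pred_aff, exp_aff))  # perfect ordering -> 1.0
```

Note that CASF-2016 ranking power is computed per target cluster and then averaged over the 57 clusters, rather than over the pooled core set.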
3. Feature Engineering Paradigms
CASF-2016 has fostered diverse representations:
- Multiscale Weighted Colored Subgraphs (MWCG): Each protein–ligand complex is parsed into atom-type pairs weighted by radial basis kernels, typically the generalized exponential kernel $\Phi(r) = e^{-(r/\eta)^{\kappa}}$ or the Lorentz kernel $\Phi(r) = \frac{1}{1 + (r/\eta)^{\kappa}}$ (with $r$ the interatomic distance and $\eta$, $\kappa$ scale and shape parameters), capturing both short- and longer-range cross-molecular interactions. Atom types are defined by element, hybridization, aromaticity, or extended connectivity (e.g., SYBYL atom types in sybylGGL-Score) (Rana et al., 2023). Weighted subgraph features are highly compositional; fine-grained atom typing (e.g., SYBYL) correlates with superior scoring power.
- Radial Distance-Shell Encodings: Residue–atom contact counts within concentric spherical shells (OnionNet-2) provide rotation-invariant spatial fingerprints, with optimal shell depth empirically determined (N=62 for OnionNet-2) (Wang et al., 2021).
- Algebraic Graph Spectra: Graph Laplacian eigenvalues, adjacency spectra, and related invariants encode local rigidity, flexibility, and global topology (e.g., AGL-Score) (Nguyen et al., 2018).
- Contact Maps and Probabilistic Mixture Models: MDN-based models (e.g., PLANET v2.0) predict pairwise distance distributions and the associated energetic contributions, enabling direct biophysical interpretability (Gao et al., 12 Jan 2026).
- Dynamic Structural Ensembles: Incorporation of MD-derived networks into geometric GNNs (Dynaformer), enabling direct modeling of enthalpic and entropic binding determinants (Min et al., 2022).
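To make the graph-kernel paradigms above concrete, the following sketch builds a Lorentz-kernel-weighted bipartite adjacency between protein and ligand atoms and summarizes its graph-Laplacian spectrum, in the spirit of MWCG and AGL-Score features. It is illustrative only: the parameter values (η = 5.0, κ = 2.0) and the choice of scalar summaries are arbitrary, not those of any cited model.

```python
import numpy as np

def lorentz_kernel(r, eta=5.0, kappa=2.0):
    """Lorentz radial basis kernel: decays smoothly with distance r (angstrom)."""
    return 1.0 / (1.0 + (r / eta) ** kappa)

def bipartite_laplacian_spectrum(protein_xyz, ligand_xyz, eta=5.0, kappa=2.0):
    """Kernel-weighted bipartite graph between protein and ligand atoms;
    returns the sorted eigenvalues of the graph Laplacian L = D - A."""
    p, l = len(protein_xyz), len(ligand_xyz)
    n = p + l
    A = np.zeros((n, n))
    for i, pi in enumerate(protein_xyz):
        for j, lj in enumerate(ligand_xyz):
            r = np.linalg.norm(np.asarray(pi) - np.asarray(lj))
            w = lorentz_kernel(r, eta, kappa)
            A[i, p + j] = A[p + j, i] = w  # cross-molecular edges only
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))

def spectral_features(eigvals):
    """A few scalar invariants commonly derived from graph spectra."""
    return {
        "sum": float(eigvals.sum()),   # trace of L = total weighted degree
        "max": float(eigvals.max()),
        "fiedler": float(eigvals[1]),  # algebraic connectivity
    }

# Toy coordinates: two protein atoms and one ligand atom
protein = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(1.5, 1.0, 0.0)]
eigs = bipartite_laplacian_spectrum(protein, ligand)
print(spectral_features(eigs))
```

In the real descriptors, one such weighted subgraph (and spectrum) is built per atom-type pair and per kernel scale, and the resulting statistics are concatenated into the feature vector fed to the regressor.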
4. Representative Algorithms and Quantitative Performance
The evolution of CASF-2016 models is reflected in rising performance metrics and algorithmic innovations.
| Method | Pearson r | RMSE (units as reported) | Spearman ρ | Docking SR@1 (%) | Notes |
|---|---|---|---|---|---|
| sybylGGL-Score (Rana et al., 2023) | 0.873 | – | – | – | MWCG, SYBYL atom types |
| Dynaformer (Min et al., 2022) | 0.858 | 1.114 (pK) | 0.865 | – | MD ensemble, Graphormer GNN |
| DeepGGL (Rana et al., 15 Sep 2025) | 0.868 | 1.150 (pK) | – | – | Bipartite subgraph, attention |
| OnionNet-2 (Wang et al., 2021) | 0.864 | 1.164 (pK) | – | – | Shell contacts, CNN |
| PLANET v2.0 (Gao et al., 12 Jan 2026) | 0.848 | 1.171 (pK) | 0.669 | 85.2 | MDN, multi-objective GNN |
| AGL-Score (Nguyen et al., 2018) | 0.835 | 1.732 (kcal/mol) | – | – | Algebraic graph invariants |
- sybylGGL-Score employs chemically specific atom typing and achieves the highest reported Pearson r on the 2,850-core test set (r=0.873).
- Dynaformer, based on dynamic structural data and a Graphormer backbone, attains r=0.858, RMSE=1.114, and Spearman ρ=0.865 on 285 complexes.
- PLANET v2.0 shows strong performance across all three powers: scoring (r=0.848, RMSE=1.171), ranking (ρ=0.669), and docking (SR@1=85.2%).
- DeepGGL sets r=0.868, RMSE=1.150, leveraging multiscale bipartite subgraphs and attention.
- OnionNet-2 (r=0.864, RMSE=1.164) encodes residue–atom shell counts through a compact CNN.
- AGL-Score demonstrates competitive scoring ability with r=0.835, RMSE=1.732, based on spectral invariants.
Absolute performance differences between the top models are small: on the order of 0.004–0.007 in Pearson r, or a 0.014–0.040 reduction in RMSE (pK units) (Rana et al., 15 Sep 2025, Wang et al., 2021).
5. Model Design Principles: Graph Representations and Learning Strategies
CASF-2016 has promoted several recurring design choices:
- Graph-based Features: Atom- and residue-level graphs with edge weights from generalized exponential or Lorentz kernels are standard (Rana et al., 2023, Rana et al., 15 Sep 2025, Nguyen et al., 2018).
- Multi-scale Kernels: Models incorporating both short- and medium-range descriptors consistently outperform single-scale analogues (Rana et al., 2023).
- Mixture Density Networks: PLANET v2.0’s MDN head parameterizes both interatomic distance and distance–energy profiles, integrating these into the final binding affinity via analytical expectation over learned distributions (Gao et al., 12 Jan 2026).
- Multi-objective Training: Simultaneous minimization of affinity loss, geometry reconstruction, interatomic distance fitting, and decoy discrimination is exemplified by PLANET v2.0, enforcing physical plausibility and robust generalization (Gao et al., 12 Jan 2026).
- Attention Mechanisms: DeepGGL and Dynaformer employ attention to focus model capacity on spatially and chemically salient interactions (Rana et al., 15 Sep 2025, Min et al., 2022).
- Ensemble and Data Augmentation: Top models often utilize ensemble predictions (averaging over multiple runs) and adversarial augmentation (FLAG in Dynaformer) to stabilize and improve metrics (Min et al., 2022).
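The mixture-density idea noted above can be illustrated with a minimal numerical sketch. All values here are hypothetical: a real MDN head predicts the mixture parameters per atom pair with a neural network, whereas this fragment only shows the analytical step, namely that the first moment of a Gaussian mixture is the weight-averaged mean.

```python
import math

def mixture_expectation(weights, means):
    """E[x] under a Gaussian mixture sum_i w_i * N(mu_i, sigma_i^2);
    the variance terms drop out of the first moment."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(w * m for w, m in zip(weights, means))

def mixture_pdf(x, weights, means, sigmas):
    """Density of the mixture at x (the quantity maximized when an
    MDN-style model scores how well a pose fits the predicted geometry)."""
    total = 0.0
    for w, m, s in zip(weights, means, sigmas):
        total += w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return total

# Hypothetical two-component predicted distance distribution (angstrom)
w, mu, sd = [0.7, 0.3], [3.2, 5.0], [0.4, 0.8]
print(mixture_expectation(w, mu))  # ~ 3.74
```

In a full model, such per-pair expectations (over distances or distance-energy profiles) are aggregated across all protein–ligand atom pairs to produce the final affinity estimate.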
6. Strengths, Limitations, and Implications for Drug Discovery
Strengths:
- CASF-2016’s core set with fixed protocols has standardized the comparison of increasingly sophisticated graph-based, kernel, and deep learning scoring functions.
- Models with chemically granular atom typing (e.g., SYBYL) and explicit modeling of dynamic ensembles (Dynaformer) have set new benchmarks for affinity prediction accuracy.
- The design of multi-objective architectures (e.g., PLANET v2.0) enables simultaneous optimization of scoring, ranking, and docking power—enhancing practical virtual screening applications (Gao et al., 12 Jan 2026).
Limitations:
- A significant fraction of studies focuses exclusively on scoring power, neglecting docking and screening power; sybylGGL-Score, for example, reports only scoring metrics (Rana et al., 2023).
- Most models presume known crystallographic poses, limiting direct applicability to realistic docking scenarios unless specific pose-recovery modules are integrated (Nguyen et al., 2018, Wang et al., 2021).
- Reported improvements are sometimes modest in absolute terms, suggesting diminishing returns from further incremental changes; robust statistical tests and large-scale prospective validations remain critical.
A plausible implication is that integration of chemically detailed graph-based descriptors, explicit dynamic sampling, and multi-task objectives, together with rigorous CASF-2016 evaluation, will continue driving state-of-the-art advances in scoring functions for structure-based drug design.
7. Future Directions
The trajectory of CASF-2016-driven research suggests several converging trends:
- Expansion of atom-typing strategies and kernel-based descriptors to encode finer physicochemical detail.
- Routine incorporation of dynamic structural information (as in Dynaformer) to capture conformational entropy.
- Advancement of unified, task-general networks (such as PLANET v2.0) capable of top-tier performance across all four CASF-2016 powers (Gao et al., 12 Jan 2026).
- Systematic benchmarking on larger, more chemically diverse, and prospectively validated datasets to further enhance real-world prediction reliability.
In sum, CASF-2016 remains the gold standard for quantifying the progress of affinity prediction and docking models, and it has directly shaped both algorithmic innovation and evaluation culture in computational drug discovery (Rana et al., 2023, Gao et al., 12 Jan 2026, Rana et al., 15 Sep 2025, Wang et al., 2021, Nguyen et al., 2018, Min et al., 2022).