PoseBusters: Protein–Ligand Docking Evaluation

Updated 17 September 2025

PoseBusters is a benchmark dataset and validation framework for protein–ligand docking that integrates both geometric accuracy (RMSD ≤ 2 Å) and comprehensive chemical plausibility tests.
The toolkit rigorously evaluates stereochemistry, bonding, bond lengths, planarity, and energy plausibility to ensure that only physically realistic binding poses, termed PB-valid, are accepted.
Its application in both classical and AI-driven docking workflows has highlighted the benefits of hybrid approaches, where post-docking energy minimization improves PB-valid rates and overall model reliability.

PoseBusters denotes a collection of biochemically and physically rigorous benchmark datasets and validation frameworks for protein–ligand docking, together with an associated toolkit that systematically evaluates the chemical and energetic plausibility of predicted binding poses. Originally introduced as part of a comprehensive assessment paradigm for deep learning-based docking algorithms, PoseBusters shifts the evaluation focus away from conventional RMSD-only criteria toward dual geometric and physical validity metrics, providing a robust external standard for both classical and AI-driven docking workflows.

1. Dataset Structure and Validation Criteria

The PoseBusters dataset is composed of protein–ligand complexes released from 2021 onward, so that it tests model generalization to novel structures (Sarigun et al., 24 Jun 2025). One established configuration includes 428 complexes of drug-like molecules, while a benchmark subset introduced for comprehensive method comparison comprises 308 recently released complexes not present in standard PDB training splits (Morehead et al., 2024, Balytskyi et al., 4 Feb 2025). Evaluation combines geometric accuracy (usually defined as heavy-atom, symmetry-aware RMSD ≤ 2 Å) with a suite of chemical plausibility tests, as implemented in the PoseBusters toolkit (Buttenschoen et al., 2023).

The validation framework systematically checks:

Stereochemistry and bonding: Conservation of molecular formula, connectivity, tetrahedral chirality, and double bond configuration, evaluated via InChI string matching.
Bond lengths and angles: Must fall within bounds of $[0.75, 1.25]$ times reference values from distance geometry.
Planarity: All atoms of aromatic rings or double bonds must lie within 0.25 Å of the best-fit plane.
Intramolecular clashes: Minimum heavy atom distances must exceed 0.75× the sum of van der Waals radii.
Energy plausibility: An energy ratio (pose UFF energy divided by mean ETKDG-conformer energies) threshold of 100 is used to flag overly strained poses.
Intermolecular overlap: The volume overlap of the ligand with protein/cofactor must not exceed 7.5 % for scaled van der Waals volumes.

Only predicted poses that pass all of these checks are denoted “PB-valid,” representing physically realistic binding conformations.
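The checks listed above can be read as a conjunction of threshold tests. The following is a minimal sketch of that logic, not the PoseBusters toolkit's own code: the `PoseMeasurements` container, its field names, and the functions `run_checks` and `is_pb_valid` are hypothetical, and the sketch assumes the per-pose quantities (ratios, deviations, energies) have already been computed elsewhere. Only the thresholds are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PoseMeasurements:
    """Hypothetical container of precomputed per-pose quantities."""
    inchi_matches: bool              # formula, connectivity, stereo preserved
    bond_length_ratios: List[float]  # observed / reference, one per bond
    bond_angle_ratios: List[float]   # observed / reference, one per angle
    max_plane_deviation: float       # Å, worst atom over rings and double bonds
    min_distance_ratio: float        # min heavy-atom distance / sum of vdW radii
    energy_ratio: float              # pose UFF energy / mean conformer energy
    volume_overlap: float            # fraction of ligand volume overlapping protein


def _in_bounds(ratio: float) -> bool:
    # Bond lengths and angles must lie within [0.75, 1.25] x reference.
    return 0.75 <= ratio <= 1.25


def run_checks(m: PoseMeasurements) -> Dict[str, bool]:
    """Return one pass/fail flag per check, using the thresholds from the text."""
    return {
        "stereochemistry_and_bonding": m.inchi_matches,
        "bond_lengths": all(_in_bounds(r) for r in m.bond_length_ratios),
        "bond_angles": all(_in_bounds(r) for r in m.bond_angle_ratios),
        "planarity": m.max_plane_deviation <= 0.25,
        "intramolecular_clash": m.min_distance_ratio > 0.75,
        "energy_ratio": m.energy_ratio <= 100.0,
        "volume_overlap": m.volume_overlap <= 0.075,
    }


def is_pb_valid(m: PoseMeasurements) -> bool:
    """A pose counts as PB-valid only if every individual check passes."""
    return all(run_checks(m).values())


# Illustrative values only: a relaxed pose and a heavily strained one.
relaxed = PoseMeasurements(True, [1.0, 0.9], [1.1], 0.1, 0.9, 3.0, 0.02)
strained = PoseMeasurements(True, [1.0], [1.0], 0.1, 0.9, 150.0, 0.02)
failed = [name for name, ok in run_checks(strained).items() if not ok]
```

Returning a per-check dictionary rather than a single flag keeps the failure reason visible, which is useful when comparing which checks different docking methods tend to violate.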

2. Role in Protein–Ligand Docking Evaluation

The PoseBusters dataset and validation protocols are central to rigorous benchmarking of both classical (AutoDock Vina, GOLD, Smina) and deep learning-based docking algorithms (DiffDock, EquiBind, TankBind, FlowDock, Uni-Mol, Gnina) (Buttenschoen et al., 2023, Morehead et al., 2024, Sarigun et al., 24 Jun 2025, Khiari et al., 16 Sep 2025). Traditional RMSD-based evaluation, while standard, does not penalize physically unrealistic predictions. PoseBusters therefore pairs geometric correctness with chemical validity, a dual metric that has revealed that deep learning-based methods, despite strong RMSD scores, often generate chemically implausible poses.

Comparative evaluations consistently show:

Classical methods (e.g., AutoDock Vina, GOLD, Smina) yield higher PB-valid rates than most purely deep learning approaches.
Hybrid strategies, where deep learning predictions undergo post-docking energy minimization (using the AMBER ff14SB or OpenFF Sage force fields in OpenMM), significantly improve PB-valid rates for AI models.
Benchmark rates: For example, PocketVina achieves 65.65 % PB-valid success (i.e., RMSD ≤ 2 Å and passing all physical checks) on the 428-complex PoseBusters set, outperforming several learning-based competitors (Sarigun et al., 24 Jun 2025).
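To make the two reported quantities concrete, the sketch below computes the RMSD-only success rate alongside the stricter rate that also requires PB-validity. It assumes per-complex results are available as `(rmsd, pb_valid)` pairs. The function name `success_rates` and the example numbers are hypothetical and are not taken from any cited benchmark.

```python
from typing import Iterable, Tuple


def success_rates(
    results: Iterable[Tuple[float, bool]], rmsd_cutoff: float = 2.0
) -> Tuple[float, float]:
    """Return (RMSD-only success rate, RMSD-and-PB-valid success rate).

    Each result is a (rmsd_in_angstrom, pb_valid) pair for one complex.
    """
    results = list(results)
    if not results:
        raise ValueError("at least one result is required")
    n = len(results)
    rmsd_ok = sum(1 for rmsd, _ in results if rmsd <= rmsd_cutoff)
    both_ok = sum(1 for rmsd, valid in results if rmsd <= rmsd_cutoff and valid)
    return rmsd_ok / n, both_ok / n


# Made-up results for four complexes.
example = [(1.2, True), (0.8, False), (3.5, True), (1.9, True)]
rmsd_rate, pb_rate = success_rates(example)  # 0.75 and 0.5
```

The gap between the two rates is the share of geometrically accurate poses that fail at least one physical check.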

3. Integration with Docking Workflows and Model Development

In current docking pipelines, PoseBusters serves both as a retrospective performance benchmark and a prospective filter for candidate pose selection:

Post-docking refinement (energy minimization) improves PB-valid outcomes, suggesting that force field-based physics remains underrepresented in neural methodologies.
Pocket finding algorithms (e.g., RAPID-Net) integrated with docking engines (AutoDock Vina) demonstrate that improved search grid definition via predicted pockets delivers gains in PB-valid rates, especially for large proteins or remote binding sites (Balytskyi et al., 4 Feb 2025).
The dataset is also used to validate new synthetic protein–ligand complex generation pipelines, providing a standard against which the utility of synthetic training data in docking model retraining can be empirically judged (Khiari et al., 16 Sep 2025).
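As a rough illustration of the prospective-filter use mentioned at the start of this section, the sketch below picks the best-scoring pose that passes the validity checks. The `Pose` type, the `select_pose` function, and the convention that lower scores are better are all assumptions made for this example. Real pipelines may rank and fall back differently.

```python
from typing import Iterable, NamedTuple, Optional


class Pose(NamedTuple):
    name: str
    score: float    # assumed convention: lower is better
    pb_valid: bool  # outcome of the physical plausibility checks


def select_pose(poses: Iterable[Pose], fallback_to_best: bool = True) -> Optional[Pose]:
    """Return the best-scoring PB-valid pose.

    If no pose is PB-valid, return the best-scoring pose overall when
    fallback_to_best is True, otherwise None.
    """
    ranked = sorted(poses, key=lambda p: p.score)
    for pose in ranked:
        if pose.pb_valid:
            return pose
    return ranked[0] if (ranked and fallback_to_best) else None


# Illustrative candidates: the top-scoring pose fails the checks.
candidates = [
    Pose("pose_a", -9.1, False),
    Pose("pose_b", -8.7, True),
    Pose("pose_c", -7.9, True),
]
chosen = select_pose(candidates)  # pose_b
```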

4. Metrics and Quantitative Evaluation

PoseBusters enables fine-grained analysis of docking performance, going beyond global accuracy statistics. Evaluation protocols typically include:

Metric	Definition	Success Threshold
RMSD	$\sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert r_i^{\mathrm{true}} - r_i^{\mathrm{pred}} \rVert^2}$	≤ 2 Å, ≤ 5 Å
Bond/Angle tolerances	Bond lengths/angles: $[0.75, 1.25] \times$ reference	Within interval
Energy ratio	$(\text{Docked Energy}) /\langle \text{Ensemble Energy} \rangle$	≤ 100
Volume overlap	Fraction of the ligand's scaled van der Waals volume that overlaps the protein or cofactors	≤ 7.5 %

Only predicted poses that satisfy all of the physical criteria are classified as physically valid (PB-valid).
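The RMSD row of the table can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: coordinates are plain `(x, y, z)` tuples for heavy atoms in the protein's reference frame (no superposition is applied), and symmetry is handled by taking the minimum over a caller-supplied list of atom permutations. The function names and the `mappings` argument are hypothetical. Cheminformatics toolkits normally derive the symmetry-equivalent mappings from the molecular graph.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(true_xyz: Sequence[Coord], pred_xyz: Sequence[Coord]) -> float:
    """Root-mean-square deviation over paired atoms, without alignment."""
    if len(true_xyz) != len(pred_xyz) or not true_xyz:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(t, p) ** 2 for t, p in zip(true_xyz, pred_xyz))
    return math.sqrt(squared / len(true_xyz))


def symmetry_aware_rmsd(
    true_xyz: Sequence[Coord],
    pred_xyz: Sequence[Coord],
    mappings: Sequence[Sequence[int]],
) -> float:
    """Minimum RMSD over symmetry-equivalent atom mappings.

    Each mapping is a permutation `perm` in which predicted atom perm[i]
    is matched to reference atom i.
    """
    return min(
        rmsd(true_xyz, [pred_xyz[j] for j in perm]) for perm in mappings
    )


# Two symmetry-equivalent atoms predicted in swapped order.
reference: List[Coord] = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted: List[Coord] = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
naive = rmsd(reference, predicted)                                     # 1.0
aware = symmetry_aware_rmsd(reference, predicted, [[0, 1], [1, 0]])    # 0.0
```

Without the symmetry mappings, a pose that is physically identical to the reference would be penalized merely for atom ordering, which is why the benchmark's RMSD criterion is described as symmetry-aware.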

5. Applications and Impact

PoseBusters underpins several advances in structure-based drug discovery and deep molecular docking:

Enables strict assessment of pose plausibility, reducing the incidence of chemically dubious predictions in virtual screening pipelines.
Provides an external benchmark for generalization evaluation on structurally diverse and novel targets, reducing the risk of training/test leakage prevalent in pre-2021 datasets.
Facilitates comparative and ablation studies of docking models retrained with experimental vs. synthetic data, showing that synthetic complexes can recover much of the performance of experimental data, though binding pocket similarity remains pivotal (Khiari et al., 16 Sep 2025).
Influences model development, suggesting the need for explicit incorporation of physical constraints (prior knowledge from molecular mechanics, force field physics) into deep learning architectures for more robust and generalizable predictions.

6. Recent Extensions: PoseBusters in Human Motion Evaluation

The PoseBusters name also refers to an independently developed isometric exercise video dataset and evaluation framework for computer vision–based exercise feedback (Jaiswal et al., 13 Jun 2025). This collection comprises over 3,600 tagged videos spanning six multiclass isometric poses, annotated for correct and common mistake variants. Associated benchmarks include angle-based and graph-convolution network classifiers, and a three-part evaluation metric covering classification accuracy, mistake localization, and model confidence. Applications extend beyond fitness to rehabilitation, physiotherapy, and intelligent exercise feedback systems.

A plausible implication—given the biophysical rigor encoded in the protein–ligand PoseBusters framework—is that such protocols may cross-fertilize model validation strategies in other pose-related domains.

7. Significance and Future Directions

The establishment of PoseBusters as a community-standard benchmark has underscored the limitations of single-metric evaluation in model-driven molecular docking. As the field transitions toward hybrid workflows integrating AI-driven pose proposal with post-hoc physics-based filtering or refinement, the PoseBusters criteria will likely form the basis of downstream validation, rescoring, and candidate selection in large-scale virtual screening and structure-based design. Continued expansion of the dataset, with additional physical plausibility tests or coverage of new chemical/biological subspaces, can further drive reliability and model transparency in computational structural biology and cheminformatics.