A Workflow to Create a High-Quality Protein-Ligand Binding Dataset for Training, Validation, and Prediction Tasks (2411.01223v2)
Abstract: Development of scoring functions (SFs) used to predict protein-ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely-used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi-automated workflow, HiQBind-WF, that curates non-covalent protein-ligand datasets to fix these problems. We also used this workflow to create an independent data set, HiQBind, by matching binding free energies from various sources including BioLiP, Binding MOAD and BindingDB with co-crystalized ligand-protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, while also being open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
- Lu, W.; Wu, Q.; Zhang, J.; Rao, J.; Li, C.; Zheng, S. Tankbind: Trigonometry-aware neural networks for drug-protein binding structure prediction. bioRxiv 2022, 2022–06.
- Li, J.; Guan, X.; Zhang, O.; Sun, K.; Wang, Y.; Bagni, D.; Head-Gordon, T. Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. ArXiv 2024, arXiv:2308.09639v2.
- Liu, Z. H.; Tsanai, M.; Zhang, O.; Forman-Kay, J.; Head-Gordon, T. Computational Methods to Investigate Intrinsically Disordered Proteins and their Complexes. 2024; https://arxiv.org/abs/2409.02240.
- Fermi, E. Thermodynamics; Courier Corporation, 2012.
- Wang, Y.; Sun, K.; Li, J.; Guan, X.; Zhang, O.; Bagni, D.; Head-Gordon, T. PDBBind Optimization to Create a High-Quality Protein-Ligand Binding Dataset for Binding Affinity Prediction. 2024; https://figshare.com/collections/PDBBind_Optimization_to_Create_a_High-Quality_Protein-Ligand_Binding_Dataset_for_Binding_Affinity_Prediction/7520133/1.
Sponsor
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.