Deep Confident Steps to New Pockets: Strategies for Docking Generalization (2402.18396v1)
Abstract: Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
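The abstract describes Confidence Bootstrapping only at a high level: a generative (diffusion) model and a confidence model interact during training. The sketch below is a toy illustration of that general self-training idea, not the paper's algorithm. Everything in it is an assumption made for illustration: poses are single numbers, the "diffusion model" is a Gaussian sampler with a learnable mean `mu`, the "confidence model" is a hand-written oracle that scores closeness to a hypothetical `pocket` position, and the update rule is a simple move toward the mean of the most confident samples.

```python
import random


def sample_poses(mu, sigma, n, rng):
    """Toy stand-in for a generative model: draw n candidate poses around mu."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def confidence(pose, pocket):
    """Toy stand-in for a confidence model: 1.0 at the pocket, lower farther away.

    Assumption: a real confidence model is learned and does not see the true pocket.
    """
    return 1.0 / (1.0 + abs(pose - pocket))


def bootstrap(mu, pocket, rounds=20, n=64, keep=8, sigma=1.0, lr=0.5, seed=0):
    """Repeatedly sample, keep the most confident poses, and move mu toward them."""
    rng = random.Random(seed)
    for _ in range(rounds):
        poses = sample_poses(mu, sigma, n, rng)
        # Rank candidates by confidence and keep only the top few.
        top = sorted(poses, key=lambda p: confidence(p, pocket), reverse=True)[:keep]
        # "Fine-tune" the sampler: shift its mean toward the confident samples.
        target = sum(top) / len(top)
        mu += lr * (target - mu)
    return mu


if __name__ == "__main__":
    start = 0.0
    final = bootstrap(mu=start, pocket=5.0)
    print(f"start mean: {start:.2f}, mean after bootstrapping: {final:.2f}")
```

In this toy setting the sampler starts far from the pocket and drifts toward it using only the confidence signal, with no ground-truth poses used in the update. That is the intuition the abstract points to for docking to unseen protein classes.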