
Deep Confident Steps to New Pockets: Strategies for Docking Generalization (2402.18396v1)

Published 28 Feb 2024 in q-bio.BM and cs.LG

Abstract: Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.


Summary

  • The paper introduces the DockGen benchmark, a rigorous tool for evaluating ML models' generalization to unseen protein binding pockets.
  • The paper presents Confidence Bootstrapping, a self-training approach combining diffusion and confidence models to enhance docking accuracy.
  • The findings demonstrate that scaling data and models, along with synthetic augmentation, significantly improve blind docking prediction success.

Enhancing Blind Docking Generalizability with DockGen Benchmark and Confidence Bootstrapping

Introduction

Molecular docking predicts how small molecules bind to proteins, and accurate predictions are valuable for both biological research and therapeutic development. Blind docking is the harder variant of the task, in which the binding site on the protein is not given in advance. Solving it in general could accelerate drug development, help predict adverse drug reactions, and clarify the function of the many proteins that remain poorly understood. The main obstacle is generalization: ML-based docking methods transfer poorly to unseen protein classes, and existing benchmarks do not adequately measure how well these models generalize across the proteome.

The DockGen Benchmark

To address the limitations of current benchmarks, the paper introduces DockGen, a benchmark designed to evaluate how well docking methods generalize across protein domains. DockGen is built on a classification of the ligand-binding domains of proteins, which yields a more challenging and representative test than existing benchmarks. On it, existing ML-based docking models generalize very poorly to unseen binding pockets. The benchmark also supports a detailed analysis of the DiffDock method, which shows that increasing the amount of training data, scaling up the model, and adding synthetic data substantially improve generalization and set a new state of the art for ML-based docking.
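The idea of testing on unseen binding domains can be illustrated with a group-based split. The sketch below is only an illustration of that principle, not the DockGen construction itself: the complex identifiers, the domain labels, and the `split_by_domain` helper are all hypothetical.

```python
def split_by_domain(complexes, held_out_domains):
    """Split complexes so that no held-out domain label appears in training.

    `complexes` is a list of (complex_id, domain_label) pairs. The labels
    stand in for a ligand-binding domain classification (an assumption of
    this sketch, not the benchmark's actual data format).
    """
    train, test = [], []
    for complex_id, domain in complexes:
        if domain in held_out_domains:
            test.append(complex_id)
        else:
            train.append(complex_id)
    return train, test


# Hypothetical data, for illustration only.
complexes = [
    ("c1", "domain_A"),
    ("c2", "domain_A"),
    ("c3", "domain_B"),
    ("c4", "domain_C"),
]
train_ids, test_ids = split_by_domain(complexes, held_out_domains={"domain_C"})
```

Splitting on the domain label rather than on individual complexes is what keeps the test pockets unseen: a random split would typically place members of the same domain on both sides.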

Confidence Bootstrapping for Enhanced Docking Accuracy

Building on these insights, the paper introduces a training paradigm called Confidence Bootstrapping. The method relies solely on the interaction between a diffusion model and a confidence model within a self-training framework: the diffusion model generates docking poses, the confidence model scores them, and the diffusion model is refined using those confidence scores. Repeating this loop significantly improves docking to previously unseen protein classes. The reported docking success rates increase markedly, showing that the method narrows the generalization gap for blind docking.
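The generate, score, and refine loop can be sketched with a deliberately simplified toy. Everything here is an assumption made for illustration: poses are single numbers, the "diffusion model" is a Gaussian sampler with an adjustable mean, and the "confidence model" is a stand-in that scores a pose by its distance to a hidden pocket position. A real confidence model is learned and has no access to the true pose, and the paper's actual training procedure is not reproduced here.

```python
import random


def sample_poses(mean, spread, n, rng):
    """Stand-in for the diffusion model: draw n candidate poses (1-D toy)."""
    return [rng.gauss(mean, spread) for _ in range(n)]


def confidence(pose, pocket):
    """Stand-in for the confidence model: higher means more plausible.

    Toy assumption: it uses the hidden pocket position directly, which a
    learned confidence model would not have.
    """
    return -abs(pose - pocket)


def confidence_bootstrap(mean, pocket, rounds=20, n=64, keep=8, lr=0.5, seed=0):
    """Repeatedly sample poses, keep the most confident, and shift the sampler."""
    rng = random.Random(seed)
    for _ in range(rounds):
        poses = sample_poses(mean, 1.0, n, rng)
        # Rank poses by confidence and keep the top `keep` as pseudo-labels.
        best = sorted(poses, key=lambda p: confidence(p, pocket), reverse=True)[:keep]
        # "Fine-tune" the sampler by moving its mean toward the kept poses.
        mean += lr * (sum(best) / keep - mean)
    return mean


start, pocket = 0.0, 3.0
final_mean = confidence_bootstrap(start, pocket)
```

The sampler starts far from the pocket and never sees a ground-truth pose; the confidence scores alone steer it, which is the self-training aspect the paragraph above describes.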

Theoretical and Practical Implications

DockGen and Confidence Bootstrapping have implications for both the theory and the practice of molecular docking. On the theoretical side, the analysis shows that scaling data and model size, together with synthetic data generation, is central to improving the generalization of ML-based docking methods. On the practical side, the contributions offer a viable path toward blind docking that is accurate and generalizable across the proteome, which could substantially benefit drug discovery.

Future Outlook

The findings and methods presented in this work open the way for further progress in molecular docking and generative modeling. The improvement in docking accuracy achieved through Confidence Bootstrapping suggests that self-training schemes have largely untapped potential for refining ML models on complex tasks. DockGen, as a stringent benchmark, invites the development of more capable models. Larger datasets, better model architectures, and new training strategies will all be important in moving toward a comprehensive solution for blind docking.

Conclusion

In summary, the paper advances the generalization of ML-based docking methods. With the DockGen benchmark and the Confidence Bootstrapping training paradigm, it sets a new state of the art and opens avenues for future research. The work addresses a long-standing challenge in drug discovery and shows how machine learning can help uncover new biological insights and accelerate therapeutic development.
