DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening (2310.06367v1)
Abstract: Virtual screening, which identifies potential drugs from vast compound databases that bind to a particular protein pocket, is a critical step in AI-assisted drug discovery. Traditional docking methods are highly time-consuming and can only work with a restricted search library in real-life applications. Recent supervised learning approaches that use scoring functions for binding-affinity prediction, although promising, have not yet surpassed docking methods because they depend strongly on limited data with reliable binding-affinity labels. In this paper, we propose a novel contrastive learning framework, DrugCLIP, which reformulates virtual screening as a dense retrieval task and employs contrastive learning to align representations of binding protein pockets and molecules from a large quantity of pairwise data without explicit binding-affinity scores. We also introduce a biological-knowledge-inspired data augmentation strategy to learn better protein-molecule representations. Extensive experiments show that DrugCLIP significantly outperforms traditional docking and supervised learning methods on diverse virtual screening benchmarks with greatly reduced computation time, especially in the zero-shot setting.
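To make the dense-retrieval framing concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a CLIP-style symmetric in-batch contrastive loss over paired pocket and molecule embeddings and cosine-similarity ranking at screening time; the abstract does not specify the exact loss, encoders, or temperature, so the function names, the `temperature` value, and the use of random vectors in place of learned embeddings are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(pocket_emb, mol_emb, temperature=0.07):
    """CLIP-style in-batch loss (assumed form, not taken from the paper).

    Row i of pocket_emb and row i of mol_emb form a binding pair; every
    other row in the batch serves as a negative for that pair.
    """
    p = l2_normalize(pocket_emb)
    m = l2_normalize(mol_emb)
    logits = p @ m.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(len(p))
    # Pocket -> molecule direction: softmax over molecules for each pocket.
    loss_p2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Molecule -> pocket direction: softmax over pockets for each molecule.
    loss_m2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2m + loss_m2p)


def rank_library(pocket_emb, library_emb, top_k=5):
    """Rank a molecule library against one pocket by cosine similarity."""
    scores = l2_normalize(library_emb) @ l2_normalize(pocket_emb)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pockets = rng.normal(size=(8, 16))  # stand-in for pocket encoder output
    molecules = pockets + 0.1 * rng.normal(size=(8, 16))  # noisy "binders"
    print("loss:", symmetric_contrastive_loss(pockets, molecules))
    print("top hits for pocket 0:", rank_library(pockets[0], molecules, top_k=3)[0])
```

Under this framing, the molecule library can be embedded once offline, so screening a new pocket reduces to one matrix-vector product and a sort, with no per-compound docking run. No binding-affinity value enters the loss; only the pairing of a pocket with its molecule is used as supervision.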