Synthetic Data from Diffusion Models Improve Drug Discovery Prediction (2405.03799v1)
Abstract: AI is increasingly used in every stage of drug development. Continued breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are often collected independently of one another, with little overlap, creating data sparsity. Data sparsity makes data curation difficult for researchers looking to answer key research questions that require values spanning multiple datasets. We propose a novel diffusion GNN model, Syngand, capable of generating ligand and pharmacokinetic data end-to-end. We provide and demonstrate a methodology for sampling pharmacokinetic data for existing ligands using our Syngand model. We show initial promising results on the efficacy of Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central. Using our proposed model and methodology, researchers can easily generate synthetic ligand data to help them explore research questions that require data spanning multiple datasets.
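The data sparsity problem described in the abstract can be illustrated with a small sketch. The example below is not the authors' code: it assumes three hypothetical toy datasets keyed by SMILES string, with placeholder numbers rather than real measurements, and shows how joining independently collected datasets leaves most cells empty and few or no ligands with values in every dataset.

```python
# Hypothetical toy datasets keyed by SMILES string. The numbers are
# placeholders for illustration only, not real measurements.
solubility = {"CCO": 1.0, "c1ccccc1": 2.0, "CC(=O)O": 3.0}
ld50 = {"CCO": 4.0, "CCN": 5.0, "CCCC": 6.0}
herg = {"c1ccccc1": 7.0, "CCN": 8.0, "CCOC": 9.0}

datasets = {"solubility": solubility, "ld50": ld50, "herg": herg}

# Union of all ligands seen in any dataset (the rows of the joined table).
ligands = sorted(set().union(*datasets.values()))

# Outer join: one row per ligand, None where a dataset has no value for it.
table = {
    lig: {name: data.get(lig) for name, data in datasets.items()}
    for lig in ligands
}

# Fraction of (ligand, property) cells with no measurement.
total_cells = len(ligands) * len(datasets)
missing_cells = sum(v is None for row in table.values() for v in row.values())
missing_fraction = missing_cells / total_cells

# Ligands that have a value in every dataset (usable without imputation).
complete_ligands = [
    lig for lig, row in table.items()
    if all(v is not None for v in row.values())
]

print(f"ligands: {len(ligands)}, missing fraction: {missing_fraction:.2f}")
print(f"ligands with all properties: {len(complete_ligands)}")
```

In this toy setup no ligand has all three properties, so a research question that needs all of them has no usable rows. Filling the empty cells with sampled values for existing ligands is the gap the abstract says Syngand is meant to address.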