Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding

Published 14 Aug 2024 in q-bio.QM, cs.AI, and cs.LG | (2408.07636v2)

Abstract: AI is increasingly used in every stage of drug development. One challenge facing drug discovery AI is that drug pharmacokinetic (PK) datasets are often collected independently from each other, often with limited overlap, creating data overlap sparsity. Data sparsity makes data curation difficult for researchers looking to answer research questions in poly-pharmacy, drug combination research, and high-throughput screening. We propose Imagand, a novel SMILES-to-Pharmacokinetic (S2PK) diffusion model capable of generating an array of PK target properties conditioned on SMILES inputs. We show that Imagand-generated synthetic PK data closely resembles real data univariate and bivariate distributions, and improves performance for downstream tasks. Imagand is a promising solution for data overlap sparsity and allows researchers to efficiently generate ligand PK data for drug discovery research. Code is available at https://github.com/bing1100/Imagand.

Summary

The paper introduces Imagand, a novel SMILES-to-PK diffusion model that mitigates data sparsity in pharmacokinetics by generating synthetic data.
It integrates pretrained SMILES encoders and a discrete local Gaussian noise model to capture complex molecular structures and enhance data fidelity.
Experimental results using metrics like MSE, R2, PCC, and Hellinger Distance demonstrate a high correlation between synthetic and real pharmacokinetic profiles.

SMILES-to-Pharmacokinetics Diffusion Models in Drug Discovery

The paper "Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding" by Bing Hu, Anita Layton, and Helen Chen addresses data sparsity in pharmacokinetics (PK) with a diffusion model called Imagand. The model takes molecules written in the Simplified Molecular Input Line Entry System (SMILES) and generates PK profiles conditioned on learned SMILES embeddings, with the aim of supporting more robust drug discovery.

Introduction

The paper targets the data sparsity that affects drug discovery, particularly in PK datasets, which are often collected independently and overlap only partially. This limited overlap hinders research in poly-pharmacy, drug combinations, and high-throughput screening (HTS). Collecting PK data experimentally is expensive and time-consuming, which motivates AI methods that generate synthetic yet realistic data. Imagand, a SMILES-to-Pharmacokinetics (S2PK) diffusion model, is proposed to generate a range of PK properties from SMILES inputs.

Methodology

Pre-trained SMILES Encoder

Imagand obtains its molecular representations from pretrained SMILES encoders. Pretrained models such as ChemBERTa, T5, and DeBERTa were employed, with T5 and DeBERTa reported as performing better across various tasks. The encoders were pretrained on large SMILES-only corpora such as PubChem, which is described as capturing complex molecular structures better than pretraining on smaller, task-specific datasets.

Diffusion Model Architecture

Imagand is built on the denoising diffusion probabilistic model (DDPM) framework. A forward process gradually adds noise to the PK data, and a learned reverse process removes it, so the model learns to recover data structure from noise. Denoising is combined with classifier-free guidance: the SMILES conditioning is randomly dropped during training, so a single model learns both conditional and unconditional denoising.
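The forward noising step and the classifier-free guidance rule can be sketched in a few lines of NumPy. This is a minimal illustration of the standard DDPM formulation, not the authors' implementation: the linear beta schedule, the step count, the dropout probability `p_uncond`, and all function names are assumptions chosen for the example.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def drop_condition(cond, p_uncond, rng):
    """Zero the conditioning vector for a random subset of rows.

    Rows with a zeroed condition train the unconditional branch;
    the remaining rows train the conditional branch.
    """
    keep = rng.random(cond.shape[0]) >= p_uncond
    return cond * keep[:, None]

def guided_eps(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: move from the unconditional noise
    estimate toward the conditional one by `guidance_scale`."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At sampling time, a denoiser would be evaluated twice per step (with and without the SMILES embedding) and the two noise estimates combined with `guided_eps`. A scale of 1 recovers the purely conditional estimate.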

Discrete Local Gaussian Noise Model

A novel noise model, Discrete Local Gaussian Noise (DLGN), was introduced to improve the resemblance between synthetic and real data. By decomposing PK distributions into local Gaussian distributions within discrete bins, DLGN aligns training noise more closely with true data distributions, thus enhancing model performance.
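One possible reading of this idea is sketched below. The summary does not give the exact DLGN construction, so the details here are assumptions: values of a single PK property are split into equal-width bins, a Gaussian is fitted inside each non-empty bin, and noise is drawn by first picking a bin in proportion to its count and then sampling from that bin's Gaussian. The function names and the bin count are hypothetical.

```python
import numpy as np

def fit_local_gaussians(values, num_bins=10):
    """Fit one Gaussian per non-empty equal-width bin of a 1-D sample.

    Returns mixture weights, means, and standard deviations.
    """
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), num_bins + 1)
    # Interior edges map every value to a bin index in 0..num_bins-1.
    idx = np.digitize(values, edges[1:-1])
    weights, means, stds = [], [], []
    for b in range(num_bins):
        members = values[idx == b]
        if members.size == 0:
            continue  # skip empty bins so no noise is placed there
        weights.append(members.size)
        means.append(members.mean())
        stds.append(members.std() + 1e-6)  # avoid a zero scale
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum(), np.asarray(means), np.asarray(stds)

def sample_local_gaussian_noise(weights, means, stds, size, rng):
    """Draw noise from the mixture of per-bin Gaussians."""
    comp = rng.choice(len(weights), size=size, p=weights)
    return rng.normal(means[comp], stds[comp])
```

Under this reading, the sampled noise follows the multimodal shape of the real property distribution instead of a single standard normal, which is the alignment the paragraph above describes.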

Pharmacokinetic Datasets

The study used ten diverse PK datasets from TDCommons, covering absorption, distribution, metabolism, and excretion (ADME), as well as toxicity. Merging these datasets produced a combined PK profile for 30,000 unique drugs, providing a rich basis for model training and testing.
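The overlap sparsity that motivates the work can be seen by merging a few per-property tables keyed by SMILES. The sketch below uses plain Python dictionaries. The property names and numeric values are made-up toy data for illustration only, not taken from TDCommons, and the helper names are hypothetical.

```python
def merge_pk_tables(tables):
    """Outer-join per-property tables on SMILES.

    `tables` maps property name -> {SMILES: value}. Missing
    measurements are represented as None in the merged rows.
    """
    all_smiles = sorted(set().union(*(t.keys() for t in tables.values())))
    props = list(tables)
    return {s: {p: tables[p].get(s) for p in props} for s in all_smiles}

def missing_fraction(merged):
    """Fraction of (molecule, property) cells with no measurement."""
    cells = [v for row in merged.values() for v in row.values()]
    return sum(v is None for v in cells) / len(cells)

# Toy tables with invented values; each property covers different molecules.
toy_tables = {
    "caco2": {"CCO": -4.3, "c1ccccc1": -4.9},
    "lipophilicity": {"c1ccccc1": 2.1, "CC(=O)O": -0.2},
    "half_life": {"CCO": 3.5},
}
merged = merge_pk_tables(toy_tables)
sparsity = missing_fraction(merged)
```

Even in this three-molecule toy case, nearly half of the merged cells are empty. A generative model conditioned on SMILES is one way to fill such gaps with synthetic values.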

Experimental Validation

Machine Learning Efficiency

Downstream performance was measured with Mean Squared Error (MSE), R-squared (R2), and Pearson Correlation Coefficient (PCC). On many datasets, results obtained with Imagand-generated synthetic data closely matched those obtained with real data, and in some cases exceeded them, especially where real data was sparse.
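For reference, the three metrics can be computed as follows. This is a minimal NumPy sketch using the textbook definitions. The paper's exact evaluation protocol (data splits, downstream models) is not given in this summary, so none is assumed here.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```

Lower MSE and higher R2 and PCC indicate better agreement. R2 is zero for a predictor that always outputs the mean of the targets.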

Univariate and Bivariate Analysis

Imagand's synthetic data was evaluated against real data using Hellinger Distance (HD) for univariate analysis. The calculated Hellinger Distances demonstrated a high fidelity of synthetic data to real data distributions. For bivariate analysis, Differential Pairwise Correlations (DPC) were computed, showing close alignment between synthetic and actual correlations among PK properties.
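Both comparisons can be sketched with NumPy. The Hellinger distance below is the standard definition applied to histograms over a shared range. The bin count is an assumption, and the pairwise-correlation function is one plausible reading of DPC (the absolute difference between the Pearson correlation matrices of real and synthetic data), not necessarily the paper's exact formula.

```python
import numpy as np

def hellinger_distance(real, synth, num_bins=20):
    """Hellinger distance between histograms of two 1-D samples.

    Returns a value in [0, 1]: 0 for identical histograms,
    1 for histograms with no overlapping mass.
    """
    real, synth = np.asarray(real, float), np.asarray(synth, float)
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=num_bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=num_bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def pairwise_correlation_difference(real, synth):
    """Absolute difference between Pearson correlation matrices.

    Inputs are 2-D arrays with one column per PK property.
    """
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synth, rowvar=False)
    return np.abs(real_corr - synth_corr)
```

A small Hellinger distance per property indicates matching univariate distributions, and entries near zero in the correlation-difference matrix indicate that relationships between properties are preserved.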

Implications and Future Work

The generation of high-fidelity synthetic PK data addresses significant bottlenecks in drug discovery, particularly in early-stage screening and poly-pharmacy research. By enabling robust data augmentation, Imagand opens new avenues for efficient and cost-effective drug development pipelines.

Future work could extend Imagand’s capabilities to categorical PK properties and larger datasets, broadening the model's granularity and scope. The proposed methodology also sets a precedent for developing similar models in other domains of pharmaceutical research and beyond.

In conclusion, Imagand mitigates data sparsity in PK datasets by combining a diffusion model with pretrained molecular embeddings. The approach may support drug discovery workflows and high-throughput screening, potentially accelerating drug development.
