BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning (2406.03686v1)

Published 6 Jun 2024 in cs.LG

Abstract: Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained LLM can serve at the same time as a 3D molecular generative model, conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such simple conceptual approach combined with pretraining and scaling can perform on par or better than the current best specialized diffusion models, LLMs, and graph neural networks while being two orders of magnitude cheaper to sample.

Summary

The paper introduces BindGPT, a model that combines language modeling and reinforcement learning to generate molecular graphs and their 3D conformations jointly, optimized for binding affinity to a target protein pocket.
It employs structural SMILES and XYZ descriptors to encode molecules, achieving nearly 99% validity and up to 100x faster sampling than existing methods.
The reinforcement learning fine-tuning with docking score feedback enables effective pocket-conditioned generation, advancing drug design applications.

BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

The paper introduces BindGPT, a generative model for creating 3D molecular structures optimized for binding affinity to a given protein pocket. The work combines language modeling and reinforcement learning techniques and tailors them to 3D molecular design. BindGPT generates molecular graphs and their corresponding 3D conformations jointly, removing the separate graph reconstruction step that other approaches typically require.

Methodology

The authors leverage a GPT-based LLM to represent complex 3D molecular structures as sequences of textual tokens. The innovation here is the adoption of structural SMILES and spatial XYZ descriptors to encode atomic positions and bonds, eliminating reliance on external tools like OpenBabel for graph reconstruction. The pretraining stage utilizes a substantial dataset of 3D molecular structures, comprising 208 million conformations across 12 million molecules, alongside 3.2 million protein pocket structures. The model is subsequently fine-tuned on the CrossDocked dataset, which contains aligned pocket-ligand pairs. This two-stage training approach places BindGPT in a robust position to model both unconditional 3D molecular generation and pocket-conditioned tasks.
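As a rough illustration of the text encoding described above, the following sketch serializes a molecule as a SMILES string followed by one coordinate line per atom. The tag names `<LIGAND>` and `<XYZ>`, the function name `ligand_to_text`, the three-decimal precision, and the example coordinates are all assumptions made for illustration, not the paper's exact format. The sketch also assumes that coordinates are listed in the same order as the atoms appear in the SMILES string, which is what would let the bond graph be read directly from the SMILES rather than inferred from distances.

```python
def ligand_to_text(smiles, coords, decimals=3):
    """Serialize a SMILES string and per-atom coordinates into one text sequence.

    Assumption: coords[i] belongs to the i-th atom of the SMILES string.
    The <LIGAND> and <XYZ> tags are hypothetical placeholders.
    """
    lines = ["<LIGAND>", smiles, "<XYZ>"]
    for x, y, z in coords:
        lines.append(f"{x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}")
    return "\n".join(lines)


# Ethanol heavy atoms (C, C, O) with made-up coordinates in SMILES atom order.
text = ligand_to_text("CCO", [(0.0, 0.0, 0.0), (1.52, 0.0, 0.0), (2.0, 1.35, 0.0)])
print(text)
```

A language model trained on such text sees bonds (through the SMILES) and geometry (through the coordinate lines) in a single sequence, so sampling one sequence yields both.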

Pretraining uses a large-batch regime, made efficient by FlashAttention and the DeepSpeed optimization library. During pretraining, the input representation combines ligand structures and protein pockets into one contiguous sequence of tokens, so the model can use the pocket as context when modeling molecular interactions.
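The idea of placing pocket and ligand in one contiguous sequence can be sketched as follows. This is a minimal sketch under stated assumptions: the `<POCKET>` and `<LIGAND>` tags, the helper names, and the one-decimal pocket precision are hypothetical, and the ligand text is treated as an opaque string. The split function shows how the pocket part could serve as a prompt at sampling time, with the ligand part left for the model to generate.

```python
def pocket_to_text(pocket_atoms, decimals=1):
    """Write pocket atoms as 'name x y z' lines under a hypothetical <POCKET> tag."""
    lines = ["<POCKET>"]
    for name, (x, y, z) in pocket_atoms:
        lines.append(f"{name} {x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}")
    return "\n".join(lines)


def build_sequence(pocket_atoms, ligand_text):
    """Concatenate pocket text and ligand text into one contiguous sequence."""
    return pocket_to_text(pocket_atoms) + "\n" + ligand_text


def split_prompt(sequence, ligand_tag="<LIGAND>"):
    """Split into (prompt, continuation); the prompt ends with the ligand tag."""
    head, _, tail = sequence.partition(ligand_tag)
    return head + ligand_tag, tail


# Illustrative pocket atoms and ligand text; all values are made up.
pocket = [("N", (1.0, 2.0, 3.0)), ("CA", (2.5, 2.0, 3.1))]
sequence = build_sequence(pocket, "<LIGAND>\nCCO\n<XYZ>\n0.0 0.0 0.0")
prompt, continuation = split_prompt(sequence)
print(prompt)
```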

Supervised fine-tuning builds on the pretrained model using the CrossDocked dataset, augmented with rotated coordinates and randomized SMILES representations to mitigate overfitting. In the pocket-conditioned generation task, binding affinity scores are supplied as additional context, steering the model toward high-affinity molecules.
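The rotation part of this augmentation can be sketched with the standard library alone. The sketch below draws a uniformly random 3D rotation from a normalized Gaussian quaternion and applies it to a list of coordinates. The function names are assumptions for illustration, and the way the authors actually sample rotations is not stated in this summary. SMILES randomization is omitted because it needs a cheminformatics toolkit.

```python
import math
import random


def random_rotation_matrix(rng):
    """Uniform random 3D rotation built from a normalized Gaussian quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(coords, matrix):
    """Apply a 3x3 rotation matrix to every (x, y, z) point."""
    return [
        tuple(sum(row[k] * point[k] for k in range(3)) for row in matrix)
        for point in coords
    ]


rng = random.Random(0)
rotation = random_rotation_matrix(rng)
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)]
rotated = rotate(points, rotation)
```

For a pocket-ligand pair, the same matrix would need to be applied to both sets of coordinates so that their relative geometry is preserved.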

Results

Generative Modeling of 3D Molecules

BindGPT's generative quality is measured with Validity, Synthetic Accessibility (SA), Quantitative Estimation of Druglikeness (QED), and compliance with Lipinski's Rule of Five. BindGPT reaches nearly 99% validity in molecule generation, far above competing models such as XYZ-Transformer, which reaches only about 13%. On 3D structure metrics such as RMSD, and on time to sample, BindGPT also comes out ahead, with up to a 100x sampling speedup over existing models.
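Two of these metrics are simple enough to sketch directly. The code below counts violations of Lipinski's Rule of Five from precomputed descriptors (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and computes validity as the fraction of samples flagged as valid. The function names and the descriptor values are illustrative assumptions. In practice the descriptors and the validity flags would come from a cheminformatics toolkit, and the exact scoring convention used in the paper may differ.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four Rule of Five thresholds are exceeded."""
    checks = [
        mol_weight > 500,   # molecular weight in Da
        logp > 5,           # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)


def validity_rate(valid_flags):
    """Fraction of generated samples that parsed as valid molecules."""
    return sum(valid_flags) / len(valid_flags)


# Illustrative descriptor values, not measurements of real generated molecules.
small = lipinski_violations(180.2, 1.2, 1, 4)
large = lipinski_violations(650.0, 6.1, 2, 11)
rate = validity_rate([True, True, False, True])
```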

In conformation generation, BindGPT matches the performance of specialized models like Torsional Diffusion and Uni-Mol, especially when aided by tools like RDKit. This underscores the model's capacity for accurate spatial representation of molecular structures, pivotal in applications like drug design where 3D accuracy is critical.
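Conformer quality is typically scored with RMSD between a generated and a reference conformation. The sketch below computes the plain root-mean-square deviation over matched atoms. It assumes both conformers list atoms in the same order and are already superimposed. Benchmark evaluations usually align the structures first, and that alignment step is not shown here. The function name is an assumption for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered, pre-aligned conformers."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformers must be non-empty and have the same atom count")
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the second conformer is the first shifted by 1 along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
shifted = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
value = rmsd(reference, shifted)
```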

Pocket-Conditioned Molecular Generation

The paper highlights BindGPT's application in generating molecules with high binding affinities for specified protein pockets. In this task, models are evaluated on their ability to generate ligands with favorable QVINA docking scores while maintaining druglike properties as measured by SA and QED. BindGPT achieves the best binding affinity, particularly in its reinforcement learning (RL) fine-tuned variant, which is optimized with feedback from docking scores. This RL fine-tuned model significantly outperforms baselines such as Pocket2Mol and TargetDiff on binding affinity while keeping its druglikeness metrics competitive.
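The feedback loop can be sketched as a generic policy-gradient objective. This summary does not specify the RL algorithm or reward shaping the authors use, so everything below is an assumption: docking scores are negated to form rewards (lower docking scores indicate better predicted binding), failed dockings receive a fixed penalty, the batch mean serves as a baseline, and the loss is a REINFORCE-style weighted negative log-likelihood. All function names and numbers are hypothetical.

```python
def docking_rewards(scores, invalid_penalty=0.0):
    """Turn docking scores into rewards; None marks an invalid or undockable sample."""
    return [invalid_penalty if s is None else -s for s in scores]


def advantages(rewards):
    """Subtract the batch-mean baseline to reduce gradient variance."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def policy_gradient_loss(log_probs, advs):
    """REINFORCE-style loss: minimizing it raises the likelihood of high-advantage samples."""
    return -sum(lp * a for lp, a in zip(log_probs, advs)) / len(advs)


# Made-up docking scores (kcal/mol) and sequence log-probabilities for three samples.
rewards = docking_rewards([-8.0, -6.0, None])
advs = advantages(rewards)
loss = policy_gradient_loss([-10.0, -12.0, -9.0], advs)
```

In a real training loop the log-probabilities would come from the language model for each sampled molecule, and the loss would be differentiated with respect to the model parameters.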

Implications

The ability of BindGPT to generate accurate 3D molecular structures with proficiency comparable to specialized models, yet with reduced computational overhead, presents substantial practical benefits. By integrating reinforcement learning, BindGPT can adapt to novel protein targets, thereby broadening its application in drug discovery pipelines. Conceptually, this method paves the way for more generalized AI models in the domain of biochemical and pharmaceutical design, capable of tackling diverse generative tasks within a unified framework.

Future Directions

The BindGPT framework offers promising avenues for future research. One potential direction is the exploration of even larger pretraining datasets, which could further enhance model accuracy and generalization. Moreover, integrating more sophisticated forms of feedback beyond docking scores, such as those based on experimental binding data, could refine the RL training process. Finally, extending the framework to handle even more complex molecular interactions, such as those involving multi-protein complexes or dynamic protein conformations, could significantly broaden its applicability in computational biology and medicinal chemistry.

In conclusion, BindGPT represents a significant step forward in the field of AI-driven drug design, combining scalability, efficiency, and high performance across a spectrum of critical molecular design tasks.
