- The paper introduces LipidBERT, a transformer-based model pre-trained on a de novo library of 10 million virtual lipid structures to address the scarcity of experimentally confirmed ionizable lipids.
- The methodology leverages Masked Language Modeling and innovative secondary tasks to achieve R² values over 0.9 in lipid nanoparticle property prediction.
- The model outperforms traditional descriptor-based baselines, and its embedding visualizations cluster lipids by shared substructure, paving the way for advanced design in lipid-based nanomedicine.
LipidBERT: A Lipid LLM Pre-trained on METiS de novo Lipid Library
The paper "LipidBERT: A Lipid LLM Pre-trained on METiS de novo Lipid Library" addresses a critical gap in the application of transformer-based architectures for lipid molecules by presenting a model tailored to ionizable lipids. The scarcity of experimentally confirmed ionizable lipid structures has historically hindered pre-training on these molecules. This work makes significant strides by generating a comprehensive de novo lipid library, consisting of 10 million virtual lipid structures, subsequently used to pre-train LipidBERT. This research highlights the application of self-supervised learning to predict properties of lipid nanoparticles (LNPs) effectively.
Methodology and Core Contributions
The paper introduces the METiS de novo Lipid Library, developed through fragment-based generative methods and reinforcement learning, providing a substantial corpus for lipid representation learning. LipidBERT is then pre-trained on this library using the Masked Language Modeling (MLM) task familiar from BERT, along with novel secondary tasks designed to capture the unique characteristics of lipids: number-of-tails prediction, connecting-atom prediction (both sequence and token classification), head/tail classification, and rearranged/decoy SMILES classification.
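To make the MLM setup concrete, the following is a minimal Python sketch of BERT-style masking applied to SMILES tokens. The regex tokenizer, masking probability, and example SMILES are illustrative assumptions; the paper's actual tokenizer, vocabulary, and secondary-task heads are not reproduced here.

```python
import random
import re

# Hypothetical regex-based SMILES tokenizer (the paper's actual tokenization
# scheme is not specified here): splits bracket atoms, two-letter elements,
# stereo markers, ring/branch symbols, and single characters into tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|@@|@|=|#|\(|\)|\.|\+|-|\d|[A-Za-z]"
)

def tokenize_smiles(smiles):
    return SMILES_TOKEN_PATTERN.findall(smiles)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masking: each token is replaced by [MASK] with probability
    mask_prob; masked positions keep their original token as the MLM label,
    unmasked positions get None (ignored by the loss)."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Toy, purely illustrative lipid-like SMILES (ester-linked tails + amine head).
tokens = tokenize_smiles("CCCCCCCCC(=O)OCCN(C)CCOC(=O)CCCCCCCC")
masked, labels = mask_tokens(tokens)
print(masked)
```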
Pre-Training Approach
During pre-training, LipidBERT learns to predict masked tokens while performing the secondary tasks critical to understanding lipid structures. Notably, the embeddings learned through the MLM and head/tail classification tasks cluster lipids by shared molecular substructures, demonstrating the model's capacity to distinguish lipid features. Visualization of these embeddings further substantiates the model's ability to categorize lipids, although the connecting-atom prediction tasks proved more challenging, highlighting areas for refinement in future iterations.
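As a rough illustration of the embedding-visualization step, the sketch below projects per-lipid embeddings to two dimensions with t-SNE and colors them by a structural label. The `encode` helper (mapping a SMILES string to a fixed-size vector such as a [CLS] embedding) is a hypothetical stand-in for the pre-trained model; the paper's exact projection method and labels are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_lipid_embeddings(smiles_list, encode, labels):
    """Project per-lipid embeddings to 2D and color by a structural label
    (e.g., number of tails or head-group class)."""
    embeddings = np.stack([encode(s) for s in smiles_list])  # (n_lipids, hidden_dim)
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="viridis")
    plt.xlabel("t-SNE 1")
    plt.ylabel("t-SNE 2")
    plt.title("Lipid embeddings grouped by shared substructure")
    plt.show()
```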
The fine-tuning phase utilized both proprietary wet-lab experimental datasets and publicly available datasets (AGILE). The fine-tuned LipidBERT models exhibited remarkable predictive performance, especially in LNP property prediction, achieving R² values exceeding 0.9 for most properties. Fine-tuning on scaled datasets further reinforced the model's efficacy, showing improved R² values as training dataset sizes increased.
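The fine-tuning setup can be pictured as a small regression head on top of the pre-trained encoder, as in the PyTorch sketch below. The pooling choice, head size, and encoder interface are illustrative assumptions, not the paper's exact implementation; R² on held-out lipids is the reported metric.

```python
import torch
import torch.nn as nn

class LipidPropertyRegressor(nn.Module):
    """Pre-trained encoder plus a small regression head for one LNP property."""
    def __init__(self, encoder, hidden_dim, n_properties=1):
        super().__init__()
        self.encoder = encoder                      # pre-trained LipidBERT-style encoder (assumed interface)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, n_properties),    # e.g., fluorescence intensity
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)  # assumed shape: (batch, seq_len, hidden_dim)
        pooled = hidden[:, 0]                             # [CLS]-style pooling (illustrative choice)
        return self.head(pooled)

# Held-out predictive quality is then summarized as R²,
# e.g. sklearn.metrics.r2_score(y_true, y_pred).
```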
Comparison with Existing Models
LipidBERT was benchmarked against traditional XGBoost models trained on regular, Molecular Dynamics (MD), and Quantum Mechanics (QM) descriptors. LipidBERT outperformed these models, particularly in predicting overall fluorescence intensity, highlighting the superiority of pre-trained transformer-based architectures over descriptor-based methods. Additionally, LipidBERT demonstrated state-of-the-art performance when fine-tuned on the AGILE dataset, surpassing the graph-based AGILE model.
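For reference, a descriptor-based baseline of the kind LipidBERT is compared against might look like the sketch below, assuming a pre-computed feature matrix `X` of per-lipid descriptors (standard, MD-derived, or QM-derived) and a target `y` such as fluorescence intensity. The hyperparameters and split are illustrative, not those used in the paper.

```python
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def descriptor_baseline_r2(X, y):
    """Train an XGBoost regressor on per-lipid descriptors and report held-out R²."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))
```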
The comparison with PhatGPT, a GPT-like lipid generation model, revealed that LipidBERT is notably more effective for sequence regression tasks, while PhatGPT excels in generative tasks. This differentiation underscores the complementary strengths of encoder-style (BERT-like) models for property regression and decoder-style (GPT-like) models for lipid generation.
Implications and Future Directions
LipidBERT represents a significant milestone in leveraging transformer-based models for lipid molecules, pushing the boundaries of molecular property prediction. The integration of the METiS de novo lipid library and the LipidBERT model illustrates the potential for synergistic dry-wet lab integration. This approach not only paves the way for more efficient screening of potent lipid candidates but also fosters advancements in lipid-based nanomedicine development.
The paper hints at future developments, suggesting that subsequent versions of lipid generation methods, coupled with an expanding wet-lab dataset, will further enhance LipidBERT’s predictive capabilities. This ongoing update cycle, maintaining a library size of 10 million lipids, provides a dynamic platform for continuous improvement and discovery.
Conclusion
The introduced LipidBERT model exemplifies the power of self-supervised learning in molecular sciences. By addressing the scarcity of ionizable lipid structures and employing innovative secondary tasks, the model achieves state-of-the-art performance in LNP property prediction. As LipidBERT evolves with additional data and advanced methodologies, it holds promise for significantly accelerating the discovery and optimization of lipid molecules, thereby contributing to the broader field of lipid-based therapeutic delivery systems. The AiLNP platform, incorporating LipidBERT and PhatGPT, stands as a strategic tool driving research and application in this critical domain.