GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text (2308.06911v3)
Abstract: Large language models (LLMs) have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing LLMs cannot capture the rich information encoded in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal LLM that integrates graph, image, and text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity compared to the baselines. With its any-to-language molecular translation strategy, our model has the potential to perform further downstream tasks, such as compound name recognition and chemical reaction prediction.
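The core idea of aligning graph, image, and text representations into one latent space can be illustrated with a minimal stdlib-only sketch. Everything here is a hypothetical toy (the modality dimensions, the linear projections, and the cosine comparison are illustrative assumptions, not GIT-Former's actual architecture): each modality gets its own projection into a shared latent space, and vectors are L2-normalized so paired molecules can be compared across modalities by cosine similarity.

```python
import math
import random

def project(x, W):
    # Toy linear projection: W is (latent_dim x in_dim), x is (in_dim,).
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def l2_normalize(v):
    # Normalize to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]

def align(features, weights):
    # Map each modality's embedding into the shared latent space.
    # NOTE: in a real model these projections would be learned (e.g. with a
    # contrastive objective pulling paired graph/image/text samples together).
    return {m: l2_normalize(project(x, weights[m])) for m, x in features.items()}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Usage: one molecule represented in three modalities (random toy embeddings).
random.seed(0)
dims = {"graph": 8, "image": 12, "text": 10}  # per-modality input sizes (made up)
latent_dim = 4
weights = {m: [[random.gauss(0, 1) for _ in range(d)] for _ in range(latent_dim)]
           for m, d in dims.items()}
features = {m: [random.gauss(0, 1) for _ in range(d)] for m, d in dims.items()}

z = align(features, weights)
# Every modality now lives in the same 4-dimensional unit sphere,
# so cross-modal similarities are directly comparable.
print(len(z["graph"]), round(cosine(z["graph"], z["graph"]), 6))  # 4 1.0
```

The design point this sketch captures is that once all modalities share one normalized latent space, any modality can serve as the query or the target, which is what enables an any-to-language translation strategy.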
Authors: Pengfei Liu, Yiming Ren, Zhixiang Ren, Jun Tao