Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing (2212.10789v3)

Published 21 Dec 2022 in cs.LG, cs.CL, q-bio.QM, and stat.ML

Abstract: There is increasing adoption of artificial intelligence in drug discovery. However, existing studies mainly apply machine learning to the chemical structures of molecules while ignoring the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, which jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM achieves state-of-the-art generalization to novel biochemical concepts across various benchmarks.
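The training objective described in the abstract is a contrastive alignment between a molecule (structure) encoder and a text encoder. As a rough illustration only, not the authors' implementation, the sketch below shows a symmetric InfoNCE-style loss over a batch of structure-text pairs; the function name, temperature value, and the assumption that both encoders emit fixed-size embeddings (`mol_emb`, `text_emb`) are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(mol_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired
    molecule and text embeddings, each of shape [batch, dim]."""
    # Project both modalities onto the unit sphere.
    mol_emb = F.normalize(mol_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = mol_emb @ text_emb.t() / temperature

    # The i-th molecule is paired with the i-th description.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: molecule-to-text and text-to-molecule.
    loss_m2t = F.cross_entropy(logits, targets)
    loss_t2m = F.cross_entropy(logits.t(), targets)
    return (loss_m2t + loss_t2m) / 2
```

After pretraining with such an objective, zero-shot structure-text retrieval reduces to ranking candidates by cosine similarity in the shared embedding space, which is what gives the model its open-vocabulary behavior.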

Authors (9)
  1. Shengchao Liu (30 papers)
  2. Weili Nie (41 papers)
  3. Chengpeng Wang (13 papers)
  4. Jiarui Lu (31 papers)
  5. Zhuoran Qiao (9 papers)
  6. Ling Liu (132 papers)
  7. Jian Tang (327 papers)
  8. Chaowei Xiao (110 papers)
  9. Anima Anandkumar (236 papers)
Citations (114)