Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
GPT-5.1
GPT-5.1 104 tok/s
Gemini 3.0 Pro 36 tok/s Pro
Gemini 2.5 Flash 133 tok/s Pro
Kimi K2 216 tok/s Pro
Claude Sonnet 4.5 37 tok/s Pro
2000 character limit reached

M$^{3}$-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery (2412.06847v2)

Published 8 Dec 2024 in q-bio.QM, cs.AI, and cs.LG

Abstract: This paper introduces M${3}$-20M, a large-scale Multi-Modal Molecule dataset that contains over 20 million molecules, with the data mainly being integrated from existing databases and partially generated by LLMs. Designed to support AI-driven drug design and discovery, M${3}$-20M is 71 times more in the number of molecules than the largest existing dataset, providing an unprecedented scale that can highly benefit the training or fine-tuning of models, including LLMs for drug design and discovery tasks. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M${3}$-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using LLMs including GLM4, GPT-3.5, GPT-4, and Llama3-8b. Our experimental results show that M${3}$-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than existing single-modal datasets, which validates the value and potential of M${3}$-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Github Logo Streamline Icon: https://streamlinehq.com
X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets

This paper has been mentioned in 2 tweets and received 0 likes.

Upgrade to Pro to view all of the tweets about this paper: