
SciDFM: A Large Language Model with Mixture-of-Experts for Science (2409.18412v3)

Published 27 Sep 2024 in cs.CL and cs.AI

Abstract: Recently, there has been a significant upsurge of interest in leveraging LLMs to assist scientific discovery. However, most LLMs only focus on general science, while they lack domain-specific knowledge, such as chemical molecules and amino acid sequences. To bridge these gaps, we introduce SciDFM, a mixture-of-experts LLM, which is trained from scratch and is able to conduct college-level scientific reasoning and understand molecules and amino acid sequences. We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases. We further fine-tune the pre-trained model on lots of instruction data to improve performances on downstream benchmarks. From experiment results, we show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size. We further analyze the expert layers and show that the results of expert selection vary with data from different disciplines. To benefit the broader research community, we open-source SciDFM at https://huggingface.co/OpenDFM/SciDFM-MoE-A5.6B-v1.0.

Authors (10)
  1. Liangtai Sun (8 papers)
  2. Danyu Luo (1 paper)
  3. Da Ma (28 papers)
  4. Zihan Zhao (37 papers)
  5. Baocai Chen (2 papers)
  6. Zhennan Shen (4 papers)
  7. Su Zhu (29 papers)
  8. Lu Chen (245 papers)
  9. Xin Chen (457 papers)
  10. Kai Yu (202 papers)
Citations (1)

Summary

SciDFM: An LLM with Mixture-of-Experts for Science

Introduction

The manuscript presents SciDFM, an advanced LLM designed to enhance scientific research capabilities, specifically in disciplines requiring domain-specific knowledge such as molecular chemistry and amino acid sequences. This model integrates a Mixture-of-Experts (MoE) architecture within a transformer-based framework to achieve sophisticated reasoning and comprehension across various scientific modalities, including text, molecules, and proteins.

Model Architecture and Training

SciDFM comprises 18.2 billion parameters in total, of which 5.6 billion are activated. It builds on the transformer architecture with Llama-style modifications such as RMSNorm, RoPE, and SwiGLU, and replaces the standard feed-forward blocks with Mixture-of-Experts layers to model distinct scientific domains more effectively. The tokenizer is built with the Byte-Pair Encoding (BPE) method and is optimized to represent chemical atoms and amino acids as distinct tokens, improving the model's ability to process specialized vocabulary.
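
To make the architectural description concrete, the following is a minimal PyTorch sketch of an MoE layer standing in for the dense feed-forward block, with SwiGLU experts and top-k routing. The expert count, hidden size, and top-k value are illustrative assumptions, not SciDFM's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a Llama-style SwiGLU feed-forward block."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoELayer(nn.Module):
    """Stands in for the dense FFN: a linear router sends each token to its top-k experts."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_hidden) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                  # (n_tokens, d_model)
        weights, idx = self.router(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                 # normalized gate weights
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(tokens[mask])
        return out.reshape_as(x)
```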

Pretraining Corpus and Methodology

The pretraining dataset includes approximately 300 billion tokens from scientific literature and 270 billion tokens from general domains, totaling 570 billion tokens. The corpus covers a broad spectrum of scientific disciplines by sourcing data from domain-specific databases such as PubChem, Uniprot, and RefSeq Genome, along with general scientific literature from sources like Arxiv and SlimPajama-Arxiv. The model was pretrained for two epochs, each with a distinct learning rate, using the AdamW optimizer and a cosine learning rate schedule. Training ran for roughly two months on a cluster of 128 A800 GPUs.
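
The optimizer setup described above can be sketched as follows; the peak learning rate, warmup length, total step count, and model are placeholder values for illustration, not SciDFM's actual hyperparameters.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def cosine_schedule(optimizer, warmup_steps: int, total_steps: int, min_ratio: float = 0.1):
    """Linear warmup followed by cosine decay toward min_ratio * peak_lr."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

# Placeholder model and hyperparameters, for illustration only.
model = torch.nn.Linear(4096, 4096)
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
scheduler = cosine_schedule(optimizer, warmup_steps=2_000, total_steps=100_000)

for step in range(10):                                   # skeleton of the training loop
    loss = model(torch.randn(8, 4096)).pow(2).mean()     # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```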

Instruction Tuning

To adapt SciDFM to downstream tasks, the model was fine-tuned on approximately 9.3 million instruction-following samples drawn from scientific datasets, general knowledge datasets, and domain-specific collections such as Mol-Instructions and WebInstructSub. Fine-tuning ran for five epochs to refine SciDFM's performance on the targeted scientific benchmarks.
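
As a rough illustration of what such instruction data looks like once flattened for supervised fine-tuning, the snippet below assumes Alpaca-style instruction/input/output fields (as used by Mol-Instructions-like datasets); the prompt template itself is an assumption, not the paper's exact format.

```python
def format_instruction_sample(sample: dict) -> str:
    """Flatten one instruction-following record into a single training string.

    Assumes Alpaca-style 'instruction' / optional 'input' / 'output' fields;
    the prompt template is illustrative, not SciDFM's actual format.
    """
    prompt = sample["instruction"]
    if sample.get("input"):
        prompt += "\n" + sample["input"]
    return f"### Instruction:\n{prompt}\n\n### Response:\n{sample['output']}"

example = {
    "instruction": "Describe the functional groups present in the following molecule.",
    "input": "SMILES: CC(=O)OC1=CC=CC=C1C(=O)O",
    "output": "The molecule contains an ester group and a carboxylic acid group.",
}
print(format_instruction_sample(example))
```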

Evaluation and Results

General Scientific Benchmarks

SciDFM was tested against a suite of general scientific evaluation tasks, including SciEval, SciQ, ARC, GSM8K, MATH, MedQA, MedMCQA, and PubMedQA. The model demonstrated competitive performance, often outperforming other models of similar size, such as Galactica-6.7B and ChatGLM-6B. On average, SciDFM exhibited stronger results in mathematics and biology domains but showed relatively weaker performance in general science tasks.
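
For the multiple-choice benchmarks in this suite (SciQ, ARC, MedQA, MedMCQA, and similar), a common evaluation recipe is to score each answer option by its log-likelihood under the model and pick the highest-scoring one. The sketch below follows that generic recipe; it is not necessarily the exact protocol used in the paper.

```python
import torch

def option_log_likelihood(model, tokenizer, question: str, option: str) -> float:
    """Sum of log-probabilities of the answer-option tokens, conditioned on the question."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t + 1
    return sum(
        log_probs[t, full_ids[0, t + 1]].item()
        for t in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

def predict_choice(model, tokenizer, question: str, options: list) -> int:
    """Return the index of the option the model assigns the highest likelihood."""
    scores = [option_log_likelihood(model, tokenizer, question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)

# Hypothetical usage with the open-sourced checkpoint named in the abstract:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("OpenDFM/SciDFM-MoE-A5.6B-v1.0", trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("OpenDFM/SciDFM-MoE-A5.6B-v1.0", trust_remote_code=True)
```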

Domain-Specific Benchmarks

On domain-specific benchmarks, SciDFM excelled, achieving state-of-the-art performance on tasks such as molecular property prediction (MoleculeNet) and various tasks within the Mol-Instructions dataset. These include molecular design, reagent prediction, and protein understanding tasks. The model's performance in these areas underscores its capability to understand and apply domain-specific knowledge effectively.
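
To make the molecular inputs concrete: the architecture section notes that chemical atoms are treated as distinct tokens, which for SMILES strings amounts to an atom-level split rather than generic subword splitting. The regular expression below is a simplified, assumed tokenization for illustration, not the model's actual tokenizer.

```python
import re

# A simplified atom-level SMILES splitter: two-letter atoms first, then bracketed atoms,
# single-letter atoms, ring-closure digits, and bond/branch symbols. Illustrative only.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOSPFIbcnosp]|[0-9]|[=#\-+()/\\@]")

def atom_tokenize(smiles: str) -> list:
    return SMILES_TOKEN.findall(smiles)

print(atom_tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', '=', 'C', 'C', '=', 'C', ...]
```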

Expert Choices Analysis

An analysis using t-SNE visualizations revealed clear clustering patterns in SciDFM's expert choices based on different scientific domains. This visualization not only affirmed the model's capability to distinguish between disciplines but also highlighted the interrelationships between fields such as mathematics and physics, and chemistry and biology. Such detailed nuance in expert layer selection points to the sophisticated internal workings of the MoE architecture.
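
A sketch of how such an expert-choice analysis could be reproduced: collect per-document routing frequencies from the MoE routers, then project them with t-SNE and color by discipline. The array shapes and the random placeholder data below are purely illustrative stand-ins for real router statistics.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# One row per document, one column per (layer, expert) routing frequency.
# Random placeholder data stands in for counts that would really be collected
# from the MoE routers during forward passes over documents of each discipline.
rng = np.random.default_rng(0)
disciplines = ["math", "physics", "chemistry", "biology"]
expert_counts = rng.random((400, 16 * 8))     # 400 docs; 16 layers x 8 experts is an assumed shape
labels = np.repeat(np.arange(len(disciplines)), 100)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(expert_counts)

for i, name in enumerate(disciplines):
    pts = embedding[labels == i]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
plt.legend()
plt.title("t-SNE of per-document expert routing frequencies (illustrative)")
plt.show()
```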

Implications and Future Directions

SciDFM's advanced performance and nuanced understanding of domain-specific content suggest significant potential for its application in scientific research and discovery. By being open-sourced, it offers a valuable tool for the broader research community to further explore and develop innovative solutions in individual scientific fields.
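
Since the checkpoint is released on Hugging Face (see the link in the abstract), one plausible way to load and query it with the transformers library is sketched below; the trust_remote_code flag and generation settings are assumptions rather than documented usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenDFM/SciDFM-MoE-A5.6B-v1.0"  # checkpoint named in the abstract

# trust_remote_code is an assumption: custom MoE architectures often ship
# their own modeling code alongside the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

prompt = "Explain why alpha helices are stabilized by hydrogen bonds."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```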

Looking ahead, future developments in AI could include more specialized pretraining and fine-tuning approaches that enhance domain-specific models' capabilities. Additionally, integrating multi-modal data more effectively within models like SciDFM could pave the way for even broader applications across various scientific disciplines.

Related Work

SciBERT, ProGen2, Med-PaLM, and ChemDFM are other notable models contributing to domain-specific LLMs for science. These models have made strides in scientific literature comprehension, protein sequence understanding, medical question answering, and chemistry-specific tasks, respectively. Compared to these, SciDFM offers a more generalized yet sophisticated approach by leveraging the MoE architecture to handle multiple scientific domains effectively.

Conclusion

SciDFM represents a significant step forward in leveraging LLMs for scientific research. Its architecture, training methodology, and empirical results demonstrate its strong potential for enhancing domain-specific scientific understanding and reasoning. By opening the model for community use, SciDFM promises to further innovation and discovery in various scientific fields, emphasizing the importance of advanced, specialized AI models in contributing to scientific progress.
