SciDFM: An LLM with Mixture-of-Experts for Science
Introduction
The manuscript presents SciDFM, an LLM designed to enhance scientific research capabilities, particularly in disciplines that require domain-specific knowledge such as molecular structures and amino acid sequences. The model integrates a Mixture-of-Experts (MoE) architecture into a transformer-based framework to achieve sophisticated reasoning and comprehension across scientific modalities, including text, molecules, and proteins.
Model Architecture and Training
SciDFM comprises 18.2 billion total parameters, of which 5.6 billion are activated for each token. It builds on the transformer architecture with modifications adopted from the Llama family, including RMSNorm, rotary position embeddings (RoPE), and SwiGLU activations. The MoE architecture replaces the standard feed-forward blocks, enabling more effective modeling of distinct scientific domains. The tokenizer is built with the Byte-Pair Encoding (BPE) method and is optimized to treat chemical atoms and amino acids as distinct tokens, enhancing the model's ability to process specialized vocabulary.
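To make the headline architectural change concrete, the following is a minimal sketch of an MoE feed-forward block with SwiGLU experts and top-k routing, in the spirit of the design described above. It is illustrative only: the expert count, top-k value, hidden sizes, and class names are assumptions, not SciDFM's actual configuration.

```python
# Minimal sketch of an MoE feed-forward block with SwiGLU experts and top-k routing.
# Expert count, top-k, and dimensions are illustrative, not SciDFM's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: down-project the product of a SiLU-gated branch and a linear branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int = 1024, d_ff: int = 4096,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for per-token routing.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                    # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # select top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

Only the tokens' selected experts run, which is how the model keeps 5.6 billion activated parameters despite an 18.2-billion-parameter total.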
Pretraining Corpus and Methodology
The pretraining dataset includes approximately 300 billion tokens from scientific literature and 270 billion tokens from general domains, totaling about 570 billion tokens. The corpus spans a broad range of scientific disciplines, drawing on domain databases such as PubChem, UniProt, and the RefSeq genome database, as well as general scientific literature from sources such as arXiv and SlimPajama-Arxiv. The model was pretrained for two epochs, each with a distinct learning rate, using the AdamW optimizer with a cosine learning rate schedule. Training ran for about two months on a cluster of 128 A800 GPUs.
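As a rough illustration of the optimization setup described above (AdamW with a cosine learning-rate schedule), the snippet below shows a minimal training-loop skeleton. All hyperparameter values, the stand-in model, and the dummy objective are placeholders rather than the paper's settings.

```python
# Illustrative sketch: AdamW with a cosine learning-rate schedule.
# Every value here is a placeholder, not SciDFM's actual hyperparameters.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(128, 128)  # stand-in for the full model
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
total_steps = 1_000
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-5)

for step in range(total_steps):
    x = torch.randn(32, 128)             # dummy batch in place of real pretraining data
    loss = (model(x) - x).pow(2).mean()  # dummy objective in place of the LM loss
    loss.backward()
    optimizer.step()
    scheduler.step()                     # learning rate decays along a cosine curve
    optimizer.zero_grad()
```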
Instruction Tuning
To optimize SciDFM for downstream tasks, the model was fine-tuned on approximately 9.3 million samples encompassing various instruction-following data from sources including scientific datasets, general knowledge datasets, and domain-specific sets like Mol-Instructions and WebInstructSub. Fine-tuning was done over five epochs to refine SciDFM's performance on targeted scientific benchmarks.
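For concreteness, the sketch below shows one common way instruction-following records (instruction, optional input, output) can be rendered into supervised fine-tuning text. The field names and prompt template are assumptions for illustration and are not taken from the paper.

```python
# Sketch of turning instruction-following records into supervised fine-tuning text.
# Field names and the prompt template are assumptions, not SciDFM's documented format.
def format_sample(record: dict) -> str:
    instruction = record["instruction"]
    context = record.get("input", "")
    response = record["output"]
    prompt = f"### Instruction:\n{instruction}\n"
    if context:
        prompt += f"### Input:\n{context}\n"
    prompt += f"### Response:\n{response}"
    return prompt

example = {
    "instruction": "Predict the solubility class of the following molecule.",
    "input": "SMILES: CCO",
    "output": "Soluble",
}
print(format_sample(example))
```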
Evaluation and Results
General Scientific Benchmarks
SciDFM was tested against a suite of general scientific evaluation tasks, including SciEval, SciQ, ARC, GSM8K, MATH, MedQA, MedMCQA, and PubMedQA. The model demonstrated competitive performance, often outperforming other models of similar size, such as Galactica-6.7B and ChatGLM-6B. On average, SciDFM exhibited stronger results in mathematics and biology domains but showed relatively weaker performance in general science tasks.
Domain-Specific Benchmarks
On domain-specific benchmarks, SciDFM excelled, achieving state-of-the-art performance on tasks such as molecular property prediction (MoleculeNet) and various tasks within the Mol-Instructions dataset. These include molecular design, reagent prediction, and protein understanding tasks. The model's performance in these areas underscores its capability to understand and apply domain-specific knowledge effectively.
Expert Choices Analysis
An analysis using t-SNE visualizations revealed clear clustering of SciDFM's expert choices by scientific domain. The visualization not only confirmed that the model distinguishes between disciplines but also highlighted relationships between related fields, such as mathematics and physics, and chemistry and biology. This structure in expert selection suggests that the MoE layers learn to route tokens in a domain-aware way.
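A minimal sketch of this kind of analysis is shown below: per-document expert-usage vectors are projected to 2D with t-SNE and colored by domain. The routing statistics here are stubbed with random data; in the actual analysis they would be collected from the router decisions of the MoE layers.

```python
# Sketch of an expert-choice analysis: project per-document expert-usage vectors
# with t-SNE and color them by domain. Usage counts are stubbed with random data.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n_docs, n_experts = 200, 8
domains = np.random.choice(["math", "physics", "chemistry", "biology"], size=n_docs)
# expert_usage[i, e] = fraction of tokens in document i routed to expert e (stubbed)
expert_usage = np.random.dirichlet(np.ones(n_experts), size=n_docs)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(expert_usage)

for domain in np.unique(domains):
    mask = domains == domain
    plt.scatter(coords[mask, 0], coords[mask, 1], label=domain, s=10)
plt.legend()
plt.title("t-SNE of per-document expert usage (illustrative)")
plt.show()
```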
Implications and Future Directions
SciDFM's strong performance and nuanced handling of domain-specific content suggest significant potential for application in scientific research and discovery. As an open-sourced model, it offers the broader research community a valuable tool for further exploration and for developing solutions within individual scientific fields.
Looking ahead, future developments in AI could include more specialized pretraining and fine-tuning approaches that enhance domain-specific models' capabilities. Additionally, integrating multi-modal data more effectively within models like SciDFM could pave the way for even broader applications across various scientific disciplines.
Related Work
SciBERT, ProGen2, Med-PaLM, and ChemDFM are other notable contributions to domain-specific language models for science, addressing scientific literature comprehension, protein sequence understanding, medical question answering, and chemistry-specific tasks, respectively. Compared to these, SciDFM takes a more general approach, leveraging the MoE architecture to handle multiple scientific domains within a single model.
Conclusion
SciDFM represents a significant step forward in leveraging LLMs for scientific research. Its architecture, training methodology, and empirical results demonstrate its strong potential for enhancing domain-specific scientific understanding and reasoning. By opening the model for community use, SciDFM promises to further innovation and discovery in various scientific fields, emphasizing the importance of advanced, specialized AI models in contributing to scientific progress.