
SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery (2509.08032v1)

Published 9 Sep 2025 in cs.CL

Abstract: Scientific literature is growing exponentially, creating a critical bottleneck for researchers to efficiently synthesize knowledge. While general-purpose LLMs show potential in text processing, they often fail to capture scientific domain-specific nuances (e.g., technical jargon, methodological rigor) and struggle with complex scientific tasks, limiting their utility for interdisciplinary research. To address these gaps, this paper presents SciGPT, a domain-adapted foundation model for scientific literature understanding and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55\% for 32,000-token long-document reasoning; and (3) knowledge-aware adaptation integrating domain ontologies to bridge interdisciplinary knowledge gaps. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks including sequence labeling, generation, and inference. It also exhibits strong robustness in unseen scientific tasks, validating its potential to facilitate AI-augmented scientific discovery.

Summary

  • The paper introduces SciGPT, a domain-specific LLM that employs low-cost domain distillation and SMoE to enhance scientific literature understanding.
  • It utilizes a two-stage training process with SFT and DPO, enabling superior performance in NER, RE, and cross-domain knowledge fusion.
  • Results demonstrate that SciGPT outperforms GPT-4o on ScienceBench, achieving higher BLEU scores and improved factual accuracy on scientific tasks.

SciGPT: A Domain-Specific LLM for Scientific Literature

Introduction

The rapid expansion of scientific literature poses a significant challenge for efficient knowledge synthesis. LLMs such as GPT-4 have demonstrated notable potential in text processing yet often lack the capacity to effectively capture domain-specific language, especially in scientific contexts where technical jargon and rigorous methodologies are prevalent. This presents a critical obstacle to interdisciplinary research, where nuanced integration of diverse knowledge bases is essential. Addressing these challenges necessitates both architectural and data-driven innovations. SciGPT, a specialized domain-adapted LLM, emerges as a solution for understanding and discovering knowledge within scientific literature.

Methodology

SciGPT is built upon the Qwen3 architecture and brings several innovations tailored to the scientific domain: a low-cost domain distillation process, Sparse Mixture-of-Experts (SMoE) attention mechanism, and knowledge-aware adaptation.
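The paper does not detail its SMoE attention internals, but the core idea of sparse Mixture-of-Experts routing, where each token is processed by only its top-k experts so most expert parameters stay idle per token, can be sketched as follows. This is a minimal NumPy illustration with made-up shapes, not the paper's implementation:

```python
import numpy as np

def smoe_layer(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.
    Simplified sketch of sparse MoE routing; shapes are illustrative."""
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        # softmax over the selected experts' logits only
        w = np.exp(logits[t, sel] - logits[t, sel].max())
        w /= w.sum()
        for weight, e in zip(w, sel):
            out[t] += weight * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 5
x = rng.normal(size=(tokens, d))
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = smoe_layer(x, gate_w, experts, k=2)
print(y.shape)  # (5, 8)
```

Because only k of the n experts run per token, activated compute (and, with expert offloading, memory) scales with k rather than n, which is the kind of saving the paper's 55% memory reduction relies on.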

The data collection strategy aggregated a comprehensive multi-source corpus from public scientific corpora, domain repositories, and synthetically generated data, targeting tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and cross-domain knowledge fusion. Data preparation involved substantial cleaning, including hybrid filtering and MinHash deduplication, to ensure high-quality input (Figure 1).

Figure 1: The distribution of different categories of pretraining data for SciGPT.
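The paper does not specify its MinHash configuration, but the deduplication idea, estimating Jaccard similarity between documents from compact signatures so near-duplicates can be dropped cheaply, can be sketched like this (the shingle size and hash count here are illustrative assumptions):

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=3):
    """MinHash signature over word shingles (illustrative parameters)."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    sig = []
    for seed in range(num_hashes):
        # min over seeded hashes approximates one random permutation
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("large language models for scientific literature understanding")
b = minhash_signature("large language models for scientific literature analysis")
print(estimated_jaccard(a, b))  # near-duplicate pair -> high estimated similarity
```

Documents whose estimated similarity exceeds a threshold would be treated as duplicates and filtered; production pipelines typically add locality-sensitive hashing on top to avoid pairwise comparisons.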

Key tasks were curated into ScienceBench, a novel benchmark designed to evaluate the scientific capabilities of LLMs across several dimensions, including factual accuracy, methodological rigor, and cross-reference coherence (Figure 2).

Figure 2: Examples of questions.

Training

The training process of SciGPT features two stages: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). SciGPT uses Qwen3-8B as its base model, chosen for its balance between computational efficiency and cross-domain performance.

The first SFT stage covers structured understanding tasks, while the second shifts to generation-intensive tasks such as summarization. Training used a mix of A800 and L40s GPUs with QLoRA to reduce memory and compute costs (Figure 3).

Figure 3: Schematic of LLM SciGPT.
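The QLoRA-style adaptation mentioned above amounts to training a low-rank update on top of a frozen (quantized) weight matrix. A minimal NumPy sketch of the LoRA forward pass, with the 4-bit quantization of the base weights omitted and all shapes illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """LoRA forward: y = x @ (W + (alpha / r) * A @ B).
    W is frozen; only the low-rank factors A and B are trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

rng = np.random.default_rng(1)
d_in, d_out, r = 16, 16, 4
W = rng.normal(size=(d_in, d_out))        # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01     # trainable down-projection
B = np.zeros((r, d_out))                  # trainable up-projection, init 0
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha=8)
# With B = 0 the adapted layer matches the frozen base layer exactly
print(np.allclose(y, x @ W))  # True
```

Initializing B to zero means training starts from the unmodified base model, and only the small A, B factors (rank r instead of full d_in x d_out) consume optimizer state, which is where the memory savings come from.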

DPO further refines SciGPT through preference learning, using a hybrid dataset of human- and AI-generated preference pairs to improve preference accuracy and factual consistency. Optimization used AdamW with carefully tuned hyperparameters to maximize effectiveness on real-world scientific tasks.
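For one (chosen, rejected) pair, the standard DPO objective penalizes the policy when its log-probability margin over the reference model favors the rejected answer. A stdlib-only sketch (the beta value and log-probabilities are illustrative, not the paper's settings):

```python
import math

def dpo_loss(logp_c_pol, logp_r_pol, logp_c_ref, logp_r_ref, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(logpi_c - logref_c) - (logpi_r - logref_r)])."""
    margin = beta * ((logp_c_pol - logp_c_ref) - (logp_r_pol - logp_r_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen answer more strongly than the reference does,
# so the loss falls below log(2) (the zero-margin value):
print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0), 4))
```

Unlike RLHF, this needs no separate reward model or sampling loop: the frozen reference model's log-probabilities act as the anchor, and the policy is trained with ordinary gradient descent (AdamW here, per the paper).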

Results

SciGPT's performance was evaluated on the bespoke ScienceBench benchmark, demonstrating significant gains over general-purpose LLMs such as GPT-4o. Task-specific assessments show that SciGPT excels in tasks requiring fine-grained understanding of scientific contexts, including:

  • Sequence Labeling: SciGPT surpassed GPT-4o in NER and RE tasks, showcasing its proficiency with domain-specific terminology.
  • Generation: The model achieved higher BLEU scores in machine translation, maintaining numerical and terminological precision.
  • Inference: SciGPT outperformed the baseline in semantic matching and knowledge fusion, particularly when integrating knowledge across disciplines (Figure 4).

    Figure 4: Performance of SciGPT models on ScienceBench.
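For readers unfamiliar with the generation metric above, BLEU combines modified n-gram precision with a brevity penalty. A tiny self-contained sketch (real evaluations use sacreBLEU-style tokenization, smoothing, and up to 4-grams; this version stops at bigrams):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Tiny BLEU sketch: geometric mean of modified n-gram precisions
    times a brevity penalty. For illustration only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        precisions.append(overlap / max(1, sum(c_ngrams.values())))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the model preserves numerical precision",
                 "the model preserves numerical precision"), 2))  # 1.0
```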

Table 1: Performance comparison of SciGPT with GPT-4o on ScienceBench tasks, indicating significant improvements on domain-specific benchmarks.

Robustness and Generalization

SciGPT exhibits robust generalization capabilities, adapting well to unseen tasks such as those emerging from new relation extraction datasets. However, challenges remain in niche fields with limited training data, where generalization is reduced.

Conclusions and Future Works

SciGPT represents a significant step forward in enhancing the utility of LLMs for scientific literature analysis. The model's innovative architecture and methodologies not only address current bottlenecks in knowledge synthesis but also set a benchmark for future development of scientific LLMs.

Future endeavors will focus on enhancing interdisciplinary reasoning, improving the integration of multi-modal data, and advancing interpretability to ensure adherence to scientific research standards. With ongoing refinement, SciGPT has the potential to become an indispensable tool in scientific research, facilitating more efficient knowledge discovery and innovation across domains.
