
Abstract

Large language models (LLMs) have emerged as a transformative force in natural language understanding, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered to facilitate scientific discovery. As a burgeoning area in the AI-for-Science community, scientific LLMs warrant comprehensive exploration, yet a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language" while providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzed in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions alongside the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

The survey focuses on Sci-LLMs for scientific languages in the biological and chemical domains, spanning textual, molecular, protein, and genomic data.

Overview

  • Sci-LLMs are a specialized class of language models designed to aid scientific discovery by interpreting and generating complex scientific languages.

  • These models require vast, multifaceted datasets and adapted architectures, such as modified Transformers, to handle the distinctive structures of scientific data.

  • The paper highlights ongoing challenges such as the scarcity of quality training datasets, especially cross-modal ones, and the difficulty of evaluating Sci-LLMs.

  • Ethical concerns, including data privacy and preventing misuse, are critical in the development and deployment of Sci-LLMs.

  • Future research will focus on enlarging training datasets, improving structural data integration, and developing better evaluation metrics.

Introduction

Scientific LLMs (Sci-LLMs) are a subclass of LLMs specifically designed to facilitate scientific discovery within the AI-for-Science community. These models operate in the realm of "scientific language", a term that refers to the specialized vocabularies and grammatical constructs developed within scientific disciplines, distinct from conventional natural language. This survey presents an in-depth examination of Sci-LLMs, focusing on their roles in the biological and chemical domains.
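
To make the notion of a scientific language concrete, the short snippet below (an illustrative sketch, not material from the survey) contrasts a natural-language sentence with molecular, protein, and genomic "sentences". The SMILES string is aspirin; the protein and DNA fragments are arbitrary placeholders.

```python
# Illustrative strings only (not taken from the survey): each scientific
# "language" is a linear sequence, but with its own vocabulary and grammar.
examples = {
    "natural language": "Aspirin inhibits the cyclooxygenase enzymes.",
    "molecular (SMILES)": "CC(=O)OC1=CC=CC=C1C(=O)O",              # aspirin
    "protein (amino acids)": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # arbitrary fragment
    "genomic (DNA)": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",       # arbitrary fragment
}

for name, sentence in examples.items():
    # SMILES uses ring-closure digits and bond symbols, proteins a 20-letter
    # alphabet, and DNA only A/C/G/T -- all far from natural-language grammar.
    print(f"{name:22s} | {sentence}")
```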

Data and Model Architecture

A core aspect of Sci-LLM development involves constructing comprehensive datasets for training and fine-tuning these models. Such datasets span textual, molecular, protein, and genomic languages, often surpassing the scope and complexity of standard linguistic systems. Sci-LLMs require robust architectures that can accommodate the idiosyncrasies of scientific data—lengthy sequences in molecular languages, intricate 3D structures in proteins, or the multi-modal nature encompassing text and other scientific entities. To address these challenges, researchers have devised variations on the Transformer architecture, integrating novel attention mechanisms and pre-training strategies.
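
As a rough illustration of the pipeline such architectures sit on top of, the sketch below tokenizes a SMILES string with a regex of the kind commonly used in molecular-transformer work and passes the token IDs through a vanilla PyTorch Transformer encoder. This is a minimal, hypothetical example assuming PyTorch is available; real Sci-LLMs use much larger vocabularies, dedicated pre-training objectives, and the architectural modifications discussed in the survey.

```python
import re
import torch
import torch.nn as nn

# Regex commonly used to split SMILES into chemically meaningful tokens
# (bracketed atoms, two-letter elements, bonds, ring closures, etc.).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = torch.tensor([[vocab[t] for t in tokens]])       # shape: (batch=1, seq_len)

# Toy encoder: embedding + 2-layer Transformer; real models add positional
# encodings, padding masks, and task-specific heads on top.
embed = nn.Embedding(len(vocab), 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
hidden = encoder(embed(ids))                            # (1, seq_len, 64)
print(tokens)
print(hidden.shape)
```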

Training and Evaluation Challenges

The survey notes that despite recent advancements, there are persistent challenges concerning the scale and quality of training datasets. Cross-modal datasets, essential for enabling multi-faceted interactions among different types of scientific data, are particularly scarce and require rigorous semantic alignment. Moreover, evaluating Sci-LLMs poses its own set of complexities, especially for generative tasks where the gold standard remains wet-lab experiments. To circumvent the need for exhaustive experimental validation, developing computational benchmarks and metrics that can reliably predict real-world outcomes is indispensable.
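
As one example of the kind of computational proxy the survey points to, the sketch below (assuming RDKit is installed; this is not the survey's own code) scores a batch of generated SMILES strings for validity and uniqueness, two of the simple metrics that benchmarks such as GuacaMol and MOSES build on.

```python
from rdkit import Chem

def validity_and_uniqueness(generated_smiles: list[str]) -> tuple[float, float]:
    canonical = []
    for s in generated_smiles:
        mol = Chem.MolFromSmiles(s)                    # returns None for invalid SMILES
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))    # canonical form for deduplication
    validity = len(canonical) / len(generated_smiles) if generated_smiles else 0.0
    uniqueness = len(set(canonical)) / len(canonical) if canonical else 0.0
    return validity, uniqueness

# Toy batch: two valid molecules (one duplicated) and one invalid string.
print(validity_and_uniqueness(["CCO", "CCO", "C1CC1", "not_a_molecule"]))
```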

Ethical Considerations

Ethical considerations stand at the forefront, given Sci-LLMs' potential impact on sensitive areas like genomics. Data privacy, consent, bias mitigation, misuse prevention, and equitable access to technological benefits are paramount. Integrating ethical principles within Sci-LLMs is as much a technical challenge as it is a moral imperative.

Future Directions

Looking ahead, the survey suggests seven key research directions to hone the capabilities of Sci-LLMs. Among these, expanding the scale of pre-training datasets and incorporating 3D structural data are top priorities. Equally important is refining the evaluation metrics for models, which will be central to validating the generated scientific entities.

Conclusion

In conclusion, the survey lays out both the triumphs and tribulations of Sci-LLMs in navigating the complex landscape of scientific languages. By capturing the essence of the biological and chemical domains within a computational framework, Sci-LLMs not only accelerate scientific discovery but also pave the way toward more general artificial intelligence.

