
GeoGalactica: A Scientific Large Language Model in Geoscience (2401.00434v2)

Published 31 Dec 2023 in cs.CL

Abstract: LLMs have achieved huge success thanks to their general knowledge and ability to solve a wide spectrum of NLP tasks. Owing to these abilities, LLMs show promise for interdisciplinary applications that foster scientific discovery within specific domains (AI for science, AI4S). Meanwhile, NLP techniques see wide and varied use in geoscience research and practice, from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an LLM for geoscience by further pre-training it on a vast collection of geoscience texts and then applying supervised fine-tuning (SFT) with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters that is, to the best of our knowledge, the largest LLM in the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica on a geoscience-related text corpus containing 65 billion tokens, the largest geoscience-specific text corpus to date. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the GeoGalactica checkpoints from the first 3/4 of pre-training.

Authors (21)
  1. Zhouhan Lin (57 papers)
  2. Cheng Deng (67 papers)
  3. Le Zhou (8 papers)
  4. Tianhang Zhang (16 papers)
  5. Yi Xu (304 papers)
  6. Yutong Xu (3 papers)
  7. Zhongmou He (5 papers)
  8. Yuanyuan Shi (62 papers)
  9. Beiya Dai (4 papers)
  10. Yunchong Song (6 papers)
  11. Boyi Zeng (4 papers)
  12. Qiyuan Chen (22 papers)
  13. Shu Wang (176 papers)
  14. Luoyi Fu (41 papers)
  15. Weinan Zhang (322 papers)
  16. Junxian He (66 papers)
  17. Yunqiang Zhu (9 papers)
  18. Xinbing Wang (98 papers)
  19. Chenghu Zhou (55 papers)
  20. Yuxun Miao (3 papers)
Citations (16)

Summary

  • The paper introduces GeoGalactica, a 30 billion parameter LLM pre-trained on 65 billion geoscience tokens and fine-tuned with 1 million geoscience-specific QA pairs.
  • It employs extensive pre-training and supervised fine-tuning to adapt the model for complex geoscience data analysis and tasks.
  • Benchmarking by senior geoscientists shows GeoGalactica sets a new standard for applying AI in geoscience research.

Introduction to GeoGalactica

The field of NLP has witnessed significant advancements with the introduction of LLMs, which have showcased their capacity to tackle a broad range of tasks across multiple domains. Given humanity's long-standing endeavor to explore and comprehend our planet, integrating NLP with geoscience is a natural progression. The combination can lead to groundbreaking discoveries by analyzing the vast volumes of data produced by Earth science research.

Tailoring AI for Geoscience

Given the complexity and diversity of geoscience data, a specialized approach is necessary to optimize the capabilities of LLMs in this domain. To create a model specifically adept at geoscience applications, a 30 billion parameter LLM, GeoGalactica, was developed using two main steps: extensive pre-training on geoscience texts, and a fine-tuning process known as Supervised Fine-Tuning (SFT). This procedure honed the model using an instruction tuning dataset containing one million geoscience-specific question-answer pairs.
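The SFT step can be illustrated with a small sketch. The prompt template, toy tokenizer, and loss-masking convention below are illustrative assumptions, not GeoGalactica's documented format; the key idea is that the training loss is computed only on the answer tokens of each QA pair, not on the prompt.

```python
# Hypothetical sketch of packing one instruction-tuning QA pair for SFT.
# The template and the -100 masking convention are illustrative assumptions.

IGNORE_INDEX = -100  # label value whose loss contribution is ignored


def toy_tokenize(text):
    """Stand-in whitespace tokenizer; a real setup would use the model's tokenizer."""
    return text.split()


def build_sft_example(question, answer, tokenize=toy_tokenize):
    """Concatenate prompt and answer tokens; mask the prompt portion so the
    loss is computed only on the answer tokens."""
    prompt = f"### Question: {question} ### Answer:"
    prompt_ids = tokenize(prompt)
    answer_ids = tokenize(answer)
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels
```

With one million such pairs, the same packing step would be applied to every example before batching.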

GeoGalactica's base is the Galactica model, itself a product of rigorous training on scientific documentation. The pre-training process leveraged a corpus of geoscience texts totaling 65 billion tokens, the largest geoscience-specific text corpus assembled to date. This dataset was drawn from the Deep-time Digital Earth project's comprehensive archive of geological knowledge.

Benchmarking GeoGalactica

To evaluate GeoGalactica, the model answered a range of geoscience exams and open-ended domain questions, with its responses assessed by a panel of senior geoscientists. In these tests, GeoGalactica demonstrated superior capability across an array of geoscience NLP tasks and showed its potential when applied to geoscience-related tools, setting a new benchmark for scientific LLMs in the geoscience field.
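Mechanically, a panel-based evaluation of this kind reduces to aggregating expert scores per question. The sketch below is a hypothetical illustration of that bookkeeping; the scoring scale and function names are assumptions, not the paper's actual protocol:

```python
from statistics import mean


def aggregate_expert_scores(ratings):
    """ratings maps each question id to the list of scores the expert panel
    assigned to the model's answer for that question. Returns the overall
    mean and per-question means. Averaging per-question means (rather than
    pooling raw scores) gives each question equal weight even if panel
    sizes differ."""
    per_question = {q: mean(scores) for q, scores in ratings.items()}
    overall = mean(per_question.values())
    return overall, per_question
```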

Open Resource and Community Contribution

Committing to the principles of open science, the project has made data curation tools and pre-training checkpoints of GeoGalactica publicly accessible on GitHub. This decision aligns with the broader intent to aid the research community in understanding the creation and functioning of a domain-specific LLM like GeoGalactica. Furthermore, insights from this research are expected to fuel further exploration into the utility of AI in various scientific disciplines, catalyzing progress within AI for science.

Beyond the model itself, additional offerings include tools for data cleaning, structured datasets for instruction-based learning, and methods for model and data analysis. These contributions support the broader goal of deepening the community's understanding of domain-specific LLMs and promoting their adoption across scientific disciplines.
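As a rough illustration of what such data-cleaning tooling involves, the sketch below applies two common corpus filters: whitespace normalization with a minimum-length cutoff, and exact-hash deduplication. The threshold and function names are assumptions for illustration; the released tools reflect the project's actual pipeline.

```python
import hashlib
import re


def clean_corpus(docs, min_chars=200):
    """Normalize whitespace, drop very short documents, and remove exact
    duplicates via a content hash. A production pipeline would typically
    add near-duplicate detection, language identification, and quality
    scoring on top of these basic filters."""
    seen = set()
    kept = []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()
        if len(text) < min_chars:
            continue  # too short to carry useful training signal
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(text)
    return kept
```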