
K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization (2306.05064v2)

Published 8 Jun 2023 in cs.CL and cs.AI

Abstract: LLMs have achieved great success in general domains of natural language processing. In this paper, we bring LLMs to the realm of geoscience with the objective of advancing research and applications in this field. To this end, we present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience. For instance, we have curated the first geoscience instruction tuning dataset, GeoSignal, which aims to align LLM responses to geoscience-related user queries. Additionally, we have established the first geoscience benchmark, GeoBench, to evaluate LLMs in the context of geoscience. In this work, we experiment with a complete recipe to adapt a pre-trained general-domain LLM to the geoscience domain. Specifically, we further train the LLaMA-7B model on 5.5B tokens of geoscience text corpus, including over 1 million pieces of geoscience literature, and utilize GeoSignal's supervised data to fine-tune the model. Moreover, we share a protocol that can efficiently gather domain-specific data and construct domain-supervised data, even in situations where manpower is scarce. Meanwhile, we equip K2 with the abilities of using tools to be a naive geoscience aide. Experiments conducted on the GeoBench demonstrate the effectiveness of our approach and datasets on geoscience knowledge understanding and utilization. We open-source all the training data and K2 model checkpoints at https://github.com/davendw49/k2.

An Expert Overview of the K2 LLM for Geoscience Knowledge

The paper presents K2, a language model specialized for the geoscience domain. K2 is the first geoscience-focused LLM, constructed by further training an existing general-domain model (LLaMA-7B) on a substantial corpus of geoscience-specific text: approximately 5.5 billion tokens collected from academic literature, Wikipedia pages, and other pertinent sources.

Key Contributions

  1. Specialized Tuning Data: The authors introduce a novel instruction tuning dataset named GeoSignal, crafted to enhance LMs' alignment with geoscience-related queries. Additionally, they propose GeoBench, a benchmark explicitly designed to evaluate LMs' proficiency in geoscience tasks, comprising objective and subjective questions derived from academic and standardized tests.
  2. Training and Fine-Tuning: K2's training pipeline includes two principal stages—further pre-training on domain-specific texts and instruction tuning using GeoSignal and other datasets. This approach aims to align K2 with both general human instructions and expert-level geoscience knowledge.
  3. Tool Augmentation: The model is endowed with tool-use capabilities, enabling it to support tasks typical for geoscientists, such as academic literature searches through a tool named GeoSearch. This feature allows K2 to autonomously draw upon external resources, enhancing its value as a research and knowledge assistant.
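The instruction-tuning stage described above casts supervised pairs into training prompts. The sketch below shows one common way to do this, assuming a generic Alpaca-style template; the template text and field names are illustrative, and K2's actual prompt format may differ:

```python
# Format supervised instruction-tuning examples into flat training strings.
# The Alpaca-style template below is illustrative; K2's actual format may differ.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def format_example(example: dict) -> str:
    """Render one supervised example as a single training string."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(**example)
    return PROMPT_NO_INPUT.format(
        instruction=example["instruction"], response=example["response"]
    )

sample = {
    "instruction": "Define the Mohorovicic discontinuity.",
    "input": "",
    "response": "It is the boundary between the Earth's crust and mantle.",
}
print(format_example(sample))
```

Examples with an empty `input` field fall back to the shorter template, so the model never sees an empty context block during fine-tuning.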

Evaluation and Performance

The effectiveness of K2 is demonstrated through comprehensive evaluations on GeoBench, using both objective metrics and human evaluators. The model outperforms similar-sized models, showing superior capability in knowledge understanding and reasoning, and distinguishes itself most clearly in subjective evaluations where geoscience expertise is crucial.

Theoretical and Practical Implications

Practically, K2 stands to serve as a substantial asset for geoscientists, functioning as an intelligent assistant capable of engaging with complex data queries, conducting searches, and generating coherent, informed responses that align with expert knowledge. Theoretically, this work suggests potential advancements in domain-specific LLM adaptation. The structured approach to data collection, preprocessing, and instruction tuning reveals a scalable blueprint for similar endeavors in other specialized fields. The integration of tools further emphasizes the expanding role of LMs in incorporating dynamic datasets into ongoing interactions within specialized domains.
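The tool integration described above amounts to routing a model's structured action request to an external function and feeding the result back as an observation. A minimal sketch of such a dispatch layer, with a stubbed `geo_search` standing in for the real GeoSearch tool; the stub, the registry, and the `ToolName[argument]` action format are all illustrative assumptions, not K2's actual interface:

```python
# Minimal tool-dispatch loop: the model emits an action string, a registry
# maps tool names to callables, and the result becomes an observation.
# The GeoSearch stub and action format are illustrative, not K2's interface.

def geo_search(query: str) -> str:
    """Stub for a literature-search tool; a real one would query an index."""
    fake_index = {
        "plate tectonics": "Found 3 papers on plate tectonics.",
    }
    return fake_index.get(query.lower(), "No results found.")

TOOLS = {"GeoSearch": geo_search}

def dispatch(action: str) -> str:
    """Parse 'ToolName[argument]' and invoke the matching tool."""
    name, _, rest = action.partition("[")
    if name not in TOOLS or not rest.endswith("]"):
        return f"Unknown or malformed action: {action}"
    return TOOLS[name](rest[:-1])

print(dispatch("GeoSearch[plate tectonics]"))  # -> Found 3 papers on plate tectonics.
```

Keeping the registry as plain data makes it easy to add further tools without touching the dispatch logic, which is one reason this pattern recurs in tool-augmented LM systems.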

Future Developments

Looking ahead, this research sets a foundational platform for progressively larger and more sophisticated models within the geoscience sector. Potential developments may include expanding K2's capacity to incorporate real-time data flows, enhancing its adaptability with more sophisticated tool integrations, and exploring its application to other interdisciplinary areas within natural and life sciences.

Conclusion

The introduction of K2 aligns with a growing trajectory towards specialized LMs that address particular academic domains. This model not only enriches the resources available to geoscientists but also illustrates the broader applicability of specialized LMs across various complex domains requiring a granular understanding of the subject matter. The open-sourced data and tools pave the way for further developments and collaborations within the geoscience community and beyond.

Authors (12)
  1. Cheng Deng
  2. Tianhang Zhang
  3. Zhongmou He
  4. Yi Xu
  5. Qiyuan Chen
  6. Yuanyuan Shi
  7. Luoyi Fu
  8. Weinan Zhang
  9. Xinbing Wang
  10. Chenghu Zhou
  11. Zhouhan Lin
  12. Junxian He