
GeoGalactica: A Scientific Large Language Model in Geoscience (2401.00434v2)

Published 31 Dec 2023 in cs.CL

Abstract: LLMs have achieved huge success thanks to their general knowledge and ability to solve a wide spectrum of NLP tasks. Owing to these abilities, LLMs show promise for interdisciplinary applications that foster scientific discovery within specific domains (AI for science, AI4S). Meanwhile, NLP techniques see wide and varied use in geoscience research and practice, from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an LLM for geoscience by further pre-training it on a vast collection of geoscience texts and then applying supervised fine-tuning (SFT) with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters that is, to the best of our knowledge, the largest LLM in the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica on a geoscience-related text corpus containing 65 billion tokens, the largest geoscience-specific text corpus to date. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the GeoGalactica checkpoints from the first 3/4 of pre-training.

Authors (21)
  1. Zhouhan Lin (57 papers)
  2. Cheng Deng (67 papers)
  3. Le Zhou (8 papers)
  4. Tianhang Zhang (16 papers)
  5. Yi Xu (304 papers)
  6. Yutong Xu (3 papers)
  7. Zhongmou He (5 papers)
  8. Yuanyuan Shi (62 papers)
  9. Beiya Dai (4 papers)
  10. Yunchong Song (6 papers)
  11. Boyi Zeng (4 papers)
  12. Qiyuan Chen (22 papers)
  13. Shu Wang (176 papers)
  14. Luoyi Fu (41 papers)
  15. Weinan Zhang (322 papers)
  16. Junxian He (66 papers)
  17. Yunqiang Zhu (9 papers)
  18. Xinbing Wang (98 papers)
  19. Chenghu Zhou (55 papers)
  20. Yuxun Miao (3 papers)
Citations (16)

Summary

  • The paper introduces GeoGalactica, a 30 billion parameter LLM pre-trained on 65 billion geoscience tokens and fine-tuned with 1 million geoscience-specific QA pairs.
  • It employs extensive pre-training and supervised fine-tuning to adapt the model for complex geoscience data analysis and tasks.
  • Benchmarking by senior geoscientists shows GeoGalactica sets a new standard for applying AI in geoscience research.

Introduction to GeoGalactica

The field of NLP has witnessed significant advancements with the introduction of LLMs, which have showcased their capacity to tackle a broad range of tasks across multiple domains. Given humanity's long-standing endeavor to explore and comprehend our planet, integrating NLP with geoscience is a natural progression. The combination can lead to groundbreaking discoveries by analyzing the vast volumes of data produced by Earth science research.

Tailoring AI for Geoscience

Given the complexity and diversity of geoscience data, a specialized approach is necessary to optimize the capabilities of LLMs in this domain. To create a model specifically adept at geoscience applications, a 30 billion parameter LLM, GeoGalactica, was developed using two main steps: extensive pre-training on geoscience texts, and a fine-tuning process known as Supervised Fine-Tuning (SFT). This procedure honed the model using an instruction tuning dataset containing one million geoscience-specific question-answer pairs.
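The SFT step can be illustrated with a small sketch. The prompt template, toy tokenizer, and loss-masking convention below are illustrative assumptions, not GeoGalactica's documented format; the key idea is that the training loss is computed only on the answer tokens of each QA pair, not on the prompt.

```python
# Hypothetical sketch of packing one instruction-tuning QA pair for SFT.
# The template and the -100 masking convention are illustrative assumptions.

IGNORE_INDEX = -100  # label value whose loss contribution is ignored


def toy_tokenize(text):
    """Stand-in whitespace tokenizer; a real setup would use the model's tokenizer."""
    return text.split()


def build_sft_example(question, answer, tokenize=toy_tokenize):
    """Concatenate prompt and answer tokens; mask the prompt portion so the
    loss is computed only on the answer tokens."""
    prompt = f"### Question: {question} ### Answer:"
    prompt_ids = tokenize(prompt)
    answer_ids = tokenize(answer)
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels
```

With one million such pairs, the same packing step would be applied to every example before batching.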

GeoGalactica's base is the Galactica model, itself a product of rigorous training on scientific documentation. The pre-training process leveraged a corpus of geoscience texts totaling 65 billion tokens, the largest geoscience-specific text corpus assembled to date. This dataset was drawn from the Deep-time Digital Earth project's comprehensive archive of geological knowledge.

Benchmarking GeoGalactica

To evaluate GeoGalactica, the model answered a range of geoscience exams and open-ended domain questions, with its responses assessed by a panel of senior geoscientists. In these tests, GeoGalactica demonstrated superior capability across an array of geoscience NLP tasks and showed its potential when applied to geoscience-related tools, setting a new benchmark for scientific LLMs in the geoscience field.
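Mechanically, a panel-based evaluation of this kind reduces to aggregating expert scores per question. The sketch below is a hypothetical illustration of that bookkeeping; the scoring scale and function names are assumptions, not the paper's actual protocol:

```python
from statistics import mean


def aggregate_expert_scores(ratings):
    """ratings maps each question id to the list of scores the expert panel
    assigned to the model's answer for that question. Returns the overall
    mean and per-question means. Averaging per-question means (rather than
    pooling raw scores) gives each question equal weight even if panel
    sizes differ."""
    per_question = {q: mean(scores) for q, scores in ratings.items()}
    overall = mean(per_question.values())
    return overall, per_question
```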

Open Resource and Community Contribution

Committing to the principles of open science, the project has made data curation tools and pre-training checkpoints of GeoGalactica publicly accessible on GitHub. This decision aligns with the broader intent to aid the research community in understanding the creation and functioning of a domain-specific LLM like GeoGalactica. Furthermore, insights from this research are expected to fuel further exploration into the utility of AI in various scientific disciplines, catalyzing progress within AI for science.

Beyond the model itself, additional offerings include tools for data cleaning, structured datasets for instruction-based learning, and methods for model and data analysis. These contributions support the broader goal of deepening the community's understanding of domain-specific LLMs and promoting their adoption across scientific disciplines.
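As a rough illustration of what such data-cleaning tooling involves, the sketch below applies two common corpus filters: whitespace normalization with a minimum-length cutoff, and exact-hash deduplication. The threshold and function names are assumptions for illustration; the released tools reflect the project's actual pipeline.

```python
import hashlib
import re


def clean_corpus(docs, min_chars=200):
    """Normalize whitespace, drop very short documents, and remove exact
    duplicates via a content hash. A production pipeline would typically
    add near-duplicate detection, language identification, and quality
    scoring on top of these basic filters."""
    seen = set()
    kept = []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()
        if len(text) < min_chars:
            continue  # too short to carry useful training signal
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(text)
    return kept
```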