GeoGalactica: A Scientific Large Language Model in Geoscience (2401.00434v2)
Abstract: Large language models (LLMs) have achieved huge success thanks to their general knowledge and their ability to solve a wide spectrum of tasks in natural language processing (NLP). Owing to these abilities, LLMs have opened up potential interdisciplinary applications that use artificial intelligence to foster scientific discovery in specific domains (AI for science, AI4S). Meanwhile, the use of NLP techniques in geoscience research and practice is broad and varied, ranging from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an LLM for geoscience by further pre-training it on a vast amount of geoscience text and then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest LLM for the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica. We train GeoGalactica on a geoscience-related text corpus containing 65 billion tokens, which stands as the largest geoscience-specific text corpus. We then fine-tune the model on 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica from the first 3/4 of pre-training.
Authors: Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, Boyi Zeng, Qiyuan Chen, Shu Wang, Luoyi Fu, Weinan Zhang, Junxian He, Yunqiang Zhu, Xinbing Wang, Chenghu Zhou, Yuxun Miao