ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change (2401.09646v1)
Abstract: This paper introduces ClimateGPT, a family of domain-specific LLMs that synthesize interdisciplinary research on climate change. We trained two 7B models from scratch on a science-oriented dataset of 300B tokens: for the first, 4.2B domain-specific tokens were included during pre-training; the second was adapted to the climate domain after pre-training. Additionally, ClimateGPT-7B, 13B, and 70B are continuously pre-trained from Llama 2 on a domain-specific dataset of 4.2B tokens. Each model is instruction fine-tuned on a high-quality, human-generated domain-specific dataset created in close cooperation with climate scientists. To reduce hallucinations, we optimize the model for retrieval augmentation and propose a hierarchical retrieval strategy. To make the model accessible to non-English speakers, we propose using cascaded machine translation and show that this approach performs comparably to natively multilingual models while being easier to scale to a large number of languages. Further, to address the intrinsically interdisciplinary nature of climate change, we consider different research perspectives, so the model can produce in-depth answers focused on individual perspectives in addition to an overall answer. We propose a suite of automatic climate-specific benchmarks for evaluating LLMs. On these benchmarks, ClimateGPT-7B performs on par with the ten-times-larger Llama-2-70B Chat model without degrading results on general-domain benchmarks. Our human evaluation confirms the trends observed in our benchmarks. All models were trained and evaluated using renewable energy and are released publicly.
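To make the two system-level ideas in the abstract concrete, here is a minimal sketch of how hierarchical retrieval and a cascaded-translation wrapper could fit together. This is not the authors' implementation: every name below (`doc_index`, `passage_indexes`, `translate`, `llm`, the two-stage document-then-passage reading of "hierarchical retrieval") is an illustrative assumption.

```python
# Hypothetical sketch of hierarchical retrieval plus cascaded machine
# translation around an English-only climate LLM. Interfaces are assumed:
# doc_index / passage index objects expose .search(query, k), the llm
# exposes .generate(prompt), and translate(text, src, tgt) is an external MT
# system, as in a cascaded (translate-in, answer, translate-out) pipeline.
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float


def hierarchical_retrieve(query, doc_index, passage_indexes,
                          k_docs=5, k_passages=3):
    """Two-stage retrieval: rank whole documents first (coarse), then rank
    passages only within the top documents (fine)."""
    top_docs = doc_index.search(query, k=k_docs)  # coarse stage
    passages = []
    for doc in top_docs:
        # fine stage, restricted to each shortlisted document
        passages += passage_indexes[doc.doc_id].search(query, k=k_passages)
    return sorted(passages, key=lambda p: p.score, reverse=True)


def cascaded_answer(question, lang, translate, llm, doc_index, passage_indexes):
    """Cascaded MT wrapper: translate the question into English, answer with
    retrieved context, translate the answer back to the user's language."""
    q_en = translate(question, src=lang, tgt="en") if lang != "en" else question
    context = "\n".join(
        p.text for p in hierarchical_retrieve(q_en, doc_index, passage_indexes)
    )
    answer_en = llm.generate(f"Context:\n{context}\n\nQuestion: {q_en}\nAnswer:")
    return translate(answer_en, src="en", tgt=lang) if lang != "en" else answer_en
```

One property of the cascaded design worth noting: the climate model itself stays monolingual, so adding a new language only requires an MT model for that language, which is why the abstract describes the approach as easier to scale than natively multilingual training.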