ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements? (2311.17107v1)

Published 28 Nov 2023 in cs.LG, cs.AI, cs.CL, cs.CY, and cs.IR

Abstract: Evaluating the accuracy of outputs generated by LLMs is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLMs evaluation strategies, and the use of LLMs in information retrieval systems.

References (24)
  1. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
  2. Assessing Large Language Models on climate information, 2023.
  3. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4, 2023.
  4. Cohere. Cohere’s Command Model, 2023.
  5. The PyPDF2 library, 2022.
  6. Universal language model fine-tuning for text classification, 2018.
  7. Scott Janzwood. Confident, likely, or both? The implementation of the uncertainty language framework in IPCC special reports. Climatic Change, 162(3):1655–1675, October 2020.
  8. Language Models (Mostly) Know What They Know, 2022.
  9. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP, 2023.
  10. Expert Confidence in Climate Statements (ClimateX) dataset. https://huggingface.co/datasets/rlacombe/ClimateX, 2023.
  11. ClimaBench: A Benchmark Dataset For Climate Change Text Understanding in English, 2023.
  12. Teaching Models to Express Their Uncertainty in Words, 2022.
  13. Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 2021.
  14. Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties, 2010.
  15. The Onion. The Onion: America’s Finest News Source, 2023.
  16. OpenAI. Models, 2023.
  17. Climate Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK and New York, USA, 2022.
  18. Climate Change 2022: Mitigation of Climate Change. Contribution of Working Group III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK and New York, USA, 2022.
  19. B. Thomas. Onion News Articles Dataset, 2023.
  20. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT, 2018.
  21. chatClimate: Grounding conversational ai in climate science, 2023.
  22. Climatext: A dataset for climate change topic detection, 2021.
  23. ReCOGS: How incidental details of a logical form overshadow an evaluation of semantic interpretation, 2023.
  24. Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models, 2023.

Summary

  • The paper introduces the ClimateX dataset, extracted from IPCC AR6 reports, to benchmark LLMs' prediction of expert confidence levels.
  • It evaluates GPT-3.5-turbo, GPT-4, and Cohere Command-XL using zero-shot and few-shot approaches to analyze accuracy and overconfidence.
  • Findings show limited accuracy (at most 47%), systematic overconfidence, notable gains from few-shot prompting, and performance differences tied to the models' training-data cutoffs.

This paper investigates the ability of LLMs to accurately assess the confidence levels assigned by human experts to statements about climate science. The core contribution is the ClimateX dataset, a new benchmark resource created for this evaluation.

ClimateX Dataset:

  • Source: Derived from the latest Intergovernmental Panel on Climate Change (IPCC) Assessment Report 6 (AR6) Working Group I, II, and III reports.
  • Content: Consists of 8094 statements extracted directly from the reports. Each statement is paired with the confidence level ('low', 'medium', 'high', 'very high') assigned by IPCC experts based on evidence and scientific consensus. The 'very low' confidence category was excluded due to its rarity in the final reports.
  • Creation: Sentences containing parenthetical confidence labels were extracted using regular expressions after parsing the report PDFs (see the extraction sketch after this list).
  • Structure: The dataset is split into a training set (7794 statements) and a manually reviewed test set (300 statements) designed to be representative of the full dataset's distribution across confidence levels and report sources.
  • Availability: The dataset is publicly available on Hugging Face, and the code for the experiments is on GitHub.
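The extraction step described under "Creation" above can be sketched as follows. This is a minimal illustration rather than the authors' exact pipeline: the regex pattern, the report file name, and the specific `PyPDF2.PdfReader` / NLTK `sent_tokenize` calls are assumptions consistent with the tools the paper mentions.

```python
import re

import nltk
from PyPDF2 import PdfReader  # PDF text extraction, as referenced by the paper

nltk.download("punkt", quiet=True)  # sentence-tokenizer models for NLTK

# The four confidence levels appear as parenthetical labels in IPCC AR6 text,
# e.g. "... is unprecedented (high confidence)." Order matters: match
# "very high" before "high".
CONFIDENCE_RE = re.compile(r"\((very high|high|medium|low) confidence\)")

def extract_labeled_statements(pdf_path):
    """Return (statement, confidence_label) pairs from one IPCC report PDF."""
    reader = PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    pairs = []
    for sentence in nltk.sent_tokenize(text):
        match = CONFIDENCE_RE.search(sentence)
        if match:
            label = match.group(1)
            # Drop the parenthetical label so only the statement text remains.
            statement = CONFIDENCE_RE.sub("", sentence).strip()
            pairs.append((statement, label))
    return pairs

# Hypothetical local file name; the AR6 report PDFs are published by the IPCC.
# pairs = extract_labeled_statements("IPCC_AR6_WGI_FullReport.pdf")
```

Reproducing this scraping step is only necessary to extend or re-derive the corpus; the released dataset itself can be loaded directly from the Hugging Face Hub under the ID `rlacombe/ClimateX`.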

Methodology:

  • Models Tested: OpenAI's GPT-3.5-turbo, GPT-4, and Cohere's Command-XL.
  • Task: Models were prompted to predict the expert confidence level for statements from the ClimateX test set.
  • Prompting: Two settings were used:
    • Zero-shot: The model was given instructions and the statement with no examples.
    • Few-shot: The model was given the instructions, four example statements with their ground-truth confidence labels (one from each class, randomly selected from the train set), and then the target statement. The DSPy library was used for managing few-shot prompting.
  • Evaluation Metrics:
    • Categorical confidence levels ('low', 'medium', 'high', 'very high') were mapped to numerical scores (0, 1, 2, 3).
    • Regression Analysis: The relationship between predicted scores and ground-truth scores was analyzed using:
      • Slope: Measures how well the model distinguishes between different confidence classes (perfect=1, random=0).
      • Bias: Measures systematic over- or under-confidence (unbiased=0, positive=overconfident, negative=underconfident).
    • Classification Metrics: Accuracy, Precision, Recall, and F1-score were calculated.
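To make the metrics above concrete, here is a compact sketch using toy labels. The slope is taken from an ordinary least-squares fit of predicted scores on ground-truth scores, and the bias is computed as the mean signed error, which matches the unbiased=0 / positive=overconfident convention described above; the paper's exact estimators may differ.

```python
import numpy as np

# Categorical confidence levels mapped to numerical scores, as in the paper.
SCORES = {"low": 0, "medium": 1, "high": 2, "very high": 3}

def evaluate(true_labels, predicted_labels):
    """Compute slope, bias, and accuracy from categorical confidence labels."""
    y_true = np.array([SCORES[l] for l in true_labels], dtype=float)
    y_pred = np.array([SCORES[l] for l in predicted_labels], dtype=float)

    # OLS fit of predictions on ground truth: slope 1 = perfect discrimination
    # between confidence classes, slope 0 = no discrimination.
    slope, _intercept = np.polyfit(y_true, y_pred, deg=1)

    # Mean signed error: > 0 means systematic overconfidence, < 0 underconfidence.
    bias = float(np.mean(y_pred - y_true))

    accuracy = float(np.mean(y_true == y_pred))
    return {"slope": float(slope), "bias": bias, "accuracy": accuracy}

# Toy example: the model overstates confidence on 'low' and 'medium' statements.
truth = ["low", "medium", "high", "very high", "low", "medium"]
preds = ["medium", "high", "high", "very high", "high", "high"]
print(evaluate(truth, preds))  # slope ≈ 0.44, bias ≈ 0.83, accuracy ≈ 0.33
```

Precision, recall, and F1 per class can then be computed with any standard classification toolkit on the same mapped labels.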

Key Findings:

  1. Limited Accuracy: While LLMs performed better than random guessing (and slightly better than a small sample of non-expert humans), their accuracy in predicting the correct expert confidence level was limited, peaking at 47.0% for GPT-4 in the few-shot setting.
  2. Systematic Overconfidence: All tested models, especially Cohere Command-XL, showed a tendency to overestimate confidence levels, particularly for statements that experts labeled as 'low' or 'medium' confidence.
  3. Few-Shot Improvement: Providing just four examples significantly improved the models' ability to discern confidence levels (increased slope) across all tested models.
  4. Reluctance to Admit Uncertainty: Models rarely responded with "I don't know," even when prompted that it was an option. Fewer than 4% of responses acknowledged a knowledge limitation, though the phrasing of the prompt instruction may have influenced this.
  5. Knowledge Cutoff Impact: Models performed significantly better (higher slope, better discernment) on statements from the WGI report (published before the likely knowledge cutoff date for GPT-3.5-turbo) compared to the WGII/III reports (published after). This suggests performance might be influenced by recalling specific training data rather than a general capability to assess confidence based on the statement's content alone.

Practical Implications:

  • Climate Communication: The tendency of LLMs to overstate confidence, especially for less certain findings, poses a risk for misinforming the public and policymakers about climate science nuances. Accurate communication of uncertainty is crucial in this domain.
  • LLM Evaluation: The ClimateX dataset provides a valuable tool for specifically evaluating how well LLMs handle uncertainty and expert confidence, highlighting a critical area for improvement beyond simple fact retrieval accuracy.
  • Information Retrieval: Using LLMs for retrieving climate information requires caution, as they may present findings with unwarranted certainty.
  • Future Work: The authors suggest further research into retrieval-augmented methods, fine-tuning open-source models on ClimateX, analyzing linguistic cues for confidence, and establishing a robust human expert baseline.

Implementation Considerations:

  • The process involves PDF parsing (e.g., using PyPDF2), sentence tokenization (NLTK), and regular expression matching to extract statements and labels.
  • Interacting with LLM APIs requires careful prompt engineering, especially for few-shot scenarios. Libraries like DSPy can help structure these interactions (a hand-rolled prompt sketch follows this list).
  • Evaluating performance involves mapping text labels to numerical scores and calculating both classification metrics (accuracy, F1) and regression metrics (slope, bias).
  • The difference in performance based on knowledge cut-off dates highlights the importance of considering the recency and content of LLM training data when evaluating performance on specific domains or datasets.
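For the prompt-engineering point above, a hand-rolled few-shot prompt (without DSPy) might look like the sketch below. The instruction wording, the column names ("statement", "confidence"), the split names, and the `query_llm` placeholder are all illustrative assumptions; the paper's exact prompt and DSPy program are not reproduced here.

```python
import random

from datasets import load_dataset  # Hugging Face datasets library

LEVELS = ["low", "medium", "high", "very high"]

# Public ClimateX release on the Hugging Face Hub (see the dataset reference).
# Column and split names are assumptions for illustration.
climatex = load_dataset("rlacombe/ClimateX")
train, test = climatex["train"], climatex["test"]

def few_shot_prompt(target_statement):
    """Build a prompt with one randomly drawn train example per confidence class."""
    lines = [
        "You will be shown statements from IPCC climate reports.",
        "Classify the confidence level IPCC experts assigned to each statement",
        "as one of: low, medium, high, very high.",
        "",
    ]
    for level in LEVELS:
        candidates = [row for row in train if row["confidence"] == level]
        example = random.choice(candidates)
        lines.append(f"Statement: {example['statement']}")
        lines.append(f"Confidence: {level}")
        lines.append("")
    lines.append(f"Statement: {target_statement}")
    lines.append("Confidence:")
    return "\n".join(lines)

prompt = few_shot_prompt(test[0]["statement"])
# response = query_llm(prompt)  # placeholder for a GPT-3.5-turbo / GPT-4 / Command call
```

In the zero-shot setting, the same instructions and target statement are sent without the four worked examples.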