ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements? (2311.17107v1)

Published 28 Nov 2023 in cs.LG, cs.AI, cs.CL, cs.CY, and cs.IR

Abstract: Evaluating the accuracy of outputs generated by LLMs is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLMs evaluation strategies, and the use of LLMs in information retrieval systems.

References (24)
  1. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
  2. Assessing Large Language Models on climate information, 2023.
  3. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4, 2023.
  4. Cohere. Cohere’s Command Model, 2023.
  5. The PyPDF2 library, 2022.
  6. Universal language model fine-tuning for text classification, 2018.
  7. Scott Janzwood. Confident, likely, or both? The implementation of the uncertainty language framework in IPCC special reports. Climatic Change, 162(3):1655–1675, October 2020.
  8. Language Models (Mostly) Know What They Know, 2022.
  9. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP, 2023.
  10. Expert Confidence in Climate Statements (ClimateX) dataset. https://huggingface.co/datasets/rlacombe/ClimateX, 2023.
  11. ClimaBench: A Benchmark Dataset For Climate Change Text Understanding in English, 2023.
  12. Teaching Models to Express Their Uncertainty in Words, 2022.
  13. Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 2021.
  14. Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties, 2010.
  15. The Onion. The Onion: America’s Finest News Source, 2023.
  16. OpenAI. Models, 2023.
  17. Climate Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK and New York, USA, 2022.
  18. Climate Change 2022: Mitigation of Climate Change. Contribution of Working Group III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK and New York, USA, 2022.
  19. B. Thomas. Onion News Articles Dataset, 2023.
  20. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT, 2018.
  21. chatClimate: Grounding conversational ai in climate science, 2023.
  22. Climatext: A dataset for climate change topic detection, 2021.
  23. ReCOGS: How incidental details of a logical form overshadow an evaluation of semantic interpretation, 2023.
  24. Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models, 2023.

Summary

  • The paper introduces the ClimateX dataset, extracted from IPCC AR6 reports, to benchmark LLMs' prediction of expert confidence levels.
  • It evaluates GPT-3.5-turbo, GPT-4, and Cohere Command-XL using zero-shot and few-shot approaches to analyze accuracy and overconfidence.
  • Findings show limited accuracy (at most 47%), systematic overconfidence, notable gains from few-shot prompting, and performance differences tied to the models' training-data cutoffs.

This paper investigates the ability of LLMs to accurately assess the confidence levels assigned by human experts to statements about climate science. The core contribution is the ClimateX dataset, a new benchmark resource created for this evaluation.

ClimateX Dataset:

  • Source: Derived from the latest Intergovernmental Panel on Climate Change (IPCC) Assessment Report 6 (AR6) Working Group I, II, and III reports.
  • Content: Consists of 8094 statements extracted directly from the reports. Each statement is paired with the confidence level ('low', 'medium', 'high', 'very high') assigned by IPCC experts based on evidence and scientific consensus. The 'very low' confidence category was excluded due to its rarity in the final reports.
  • Creation: Sentences containing parenthetical confidence labels were extracted using regular expressions after parsing the report PDFs (see the extraction sketch after this list).
  • Structure: The dataset is split into a training set (7794 statements) and a manually reviewed test set (300 statements) designed to be representative of the full dataset's distribution across confidence levels and report sources.
  • Availability: The dataset is publicly available on Hugging Face, and the code for the experiments is on GitHub.
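The extraction step described under "Creation" above can be sketched as follows. This is a minimal illustration rather than the authors' exact pipeline: the regex pattern, the report file name, and the specific `PyPDF2.PdfReader` / NLTK `sent_tokenize` calls are assumptions consistent with the tools the paper mentions.

```python
import re

import nltk
from PyPDF2 import PdfReader  # PDF text extraction, as referenced by the paper

nltk.download("punkt", quiet=True)  # sentence-tokenizer models for NLTK

# The four confidence levels appear as parenthetical labels in IPCC AR6 text,
# e.g. "... is unprecedented (high confidence)." Order matters: match
# "very high" before "high".
CONFIDENCE_RE = re.compile(r"\((very high|high|medium|low) confidence\)")

def extract_labeled_statements(pdf_path):
    """Return (statement, confidence_label) pairs from one IPCC report PDF."""
    reader = PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    pairs = []
    for sentence in nltk.sent_tokenize(text):
        match = CONFIDENCE_RE.search(sentence)
        if match:
            label = match.group(1)
            # Drop the parenthetical label so only the statement text remains.
            statement = CONFIDENCE_RE.sub("", sentence).strip()
            pairs.append((statement, label))
    return pairs

# Hypothetical local file name; the AR6 report PDFs are published by the IPCC.
# pairs = extract_labeled_statements("IPCC_AR6_WGI_FullReport.pdf")
```

Reproducing this scraping step is only necessary to extend or re-derive the corpus; the released dataset itself can be loaded directly from the Hugging Face Hub under the ID `rlacombe/ClimateX`.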

Methodology:

  • Models Tested: OpenAI's GPT-3.5-turbo, GPT-4, and Cohere's Command-XL.
  • Task: Models were prompted to predict the expert confidence level for statements from the ClimateX test set.
  • Prompting: Two settings were used:
    • Zero-shot: The model was given instructions and the statement with no examples.
    • Few-shot: The model was given the instructions, four example statements with their ground-truth confidence labels (one from each class, randomly selected from the train set), and then the target statement. The DSPy library was used for managing few-shot prompting.
  • Evaluation Metrics:
    • Categorical confidence levels ('low', 'medium', 'high', 'very high') were mapped to numerical scores (0, 1, 2, 3).
    • Regression Analysis: The relationship between predicted scores and ground-truth scores was analyzed using:
      • Slope: Measures how well the model distinguishes between different confidence classes (perfect=1, random=0).
      • Bias: Measures systematic over- or under-confidence (unbiased=0, positive=overconfident, negative=underconfident).
    • Classification Metrics: Accuracy, Precision, Recall, and F1-score were calculated.
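To make the metrics above concrete, here is a compact sketch using toy labels. The slope is taken from an ordinary least-squares fit of predicted scores on ground-truth scores, and the bias is computed as the mean signed error, which matches the unbiased=0 / positive=overconfident convention described above; the paper's exact estimators may differ.

```python
import numpy as np

# Categorical confidence levels mapped to numerical scores, as in the paper.
SCORES = {"low": 0, "medium": 1, "high": 2, "very high": 3}

def evaluate(true_labels, predicted_labels):
    """Compute slope, bias, and accuracy from categorical confidence labels."""
    y_true = np.array([SCORES[l] for l in true_labels], dtype=float)
    y_pred = np.array([SCORES[l] for l in predicted_labels], dtype=float)

    # OLS fit of predictions on ground truth: slope 1 = perfect discrimination
    # between confidence classes, slope 0 = no discrimination.
    slope, _intercept = np.polyfit(y_true, y_pred, deg=1)

    # Mean signed error: > 0 means systematic overconfidence, < 0 underconfidence.
    bias = float(np.mean(y_pred - y_true))

    accuracy = float(np.mean(y_true == y_pred))
    return {"slope": float(slope), "bias": bias, "accuracy": accuracy}

# Toy example: the model overstates confidence on 'low' and 'medium' statements.
truth = ["low", "medium", "high", "very high", "low", "medium"]
preds = ["medium", "high", "high", "very high", "high", "high"]
print(evaluate(truth, preds))  # slope ≈ 0.44, bias ≈ 0.83, accuracy ≈ 0.33
```

Precision, recall, and F1 per class can then be computed with any standard classification toolkit on the same mapped labels.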

Key Findings:

  1. Limited Accuracy: While LLMs performed better than random guessing (and slightly better than a small sample of non-expert humans), their accuracy in predicting the correct expert confidence level was limited, peaking at 47.0% for GPT-4 in the few-shot setting.
  2. Systematic Overconfidence: All tested models, especially Cohere Command-XL, showed a tendency to overestimate confidence levels, particularly for statements that experts labeled as 'low' or 'medium' confidence.
  3. Few-Shot Improvement: Providing just four examples significantly improved the models' ability to discern confidence levels (increased slope) across all tested models.
  4. Reluctance to Admit Uncertainty: Models rarely responded with "I don't know," even when prompted that it was an option. Fewer than 4% of responses acknowledged a knowledge limitation, though the phrasing of the prompt instruction may have influenced this.
  5. Knowledge Cutoff Impact: Models performed significantly better (higher slope, better discernment) on statements from the WGI report (published before the likely knowledge cutoff date for GPT-3.5-turbo) compared to the WGII/III reports (published after). This suggests performance might be influenced by recalling specific training data rather than a general capability to assess confidence based on the statement's content alone.

Practical Implications:

  • Climate Communication: The tendency of LLMs to overstate confidence, especially for less certain findings, poses a risk for misinforming the public and policymakers about climate science nuances. Accurate communication of uncertainty is crucial in this domain.
  • LLM Evaluation: The ClimateX dataset provides a valuable tool for specifically evaluating how well LLMs handle uncertainty and expert confidence, highlighting a critical area for improvement beyond simple fact retrieval accuracy.
  • Information Retrieval: Using LLMs for retrieving climate information requires caution, as they may present findings with unwarranted certainty.
  • Future Work: The authors suggest further research into retrieval-augmented methods, fine-tuning open-source models on ClimateX, analyzing linguistic cues for confidence, and establishing a robust human expert baseline.

Implementation Considerations:

  • The process involves PDF parsing (e.g., using PyPDF2), sentence tokenization (NLTK), and regular expression matching to extract statements and labels.
  • Interacting with LLM APIs requires careful prompt engineering, especially for few-shot scenarios. Libraries like DSPy can help structure these interactions (a hand-rolled prompt sketch follows this list).
  • Evaluating performance involves mapping text labels to numerical scores and calculating both classification metrics (accuracy, F1) and regression metrics (slope, bias).
  • The difference in performance based on knowledge cut-off dates highlights the importance of considering the recency and content of LLM training data when evaluating performance on specific domains or datasets.
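For the prompt-engineering point above, a hand-rolled few-shot prompt (without DSPy) might look like the sketch below. The instruction wording, the column names ("statement", "confidence"), the split names, and the `query_llm` placeholder are all illustrative assumptions; the paper's exact prompt and DSPy program are not reproduced here.

```python
import random

from datasets import load_dataset  # Hugging Face datasets library

LEVELS = ["low", "medium", "high", "very high"]

# Public ClimateX release on the Hugging Face Hub (see the dataset reference).
# Column and split names are assumptions for illustration.
climatex = load_dataset("rlacombe/ClimateX")
train, test = climatex["train"], climatex["test"]

def few_shot_prompt(target_statement):
    """Build a prompt with one randomly drawn train example per confidence class."""
    lines = [
        "You will be shown statements from IPCC climate reports.",
        "Classify the confidence level IPCC experts assigned to each statement",
        "as one of: low, medium, high, very high.",
        "",
    ]
    for level in LEVELS:
        candidates = [row for row in train if row["confidence"] == level]
        example = random.choice(candidates)
        lines.append(f"Statement: {example['statement']}")
        lines.append(f"Confidence: {level}")
        lines.append("")
    lines.append(f"Statement: {target_statement}")
    lines.append("Confidence:")
    return "\n".join(lines)

prompt = few_shot_prompt(test[0]["statement"])
# response = query_llm(prompt)  # placeholder for a GPT-3.5-turbo / GPT-4 / Command call
```

In the zero-shot setting, the same instructions and target statement are sent without the four worked examples.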