EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models (2405.11265v1)

Published 18 May 2024 in cs.CL and cs.AI

Abstract: In the field of environmental science, it is crucial to have robust evaluation metrics for LLMs to ensure their efficacy and accuracy. We propose EnviroExam, a comprehensive evaluation method designed to assess the knowledge of LLMs in the field of environmental science. EnviroExam is based on the curricula of top international universities, covering undergraduate, master's, and doctoral courses, and includes 936 questions across 42 core courses. By conducting 0-shot and 5-shot tests on 31 open-source LLMs, EnviroExam reveals the performance differences among these models in the domain of environmental science and provides detailed evaluation standards. The results show that 61.3% of the models passed the 5-shot tests, while 48.39% passed the 0-shot tests. By introducing the coefficient of variation as an indicator, we evaluate the performance of mainstream open-source LLMs in environmental science from multiple perspectives, providing effective criteria for selecting and fine-tuning LLMs in this field. Future research will involve constructing more domain-specific test sets using specialized environmental science textbooks to further enhance the accuracy and specificity of the evaluation.
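The coefficient-of-variation indicator mentioned in the abstract admits a simple illustration. Below is a minimal Python sketch of how per-course scores could be aggregated into such an indicator; the course scores and variable names are illustrative assumptions, not results or code from the paper.

```python
import statistics

# Hypothetical per-course accuracy scores (fraction correct) for one model
# across a handful of EnviroExam's 42 core courses; these numbers are
# illustrative placeholders, not results reported in the paper.
course_accuracies = [0.72, 0.65, 0.80, 0.58, 0.91, 0.67, 0.74]

mean_acc = statistics.mean(course_accuracies)
std_acc = statistics.stdev(course_accuracies)  # sample standard deviation

# Coefficient of variation (CV) = standard deviation / mean: a scale-free
# measure of how uneven a model's performance is across courses.
# A lower CV indicates more consistent performance across subfields.
cv = std_acc / mean_acc
print(f"mean accuracy = {mean_acc:.3f}, CV = {cv:.3f}")
```

Because the CV is dimensionless, it lets models with different mean accuracies be compared on consistency alone, which is presumably why the paper pairs it with raw pass rates.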
