EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models (2405.11265v1)
Abstract: In the field of environmental science, robust evaluation metrics for large language models (LLMs) are essential to ensure their efficacy and accuracy. We propose EnviroExam, a comprehensive evaluation method designed to assess the knowledge of LLMs in environmental science. EnviroExam is based on the curricula of top international universities, covering undergraduate, master's, and doctoral courses, and includes 936 questions across 42 core courses. By conducting 0-shot and 5-shot tests on 31 open-source LLMs, EnviroExam reveals the performance differences among these models in the domain of environmental science and provides detailed evaluation standards. The results show that 61.3% of the models passed the 5-shot tests, while 48.39% passed the 0-shot tests. By introducing the coefficient of variation as an indicator, we evaluate the performance of mainstream open-source LLMs in environmental science from multiple perspectives, providing effective criteria for selecting and fine-tuning LLMs in this field. Future research will involve constructing more domain-specific test sets from specialized environmental science textbooks to further improve the accuracy and specificity of the evaluation.
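The abstract introduces the coefficient of variation (CV), the ratio of the standard deviation to the mean, as an indicator of how evenly a model scores across courses. Below is a minimal sketch of how such an indicator could be computed from per-course accuracies; the function name, the example scores, and the five-course toy data are illustrative assumptions, not the paper's actual scoring code (the real benchmark covers 42 core courses).

```python
import statistics

def coefficient_of_variation(scores):
    """Relative dispersion of per-course accuracies: std / mean.

    A lower value means the model performs more consistently across courses.
    """
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("mean accuracy is zero; CV is undefined")
    return statistics.stdev(scores) / mean

# Hypothetical per-course accuracies (fractions) for one model under
# 0-shot and 5-shot prompting.
zero_shot = [0.52, 0.61, 0.47, 0.58, 0.55]
five_shot = [0.63, 0.66, 0.60, 0.65, 0.62]

for label, scores in [("0-shot", zero_shot), ("5-shot", five_shot)]:
    print(f"{label}: mean={statistics.mean(scores):.3f}, "
          f"CV={coefficient_of_variation(scores):.3f}")
```

Under this reading, two models with the same mean accuracy can be distinguished by their CV: the one with the lower CV delivers more uniform performance across environmental science subfields.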