SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis (2403.01976v5)

Published 4 Mar 2024 in cs.CL

Abstract: Recent breakthroughs in LLMs have revolutionized scientific literature analysis. However, existing benchmarks fail to adequately evaluate LLM proficiency in this domain, particularly in scenarios that require higher-level abilities beyond memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark designed for the comprehensive evaluation of LLMs in scientific literature analysis. It assesses LLMs across three capability levels: Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3), through tasks drawn from diverse scientific fields, including biology, chemistry, materials science, and medicine. To ensure the reliability of SciAssess, rigorous quality control measures were implemented covering accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting their strengths and areas for improvement. We hope this evaluation supports the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at \url{https://github.com/sci-assess/SciAssess}.
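The three-level capability taxonomy described above (L1 Memorization, L2 Comprehension, L3 Analysis & Reasoning, spread across scientific domains) can be pictured as a grouping of per-task scores into per-level aggregates. This is a minimal illustrative sketch, not the actual SciAssess code or API; all task names and scores below are hypothetical:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Task:
    name: str
    domain: str  # e.g. "biology", "chemistry", "materials", "medicine"
    level: str   # "L1" (Memorization), "L2" (Comprehension), "L3" (Analysis & Reasoning)

def accuracy_by_level(tasks, results):
    """Aggregate per-task scores (0.0-1.0) into per-level mean accuracy."""
    buckets = defaultdict(list)
    for task in tasks:
        buckets[task.level].append(results[task.name])
    return {level: sum(scores) / len(scores) for level, scores in buckets.items()}

# Hypothetical tasks and model scores, for illustration only.
tasks = [
    Task("compound_recall", "chemistry", "L1"),
    Task("figure_qa", "biology", "L2"),
    Task("reaction_reasoning", "chemistry", "L3"),
]
results = {"compound_recall": 0.8, "figure_qa": 0.6, "reaction_reasoning": 0.4}
print(accuracy_by_level(tasks, results))
# → {'L1': 0.8, 'L2': 0.6, 'L3': 0.4}
```

A per-level breakdown like this is what lets a benchmark report where a model's weaknesses sit (e.g. strong recall but weak reasoning) rather than a single aggregate number.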

Authors (23)
  1. Hengxing Cai (14 papers)
  2. Xiaochen Cai (8 papers)
  3. Junhan Chang (8 papers)
  4. Sihang Li (32 papers)
  5. Lin Yao (37 papers)
  6. Changxin Wang (7 papers)
  7. Zhifeng Gao (36 papers)
  8. Yongge Li (3 papers)
  9. Mujie Lin (4 papers)
  10. Shuwen Yang (10 papers)
  11. Jiankun Wang (61 papers)
  12. Yuqi Yin (2 papers)
  13. Yaqi Li (18 papers)
  14. Linfeng Zhang (160 papers)
  15. Guolin Ke (43 papers)
  16. Hongshuai Wang (7 papers)
  17. Mingjun Xu (7 papers)
  18. Jin Huang (80 papers)
  19. Xi Fang (26 papers)
  20. Jiaxi Zhuang (3 papers)
Citations (13)