Emergent Mind

Are large language models superhuman chemists?

Published Apr 1, 2024 in cs.LG, cond-mat.mtrl-sci, cs.AI, and physics.chem-ph


Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals' safety profiles. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.
Figure: ChemBench framework components, including a benchmark corpus, model evaluations, expert surveys, and public leaderboards.


  • ChemBench is a framework for evaluating the chemical knowledge and reasoning of LLMs against human chemists, addressing the need for a systematic understanding of LLM capabilities in the chemical sciences.

  • The study uses over 7,000 question-answer pairs across various chemical science topics, benchmarking LLMs against human experts and revealing areas of strength and weakness.

  • The best model, Claude 3, outperforms human chemists on average, demonstrating LLM proficiency in chemical tasks; however, performance varies across subfields, and challenges in tool integration persist.

  • The analysis underscores that LLMs' confidence estimates are often unreliable and advocates for future improvements, including better model-human interaction and a stronger emphasis on critical reasoning in chemistry education.


ChemBench is a framework designed to assess the chemical knowledge and reasoning abilities of state-of-the-art LLMs in comparison with human chemists. It addresses the current gap in systematic understanding of LLM capabilities within the chemical sciences, a gap that must be bridged to improve models and mitigate potential harms. The study evaluates leading LLMs on more than 7,000 question-answer pairs covering a broad spectrum of the chemical sciences and benchmarks them against the expertise of human chemists.

Benchmark Corpus

The ChemBench corpus is meticulously curated from diverse sources, ensuring a rigorous challenge across a breadth of topics within the chemical sciences. The dataset is notably diverse, featuring questions ranging from multiple-choice to open-ended formats, reflecting real-world applications more accurately than current benchmarks. Moreover, a "tiny" subset of the corpus facilitates routine evaluation practices, balancing broad coverage with pragmatic constraints of computational resources and time.
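A corpus of this shape, mixing multiple-choice and open-ended items across topics with a "tiny" subset for cheap routine runs, can be sketched as follows. The field names and the per-topic sampling strategy are illustrative assumptions, not ChemBench's actual schema or subset-selection procedure:

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAItem:
    """One benchmark entry; field names are illustrative, not ChemBench's schema."""
    question: str
    topic: str
    options: Optional[list[str]]  # None for open-ended questions
    answer: str

def tiny_subset(corpus: list[QAItem], per_topic: int, seed: int = 0) -> list[QAItem]:
    """Sample up to per_topic items from each topic for cheap routine evaluation."""
    rng = random.Random(seed)
    by_topic: dict[str, list[QAItem]] = {}
    for item in corpus:
        by_topic.setdefault(item.topic, []).append(item)
    subset: list[QAItem] = []
    for items in by_topic.values():
        subset.extend(rng.sample(items, min(per_topic, len(items))))
    return subset
```

Sampling per topic rather than uniformly keeps the tiny subset representative of the corpus's topical breadth even when topics are unevenly sized.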

Model Evaluation

Comprehensive Evaluation

Evaluating a range of models with ChemBench yields striking results. Notably, Claude 3, a leading LLM, on average outperforms even the best human chemists in the study, indicating remarkable proficiency on chemical tasks. Performance, however, varies widely across subfields of chemistry, highlighting distinct areas of strength and weakness.
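The headline comparison reduces to per-topic accuracy that can be contrasted across models and human experts. A minimal sketch of such scoring, a simplified stand-in for ChemBench's actual metrics:

```python
from collections import defaultdict

def score_by_topic(records: list[dict]) -> dict[str, float]:
    """Fraction of correct answers per topic.

    Each record is {"topic": str, "correct": bool} — a simplified stand-in
    for ChemBench's full result format.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["topic"]] += 1
        hits[r["topic"]] += int(r["correct"])
    return {t: hits[t] / totals[t] for t in totals}
```

Aggregating by topic, rather than reporting a single overall score, is what surfaces the subfield-level variability the study emphasizes.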

Tool-Augmented Systems

The study also extends into the realm of tool-augmented systems, showing their current limitations due to computational constraints. This underlines the necessity for advancements in tool integration to fully exploit the potential of LLMs in chemical reasoning tasks.

Performance Analysis

A detailed examination of performance by topic yields a nuanced picture of LLM capabilities. On topics such as macromolecular chemistry and biochemistry, LLMs show impressive strength. Challenges remain in areas such as chemical safety and analytical chemistry, where LLMs struggle with tasks that require nuanced chemical reasoning or safety assessment.

Confidence Estimates

Assessing LLMs' ability to self-evaluate their confidence in their answers provides critical insights into their reliability. The analysis indicates that, for some models, confidence estimates do not reliably correlate with the correctness of the answers. This highlights a crucial area for future development, aiming for LLMs that can accurately gauge and communicate their confidence levels.
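One standard way to quantify the mismatch between stated confidence and correctness is expected calibration error (ECE). This is a generic calibration check offered as an illustration, not the exact analysis used in the study:

```python
def calibration_gap(confidences: list[float], correct: list[bool], n_bins: int = 5) -> float:
    """Expected calibration error: the size-weighted average, over equal-width
    confidence bins, of |accuracy - mean confidence| within each bin.
    0 means well calibrated; larger values mean over- or under-confidence."""
    assert len(confidences) == len(correct)
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into the top bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(ok for _, ok in b) / len(b)
        conf = sum(c for c, _ in b) / len(b)
        ece += len(b) / n * abs(acc - conf)
    return ece
```

An overconfident model, e.g. one reporting 90% confidence while answering only half of the questions correctly, produces a large gap, which is the failure mode the analysis flags for safety-relevant answers.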

Conclusions and Future Directions

The groundbreaking analysis provided by ChemBench paves the way for a deeper understanding of the chemical reasoning capabilities of LLMs. While showcasing the significant achievements of LLMs in surpassing human performance on average in chemical tasks, the study also reveals the pressing need for improvements in certain areas. The findings advocate for a shift in chemistry education towards developing critical reasoning skills, given the evolving landscape of chemical research with the advent of LLMs.

Looking forward, ChemBench sets the stage for continuous advancement and evaluation of LLMs in the chemical sciences. By offering a deep dive into the strengths and limitations of current models, this study not only directs the focus towards crucial areas for improvement but also ignites discussion on the optimization of model-human interaction. The comprehensive benchmark framework established by ChemBench thus marks a significant step forward in realizing the full potential of LLMs in furthering chemical sciences.
