
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks (2305.18365v3)

Published 27 May 2023 in cs.CL and cs.AI

Abstract: LLMs with strong abilities in natural language processing tasks have emerged and have been applied in various kinds of areas such as science, finance and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate capabilities of LLMs in a wide range of tasks across the chemistry domain. We identify three key chemistry-related capabilities including understanding, reasoning and explaining to explore in LLMs and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama and Galactica) are evaluated for each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed other models and LLMs exhibit different competitive levels in eight chemistry tasks. In addition to the key findings from the comprehensive benchmark analysis, our work provides insights into the limitation of current LLMs and the impact of in-context learning settings on LLMs' performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.

Analyzing the Capabilities of LLMs in Chemistry: An Evaluation Across Eight Tasks

The capabilities of LLMs have been explored and applied across numerous domains, with promising outcomes in fields such as natural language processing and finance. However, their specific utility in chemistry is not fully understood. To address this gap, the paper "What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks" provides a systematic analysis of the performance of several LLMs across chemistry-related tasks.

Research Overview

This paper’s primary objective is to systematically benchmark five widely recognized LLMs—GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica—on a set of eight chemistry-specific tasks. The tasks are selected to probe three key chemistry-related skills: understanding, reasoning, and explaining. Evaluation draws on widely accepted chemistry datasets so that the models' practical capabilities can be assessed, with each model tested in zero-shot and few-shot in-context learning settings. The eight tasks are name prediction, property prediction, yield prediction, reaction prediction, retrosynthesis, text-based molecule design, molecule captioning, and reagents selection.
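
The zero-shot and few-shot evaluation settings can be illustrated with a minimal prompt-assembly sketch. The template wording and the demonstration reaction below are illustrative placeholders, not the paper's actual prompts:

```python
# Sketch of few-shot in-context-learning prompt assembly for a task such as
# reaction prediction. The instruction text and demonstration pair are
# illustrative assumptions, not the exact prompts used in the paper.

def build_prompt(task_description, demonstrations, query):
    """Concatenate a task description, k demonstration pairs, and the query."""
    parts = [task_description]
    for reactants, product in demonstrations:
        parts.append(f"Reactants: {reactants}\nProduct: {product}")
    parts.append(f"Reactants: {query}\nProduct:")
    return "\n\n".join(parts)

# One demonstration (an esterification, for illustration); an empty list
# would yield the zero-shot variant of the same prompt.
demos = [("CCO.CC(=O)O", "CC(=O)OCC")]
prompt = build_prompt(
    "Predict the product of the following reaction. Answer with SMILES.",
    demos,
    "CCN.CC(=O)Cl",
)
print(prompt)
```

With `demos = []`, the same function produces the zero-shot prompt, which is how the two settings differ in practice.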

Key Findings and Results

The investigation reveals several notable insights regarding the performance of LLMs in chemistry:

  • Strengths in Language-Related Tasks: LLMs like GPT-4 demonstrate substantial capabilities in tasks that involve language understanding and explanation, such as molecule captioning and text-based molecule design. The performance on these tasks often surpasses that of traditional baselines like MolT5-Large when evaluated using language generation metrics such as BLEU and Levenshtein distance.
  • Limitations in Molecule Translation: GPT models exhibit significant limitations in tasks that require precise translation between chemical representations, such as SMILES to IUPAC name conversion. These models generally perform poorly in reaction prediction and retrosynthesis tasks, owing to a lack of comprehensive understanding of molecular structures and processes.
  • In-Context Learning Advantages: The paper finds that the performance of LLMs significantly improves when in-context learning (ICL) is employed with carefully selected examples. In particular, scaffold-based retrieval of similar examples enhances the predictive accuracy of these models.
  • Competitive Performance in Classification Tasks: For tasks that can be framed as classification problems, such as yield prediction and reagents selection, GPT models are competitive, achieving results comparable to specialized machine learning models trained on large datasets.
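
The similarity-based demonstration retrieval noted above can be sketched as follows. The paper retrieves examples via molecular scaffolds and fingerprints (e.g. with RDKit); as a dependency-free stand-in, this sketch ranks candidates by a character-bigram Tanimoto similarity over raw SMILES strings, which is only a crude proxy for true structural similarity:

```python
# Sketch of similarity-based demonstration retrieval for in-context learning.
# Real scaffold/fingerprint retrieval would use a cheminformatics toolkit;
# here, character-bigram sets of SMILES strings serve as a crude stand-in
# for molecular fingerprints.

def bigrams(smiles):
    """Set of overlapping character bigrams of a SMILES string."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retrieve_demonstrations(query_smiles, pool, k=2):
    """Return the k pool molecules most similar to the query."""
    q = bigrams(query_smiles)
    ranked = sorted(pool, key=lambda s: tanimoto(q, bigrams(s)), reverse=True)
    return ranked[:k]

pool = ["CCO", "CCCO", "c1ccccc1", "CC(=O)O", "CCN"]
print(retrieve_demonstrations("CCCCO", pool, k=2))
```

The retrieved pairs would then be inserted as the few-shot demonstrations in the prompt; the paper's finding is that such similarity-guided selection outperforms random demonstration choice.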

Implications and Future Directions

The insights derived from this paper imply that, while LLMs hold potential in chemistry, their application is currently constrained: they excel in tasks requiring language processing but struggle with tasks that demand a deep understanding of complex chemical structures.

  • Need for Enhanced Molecular Understanding: To overcome the limitations recognized in molecule translation tasks, future work should focus on improving LLMs' comprehension of chemical structures. This could be achieved by integrating domain-specific knowledge or coupling LLMs with existing chemical toolkits such as RDKit.
  • Development of Chemistry-Specific Metrics: Current evaluation metrics may not fully capture the efficacy of LLMs in chemistry-related challenges. There is a need for developing more domain-specific metrics that can better assess the validity and utility of generated outputs, particularly in molecule design and captioning tasks.
  • Addressing Hallucinations: The occurrence of hallucinations, where models generate chemically unreasonable outputs, is a significant concern. Mitigating this issue is crucial for safely utilizing LLMs in chemistry and entails developing methods to incorporate constraints or additional checks into the generation process.
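
One way to act on the last two points is to insert an automatic validity check between generation and output. A real pipeline would parse candidates with a cheminformatics toolkit (e.g. RDKit's `Chem.MolFromSmiles`, which performs full valence and aromaticity sanitization); the sketch below is only a crude syntactic pre-filter, checking balanced brackets and paired ring-closure digits, and it ignores two-digit `%nn` ring closures:

```python
# Sketch of a post-generation sanity check for LLM-produced SMILES strings.
# This is a crude syntactic filter only; real validation requires a full
# SMILES parser such as RDKit's Chem.MolFromSmiles.

def crude_smiles_check(smiles):
    """Reject strings with unbalanced (), [] or unpaired ring-closure digits."""
    if not smiles:
        return False
    depth_paren = depth_brack = 0
    open_rings = set()
    in_brackets = False  # digits inside [...] are isotopes/charges, not rings
    for ch in smiles:
        if ch == "(":
            depth_paren += 1
        elif ch == ")":
            depth_paren -= 1
            if depth_paren < 0:
                return False
        elif ch == "[":
            depth_brack += 1
            in_brackets = True
        elif ch == "]":
            depth_brack -= 1
            in_brackets = False
            if depth_brack < 0:
                return False
        elif ch.isdigit() and not in_brackets:
            open_rings ^= {ch}  # ring-closure digits must appear in pairs
    return depth_paren == 0 and depth_brack == 0 and not open_rings

print(crude_smiles_check("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(crude_smiles_check("CC(=O)Oc1ccccc1C(=O)O("))  # dangling branch
```

Rejected candidates could trigger regeneration or be flagged, giving a cheap first line of defense against chemically meaningless outputs.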

Conclusion

The benchmark paper provides a valuable contribution to understanding the utility and limitations of LLMs in the field of chemistry. While LLMs exhibit strong potential in certain tasks, substantial efforts are required to address their deficiencies in understanding and processing molecular structures. Further advancements in LLM architectures and evaluation metrics will be critical in enhancing their applicability and effectiveness in chemistry. This research serves as a foundation for future exploration and innovation in integrating artificial intelligence within scientific disciplines.

Authors (8)
  1. Taicheng Guo
  2. Kehan Guo
  3. Bozhao Nan
  4. Zhenwen Liang
  5. Zhichun Guo
  6. Nitesh V. Chawla
  7. Olaf Wiest
  8. Xiangliang Zhang
Citations (82)