Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
134 tokens/sec
GPT-4o
10 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering (2303.05352v3)

Published 7 Mar 2023 in cs.CL and cond-mat.mtrl-sci

Abstract: There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, LLMs, and recently, LLMs. Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work we propose the ChatExtract method that can fully automate very accurate data extraction with minimal initial effort and background, using an advanced conversational LLM. ChatExtract consists of a set of engineered prompts applied to a conversational LLM that both identify sentences with data, extract that data, and assure the data's correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. ChatExtract can be applied with any conversational LLMs and yields very high quality data extraction. In tests on materials data we find precision and recall both close to 90% from the best conversational LLMs, like ChatGPT-4. We demonstrate that the exceptional performance is enabled by the information retention in a conversational model combined with purposeful redundancy and introducing uncertainty through follow-up prompts. These results suggest that approaches similar to ChatExtract, due to their simplicity, transferability, and accuracy are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using ChatExtract.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (14)
  1. M. C. Swain and J. M. Cole, Chemdataextractor: A toolkit for automated extraction of chemical information from the scientific literature, Journal of Chemical Information and Modeling 56, 1894 (2016).
  2. C. Court and J. Cole, Magnetic and superconducting phase diagrams and transition temperatures predicted using text mining and machine learning, npj Comput Mater 6, 18 (2020).
  3. P. Kumar, S. Kabra, and J. Cole, Auto-generating databases of yield strength and grain size using chemdataextractor, Sci Data 9, 292 (2022).
  4. O. Sierepeklis and J. Cole, A thermoelectric materials database auto-generated from the scientific literature using chemdataextractor, Sci Data 9, 648 (2022).
  5. J. Zhao and J. M. Cole, Reconstructing chromatic-dispersion relations and predicting refractive indices using text mining and machine learning, Journal of Chemical Information and Modeling 62, 2670 (2022a).
  6. J. Zhao and J. Cole, A database of refractive indices and dielectric constants auto-generated using chemdataextractor, Sci Data 9, 192 (2022b).
  7. E. Beard and J. Cole, Perovskite- and dye-sensitized solar-cell device databases auto-generated using chemdataextractor, Sci Data 9, 329 (2022).
  8. Q. Dong and J. Cole, Auto-generated database of semiconductor band gaps using chemdataextractor, Sci Data 9, 193 (2022).
  9. J. E. Saal, A. O. Oliynyk, and B. Meredig, Machine learning in materials discovery: Confirmed predictions and their underlying approaches, Annual Review of Materials Research 50, 49 (2020).
  10. D. Morgan and R. Jacobs, Opportunities and challenges for machine learning in materials science, Annual Review of Materials Research 50, 71 (2020).
  11. J. Zhao and J. M. Cole, Reconstructing chromatic-dispersion relations and predicting refractive indices using text mining and machine learning, Journal of Chemical Information and Modeling 62, 2670 (2022c).
  12. Midjourney, https://www.midjourney.com, [Online; accessed 08-Feb-2023].
  13. M. P. Polak and D. Morgan, Extracting accurate materials data from research papers with conversational language models and prompt engineering,   (2023a), arXiv:2303.05352 .
  14. M. P. Polak and D. Morgan, Datasets and Supporting Information to the paper entitled ’Using conversational AI to automatically extract data from research papers - example of ChatGPT’ 10.6084/m9.figshare.22213747 (2023b).
Citations (96)

Summary

  • The paper introduces ChatExtract, achieving precision of 90.8% and recall of 87.7% in extracting material property triplets.
  • The methodology leverages prompt engineering to mitigate LLM hallucinations and ensures streamlined, high-accuracy data extraction.
  • The technique’s low-barrier approach and adaptability demonstrate its potential for automating diverse materials science research datasets.

Extracting Accurate Materials Data from Research Papers with Conversational LLMs and Prompt Engineering

In this paper, Polak and Morgan present an innovative method, "ChatExtract," developed for high-precision data extraction from materials science literature using conversational LLMs, notably ChatGPT. The paper addresses the challenges faced in the automation of information extraction from research papers, highlighting that prior systems often demand significant initial setup, specific technical knowledge, and bespoke training, which can be restrictive for researchers.

Methodology and Results

ChatExtract leverages advanced conversational LLMs combined with prompt engineering to significantly streamline the extraction of structured data, specifically material property triplets comprising Material, Value, and Unit. The distinct advantage of this method is its low barrier to entry, requiring minimal setup and domain expertise while maintaining high accuracy. In empirical tests, it demonstrates precision and recall nearly reaching 90% with leading LLMs like GPT-4, specifically scoring 90.8% precision and 87.7% recall on a dataset of bulk modulus.

Key to the ChatExtract method are engineered prompts and follow-up questions, which offer a robust solution to the common pitfalls of LLMs, such as the generation of factually incorrect or "hallucinated" data. This prompt engineering approach also introduces purposeful redundancy and uncertainty, which serve to refine the accuracy of data extraction. The paper suggests that this framework, due to its simplicity and flexibility, can be applied across various datasets.

Two significant material science databases were developed using ChatExtract: one concerning critical cooling rates of metallic glasses and another on yield strengths in high entropy alloys. The critical cooling rates database, for example, achieved precision and recall values of 91.6% and 83.6%, respectively, in its standardized form. This demonstrates not only the efficacy of the method but also its potential to handle complex scientific datasets.

Comparative Analysis

The authors conducted a comparative analysis between ChatExtract and existing methods, such as ChemDataExtractor 2 (CDE2). ChatExtract exhibited superior performance in all tested scenarios, emphasizing the utility of conversational LLMs when enhanced by meticulous prompt engineering. Furthermore, assessments against different LLMs (including LLaMA2-chat) highlight ChatExtract's adaptability and room for performance improvements as LLMs evolve.

Implications and Future Work

Practically, ChatExtract presents a versatile tool to expedite materials data extraction, enhancing the accessibility and depth of database compilation. Theoretically, it implies a shift towards more generalizable and less resource-intensive solutions in NLP applications for scientific research. The approach is posited to improve in tandem with advances in LLM capabilities, suggesting an evolving landscape where AI-augmented tools could progressively diminish the manual labor in data extraction processes across domains, extending far beyond materials science.

The future scope for ChatExtract involves exploring its adaptability to a broader range of properties and conditions, potentially incorporating multiparameter triplets for more granular analyses. Additionally, further integration with figure and table data extraction hints at more comprehensive, automated academic content parsing systems.

In conclusion, Polak and Morgan's ChatExtract marks a step forward in the efficient and accurate extraction of data using conversational AI, setting a benchmark for forthcoming methodologies in AI-augmented scientific data mining. The approach not only facilitates current research endeavors but also sets the stage for more robust and automated data management ecosystems in scientific communities.

Youtube Logo Streamline Icon: https://streamlinehq.com