
Agent-based Learning of Materials Datasets from Scientific Literature (2312.11690v1)

Published 18 Dec 2023 in cs.AI

Abstract: Advancements in machine learning and artificial intelligence are transforming materials discovery. Yet, the availability of structured experimental data remains a bottleneck. The vast corpus of scientific literature presents a valuable and rich resource of such data. However, manual dataset creation from these resources is challenging due to issues in maintaining quality and consistency, scalability limitations, and the risk of human error and bias. Therefore, in this work, we develop a chemist AI agent, powered by LLMs, to overcome these challenges by autonomously creating structured datasets from natural language text, ranging from sentences and paragraphs to extensive scientific research articles. Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles, scientists, the Internet and other tools altogether. We benchmark the performance of our approach in three different information extraction tasks with various levels of complexity, including solid-state impurity doping, metal-organic framework (MOF) chemical formula, and property relations. Our results demonstrate that our zero-shot agent, with the appropriate tools, is capable of attaining performance that is either superior or comparable to the state-of-the-art fine-tuned materials information extraction methods. This approach simplifies compilation of machine learning-ready datasets for various materials discovery applications, and significantly eases the accessibility of advanced natural language processing tools for novice users in natural language. The methodology in this work is developed as open-source software at https://github.com/AI4ChemS/Eunomia.

Introduction

Advancements in AI have significantly influenced materials science, particularly in transforming the vast amount of unstructured text in scientific literature into structured, machine-readable datasets. Collecting and structuring such data is central to building computational models that facilitate materials discovery and design. The difficulties of manual data extraction, including maintaining quality, scaling to large corpora, and mitigating human error and bias, underscore the need for automated methods. An AI agent named Eunomia has been developed to address these challenges by extracting information from various types of text within scientific papers and compiling it into structured datasets.

AI Agent Fundamentals

An AI agent is a system that acts autonomously to achieve goals based on information from its environment. The chemist AI agent, Eunomia, is built around an LLM that plans and acts by drawing on domain-specific knowledge bases and tools. The core LLM is augmented with domain-specific toolkits that extract information from text ranging from single sentences to full scientific papers. These toolkits, which include document search, Chain-of-Verification, dataset search, and CSV generation, let the agent carry out multi-step extraction tasks. The design aims for robust extraction while minimizing erroneous generated content, known as hallucinations.
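The search, draft, and verify pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Eunomia's implementation: the names `Claim`, `search_document`, `verify_claim`, `run_agent`, and `noisy_propose` are hypothetical, the LLM is replaced by a plain function passed in as `propose`, and the verification step is a simple co-occurrence check standing in for Chain-of-Verification. The sample text is invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    """A hypothetical host/dopant pair extracted from text."""
    host: str
    dopant: str


def search_document(text: str, keyword: str) -> List[str]:
    """Document-search stand-in: return the sentences that mention the keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if keyword.lower() in s.lower()]


def verify_claim(claim: Claim, evidence: List[str]) -> bool:
    """Verification stand-in: keep a claim only if host and dopant
    appear together in at least one retrieved sentence."""
    return any(claim.host in s and claim.dopant in s for s in evidence)


def run_agent(text: str, keyword: str,
              propose: Callable[[List[str]], List[Claim]]) -> List[Claim]:
    """Retrieve evidence, draft claims with `propose`, then drop unsupported ones."""
    evidence = search_document(text, keyword)
    draft = propose(evidence)
    return [c for c in draft if verify_claim(c, evidence)]


def noisy_propose(evidence: List[str]) -> List[Claim]:
    """Stand-in for an LLM: a regex extractor plus one deliberately unsupported claim."""
    pattern = re.compile(r"\b([A-Z][a-z]?)-doped (\w+)")
    claims = [Claim(host=m.group(2), dopant=m.group(1))
              for s in evidence for m in pattern.finditer(s)]
    claims.append(Claim(host="TiO2", dopant="Fe"))  # simulated hallucination
    return claims


# Invented sample text, for illustration only.
SAMPLE = ("Mn-doped ZnO films were grown at 500 C. "
          "The undoped samples were insulating. "
          "Al-doped ZnO showed higher conductivity.")

verified = run_agent(SAMPLE, "doped", noisy_propose)
```

Here the unsupported TiO2/Fe claim is filtered out because no retrieved sentence mentions both strings, which is the role a verification tool plays in limiting hallucinated entries.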

Case Studies and Performance

The agent's effectiveness is demonstrated through case studies of varying complexity, from simple named entity recognition to extracting material properties from lengthy research papers. Across the three benchmark tasks (solid-state impurity doping, MOF chemical formulas, and property relations), the zero-shot agent performs better than or comparably to state-of-the-art fine-tuned extraction methods. For tasks such as identifying the water stability of metal-organic frameworks (MOFs) from research papers, the agent not only finds the MOFs mentioned but also determines their stability according to predefined criteria, which improves the dependability of the resulting dataset.
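As a rough illustration of the final step in such a workflow, the sketch below turns extracted MOF records into CSV text after checking each label against a fixed set of criteria. Everything here is an assumption made for illustration: the `MOFRecord` fields, the `ALLOWED_LABELS` set, the `records_to_csv` helper, and the placeholder rows are hypothetical and are not taken from the paper or from the Eunomia code base.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import List

# Assumed label set; the actual criteria are defined in the paper.
ALLOWED_LABELS = {"Stable", "Unstable", "Not provided"}
FIELDS = ["mof_name", "water_stability", "evidence"]


@dataclass
class MOFRecord:
    """One hypothetical dataset row: a MOF, its label, and the supporting text."""
    mof_name: str
    water_stability: str
    evidence: str


def records_to_csv(records: List[MOFRecord]) -> str:
    """Validate labels, then serialize the records as CSV text with a header row."""
    for record in records:
        if record.water_stability not in ALLOWED_LABELS:
            raise ValueError(
                f"unknown label {record.water_stability!r} for {record.mof_name}"
            )
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder rows, not real extraction results.
example_records = [
    MOFRecord("MOF-A", "Stable", "retained crystallinity after water exposure"),
    MOFRecord("MOF-B", "Not provided", ""),
]
csv_text = records_to_csv(example_records)
```

Rejecting labels outside the allowed set keeps the output consistent, and storing the supporting sentence next to each label lets a human reviewer audit the agent's decision.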

Advantages and Future Directions

Eunomia sets itself apart from other methods through its ease of use. Domain experts can instruct the agent in natural language, with no need for extensive programming skills. Keeping a human in the loop offers a way to improve transparency and reduce the 'black box' character of the AI. Moreover, the agent's zero-shot capability (its ability to perform without task-specific training) suggests considerable potential across a broad range of domain-specific tasks. This adaptability and ease of use could change how databases are built from scholarly texts. Future work could explore how varying the prompts affects the agent's performance, to gain deeper insight into its zero-shot capabilities.

The tools and methods used in this paper are shared as open-source software, reflecting a commitment to collaborative advancement and the wider application of AI in scientific research.

Authors (2)
  1. Mehrad Ansari (8 papers)
  2. Seyed Mohamad Moosavi (6 papers)
Citations (7)