
ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining

Published 20 Feb 2024 in cs.IR, cs.AI, cs.LG, and q-bio.QM | (2402.12993v2)

Abstract: The development of AI-assisted chemical synthesis tools requires comprehensive datasets covering diverse reaction types, yet current high-throughput experimental (HTE) approaches are expensive and limited in scope. Chemical literature represents a vast, underexplored data source containing thousands of reactions published annually. However, extracting reaction information from literature faces significant challenges including varied writing styles, complex coreference relationships, and multimodal information presentation. This paper proposes ChemMiner, a novel end-to-end framework leveraging multiple agents powered by LLMs to extract high-fidelity chemical data from literature. ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation. Furthermore, we developed a comprehensive benchmark with expert-annotated chemical literature to evaluate both extraction efficiency and precision. Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores. Our open-sourced benchmark facilitates future research in chemical literature data mining.
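The abstract reports recall and F1 against expert annotations but does not say how extracted reactions are matched to the annotated ones. As a rough illustration only, the following sketch scores a set of extracted reaction records against a gold set, assuming exact-match comparison. The record shape `(reactants, product, conditions)`, the function name `precision_recall_f1`, and the sample data are hypothetical and are not taken from ChemMiner or its benchmark.

```python
from typing import Iterable, Tuple

# Hypothetical record shape: (reactants, product, conditions) as plain strings.
Reaction = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Iterable[Reaction], gold: Iterable[Reaction]
) -> Tuple[float, float, float]:
    """Score extracted reactions against annotations using exact match.

    Assumption: a prediction counts as correct only if the whole record
    is identical to an annotated record. A real benchmark may use a
    looser, field-level or chemistry-aware matching rule.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example data, for illustration only.
gold = [
    ("A + B", "C", "Pd, 80 C"),
    ("D + E", "F", "Ni, 60 C"),
    ("G + H", "I", "Cu, rt"),
    ("J + K", "L", "Pd, 100 C"),
]
predicted = [
    ("A + B", "C", "Pd, 80 C"),   # matches an annotation
    ("D + E", "F", "Ni, 60 C"),   # matches an annotation
    ("X + Y", "Z", "Fe, 40 C"),   # not in the annotations
]

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of three extracted records match and two of four annotated records are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.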
