ChemReasoner: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback (2402.10980v5)
Abstract: The discovery of new catalysts is essential for the design of new and more efficient chemical processes in order to transition to a sustainable future. We introduce an AI-guided computational screening framework unifying linguistic reasoning with quantum-chemistry based feedback from 3D atomistic representations. Our approach formulates catalyst discovery as an uncertain environment where an agent actively searches for highly effective catalysts via the iterative combination of LLM-derived hypotheses and atomistic graph neural network (GNN)-derived feedback. Identified catalysts in intermediate search steps undergo structural evaluation based on spatial orientation, reaction pathways, and stability. Scoring functions based on adsorption energies and reaction energy barriers steer the exploration in the LLM's knowledge space toward energetically favorable, high-efficiency catalysts. We introduce planning methods that automatically guide the exploration without human input, providing competitive performance against expert-enumerated chemical descriptor-based implementations. By integrating language-guided reasoning with computational chemistry feedback, our work pioneers AI-accelerated, trustworthy catalyst discovery.
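The search loop described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: beam search stands in for the planning method, a lookup table (`TOY_LLM`) stands in for the LLM, a table of made-up adsorption energies (`TOY_ENERGIES`) stands in for the GNN, and the reward is simply the mean negated adsorption energy. All names and numbers here are hypothetical and serve only to show how LLM-derived hypotheses and energy-based feedback could alternate.

```python
from dataclasses import dataclass

# Made-up adsorption energies in eV (illustrative only, NOT from the paper).
TOY_ENERGIES = {
    "Cu": -0.4, "Zn": -0.2, "Pt": -1.1, "Pd": -0.9,
    "Ni": -0.7, "CuZn": -1.3, "PtNi": -1.0,
}

# Hypothetical stand-in for the LLM: a prompt modifier maps to candidates.
TOY_LLM = {
    "": ["Cu", "Zn"],
    "noble metal": ["Pt", "Pd"],
    "low cost": ["Ni", "Cu"],
    "bimetallic": ["CuZn", "PtNi"],
}


@dataclass(frozen=True)
class Node:
    modifiers: tuple   # prompt modifiers applied so far
    catalysts: tuple   # candidates proposed for this prompt
    reward: float      # feedback score for those candidates


def query_llm(modifiers):
    """Stub LLM call: return candidates for the most recent modifier."""
    key = modifiers[-1] if modifiers else ""
    return tuple(TOY_LLM[key])


def reward_fn(catalysts):
    """Stub GNN feedback: mean negated adsorption energy (assumed objective)."""
    return sum(-TOY_ENERGIES[c] for c in catalysts) / len(catalysts)


def expand(node):
    """Create one child per modifier that has not been used on this path."""
    children = []
    for modifier in TOY_LLM:
        if modifier and modifier not in node.modifiers:
            mods = node.modifiers + (modifier,)
            cats = query_llm(mods)
            children.append(Node(mods, cats, reward_fn(cats)))
    return children


def beam_search(beam_width=2, depth=2):
    """Keep the `beam_width` best prompts per level; return the best node seen."""
    root_cats = query_llm(())
    root = Node((), root_cats, reward_fn(root_cats))
    beam, best = [root], root
    for _ in range(depth):
        candidates = [child for node in beam for child in expand(node)]
        if not candidates:
            break
        candidates.sort(key=lambda n: n.reward, reverse=True)
        beam = candidates[:beam_width]
        if beam[0].reward > best.reward:
            best = beam[0]
    return best


if __name__ == "__main__":
    result = beam_search()
    print(result.modifiers, result.catalysts, round(result.reward, 3))
```

In a real system the two stubs would be replaced by an actual LLM query and an atomistic model evaluating relaxed adsorbate-surface structures; the loop structure (propose, score, prune, refine the prompt) is the only part this sketch is meant to convey.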