Large language models (LLMs) are rapidly transforming research in materials science and chemistry by enabling new capabilities across the scientific lifecycle, from initial idea generation to data analysis and communication. Based on the outcomes of the second LLM Hackathon for Applications in Materials Science and Chemistry, this paper (Zimmermann et al., 5 May 2025) reviews 34 projects demonstrating practical applications of LLMs in these fields. These projects span seven key areas, showcasing LLMs as versatile tools for prediction, design, automation, education, data management, hypothesis generation, and knowledge extraction.
The projects collectively form a "constellation of capabilities" applicable throughout the research process.
The seven key areas explored are:
- Molecular and Material Property Prediction: LLMs are used to forecast properties, demonstrating particular effectiveness in low-data scenarios and in integrating diverse data types. An exemplar project fine-tuned Llama 3 models on textual descriptions enriched with orbital-based bonding analysis from the LobsterPy package, alongside structural descriptions from Robocrystallographer. The goal was to predict the highest-frequency phonon density of states peak for crystalline materials; including bonding information improved performance (MAE of 38 cm⁻¹) over structural data alone (MAE of 44 cm⁻¹). The use of tools like Unsloth highlighted how accessible fine-tuning on domain data has become, while architectural considerations, such as adapting encoder-decoder models for regression, were noted for future optimization.
- Molecular and Material Design: LLMs assist in generating and optimizing novel materials. A project focused on designing low-band-gap Metal-Organic Frameworks (MOFs) using a chemistry-informed ReAct agent. This agent, powered by GPT-4 and built with LangChain, combines Retrieval-Augmented Generation (RAG) to extract design guidelines from scientific papers, a surrogate MOFormer model ensemble fine-tuned on QMOF data for band gap prediction and uncertainty estimation, and an RDKit-based chemical feasibility evaluator. The agent iteratively suggests MOF candidates by modifying SMILES strings based on retrieved guidelines, validating them chemically, and predicting their band gap, with a self-correction loop triggered by invalid or higher-band-gap suggestions. This demonstrates a practical closed-loop design workflow. Another project explored smaller LLMs (such as Llama 2) for designing sustainable concrete formulations, highlighting the potential of more resource-efficient models.
- Automation and Novel Interfaces: LLMs enable natural language interfaces and automated workflows for complex scientific tasks. The LangSim project prototyped an LLM-based interface using LangChain to control atomistic simulations (via pyiron and the MACE force field) through natural language. This lets users without programming expertise initiate simulations, calculate properties, and even attempt inverse materials design, such as finding alloy compositions for a target bulk modulus. LLMicroscopilot demonstrated an agent prototype assisting scanning transmission electron microscope operation. This system used an LLM agent to interpret user commands and interact with the API of a microscope-experiment simulation tool, aiming to make complex microscopy tasks more accessible and reduce reliance on expert operators. Future work involves integrating real hardware control tools and using RAG for parameter estimation.
- Scientific Communication and Education: LLMs can enhance academic writing and educational tools. The MaSTeA (Materials Science Teaching Assistant) project developed an interactive Streamlit web app to evaluate various LLMs (including Llama 3, Claude, and GPT-4 variants) on the MaScQA dataset of materials science questions. Evaluation across question types (multiple-choice, matching, and numerical) and topics revealed that larger models like Claude 3 Opus and GPT-4 generally outperformed smaller models, though significant room for improvement remains, especially in numerical reasoning. The interactive interface exposes step-by-step model reasoning, offering a practical tool for students to learn from and to identify LLM limitations. Strategies like RAG and self-consistency were suggested to improve performance.
- Research Data Management and Automation: LLMs streamline handling, organization, and processing of scientific data. The yeLLowhaMMer project created a multimodal (text and image input) LLM-based agent capable of interacting with an electronic lab notebook/laboratory information management system (datalab ELN/LIMS) via its API. The agent can interpret instructions to query, summarize, or add data entries, even from images of handwritten notes, by iteratively writing and executing Python code. This highlights how LLMs can augment traditional data management interfaces. The NOMAD Query Reporter project focused on automating the generation of scientific narratives (like method sections) from large, heterogeneous materials science databases like NOMAD. It employs a RAG approach with Llama 3 70B, feeding schema-aware data contextually via a multi-turn conversation style to generate structured summaries, demonstrating potential for automating reporting but highlighting challenges with highly diverse data.
- Hypothesis Generation and Evaluation: LLMs can assist in generating and evaluating scientific hypotheses. The "Multi-Agent Hypothesis Generation and Verification" project designed a system using multiple agents (retrieval, inspiration, hypothesis generation, evaluation) for sustainable concrete design. Leveraging RAG to access relevant abstracts and employing a Tree-of-Thoughts structure, the system generated and evaluated hypotheses based on feasibility, utility, and novelty using an LLM-as-a-judge framework. This prototype demonstrated the potential of structured AI systems to accelerate the early stages of scientific inquiry.
- Knowledge Extraction and Reasoning: LLMs excel at extracting structured information and performing reasoning from scientific literature. The ActiveScience framework uses an ontology-driven RAG approach to ingest scientific articles (via ArXiv API) into a Neo4j knowledge graph using GPT-3.5 Turbo to extract triples. It enables natural language queries on the graph via LangChain's GraphCypherQAChain, mitigating LLM hallucination by grounding responses in extracted knowledge. GlossaGen automated glossary creation from academic articles (PDF/TeX) using LLMs (GPT-3.5/4 Turbo) with techniques like Typed Predictors and Chain-of-Thought prompting, presenting results as both glossaries and knowledge graphs, and providing a Gradio interface for user interaction. ChemQA introduced a multimodal QA dataset and benchmark for chemistry reasoning, showing that models (Gemini Pro, GPT-4 Turbo, Claude 3 Opus) perform best with combined text and image inputs but struggle with image-only chemistry data, highlighting the need for improved multimodal understanding in the domain.
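The text-to-property setup described under property prediction can be sketched in plain Python: a descriptive prompt is assembled from structure and bonding text, and the model's free-text answer is parsed back into a number. The field names, example values, and prompt wording below are illustrative assumptions, not the project's actual schema.

```python
import re

def build_training_example(formula, structure_text, bonding_text, peak_cm1):
    """Assemble one instruction-tuning example: descriptive text in,
    numeric phonon DOS peak (as a string) out.  All field names here
    are illustrative, not the project's actual data format."""
    prompt = (
        f"Material: {formula}\n"
        f"Structure: {structure_text}\n"
        f"Bonding: {bonding_text}\n"
        "What is the highest-frequency phonon DOS peak in cm^-1?"
    )
    return {"prompt": prompt, "completion": f"{peak_cm1:.0f}"}

def parse_prediction(completion):
    """Recover a float from the model's generated text, tolerating units."""
    match = re.search(r"-?\d+(?:\.\d+)?", completion)
    return float(match.group()) if match else None

example = build_training_example(
    "NaCl", "rock-salt structure, Fm-3m",
    "ionic Na-Cl bonding, illustrative ICOHP -0.5 eV", 248.0)
print(parse_prediction("about 248 cm^-1"))  # → 248.0
```

Treating regression as text generation like this is what makes the fine-tuning accessible, at the cost of needing a robust numeric parser on the output side.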
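The self-correction loop in the MOF design agent can be sketched as below; `propose`, `is_valid`, and `predict_gap` are hypothetical stand-ins for the LLM suggestion step, the RDKit feasibility check, and the MOFormer surrogate ensemble, and the toy SMILES and gap values exist only for demonstration.

```python
def design_loop(propose, is_valid, predict_gap, target_gap, max_iters=10):
    """Sketch of the agent's self-correction loop: propose a candidate,
    reject chemically invalid ones with feedback, and keep iterating
    until the predicted band gap drops below the target."""
    best = None
    feedback = None
    for _ in range(max_iters):
        smiles = propose(feedback)
        if not is_valid(smiles):
            feedback = f"{smiles} is chemically invalid; revise it."
            continue
        gap = predict_gap(smiles)
        if best is None or gap < best[1]:
            best = (smiles, gap)
        if gap <= target_gap:
            break
        feedback = f"{smiles} has band gap {gap:.2f} eV; propose a lower-gap variant."
    return best

# Toy stand-ins for demonstration only.
candidates = iter(["C1=CC=CC=C1", "bad-smiles", "C1=CC=C2C=CC=CC2=C1"])
gaps = {"C1=CC=CC=C1": 3.1, "C1=CC=C2C=CC=CC2=C1": 1.8}
best = design_loop(
    propose=lambda fb: next(candidates),
    is_valid=lambda s: s in gaps,
    predict_gap=gaps.__getitem__,
    target_gap=2.0,
)
print(best)  # → ('C1=CC=C2C=CC=CC2=C1', 1.8)
```

The feedback string passed back to `propose` is the key design choice: it is what turns a bare generator into a closed-loop agent.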
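The tool-dispatch pattern behind interfaces like LangSim can be illustrated with a minimal registry: the LLM emits a structured tool call, and the agent layer maps it to a simulation function. The registry, the `bulk_modulus` tool, and its returned values are all hypothetical placeholders; the actual prototype uses LangChain tool-calling with pyiron and MACE.

```python
# Hypothetical tool registry standing in for LangChain tool-calling.
TOOLS = {}

def tool(fn):
    """Register a function so the agent loop can dispatch to it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def bulk_modulus(element: str) -> str:
    # Stand-in for an equation-of-state workflow; values are placeholders.
    fake_db = {"Cu": 140.0, "Al": 76.0}
    return f"{element}: {fake_db[element]} GPa"

def run_tool_call(call: dict) -> str:
    """Execute one structured tool call as an LLM would emit it."""
    return TOOLS[call["name"]](**call["arguments"])

print(run_tool_call({"name": "bulk_modulus", "arguments": {"element": "Cu"}}))
# → Cu: 140.0 GPa
```

Keeping the tools as plain named functions is what lets a non-programmer drive them: the LLM only has to produce the name and arguments, never the simulation code itself.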
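The per-question-type accuracy that a MaSTeA-style evaluation reports reduces to a simple aggregation; the record layout below (question type, predicted answer, reference answer) is an assumption for illustration, not the project's format.

```python
from collections import defaultdict

def score_by_type(records):
    """Aggregate accuracy per question type from
    (question_type, predicted, reference) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, pred, ref in records:
        total[qtype] += 1
        if pred.strip().lower() == ref.strip().lower():
            correct[qtype] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy records; real evaluations need tolerant numeric matching for NUM.
records = [
    ("MCQ", "B", "B"), ("MCQ", "c", "C"),
    ("NUM", "3.14", "3.15"), ("MATCH", "A-2", "A-2"),
]
print(score_by_type(records))  # → {'MCQ': 1.0, 'NUM': 0.0, 'MATCH': 1.0}
```

Exact string matching is deliberately naive here: as the toy NUM record shows, numerical answers need a tolerance-based comparison, which is one reason numerical reasoning is the hardest category to grade as well as to answer.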
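Feeding a large, schema-aware database record to a model over several turns, as the NOMAD Query Reporter does, requires splitting it into context-sized pieces first. A line-based sketch follows; the character budget and line-wise splitting strategy are illustrative assumptions, not the project's implementation.

```python
def chunk_for_context(record_lines, max_chars=200):
    """Split a large record into pieces small enough to feed to the
    model over successive conversation turns, without breaking lines."""
    chunks, current, size = [], [], 0
    for line in record_lines:
        if size + len(line) > max_chars and current:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

# Toy record: ten short key-value lines.
lines = [f"field_{i}: value" for i in range(10)]
chunks = chunk_for_context(lines, max_chars=60)
print(len(chunks))  # → 3
```

Each chunk then becomes one user turn ("here is part N of the record"), with the summarization request issued only after the final chunk, so the model sees the whole record without any single turn exceeding the context budget.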
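The LLM-as-a-judge step in the hypothesis pipeline can be sketched as multi-criterion scoring against a threshold; the 1–10 scale, the threshold value, and the toy rubric standing in for the judging LLM are all assumptions for illustration.

```python
def judge_hypothesis(hypothesis, score_fn,
                     criteria=("feasibility", "utility", "novelty"),
                     threshold=5):
    """Score one hypothesis on each criterion (score_fn stands in for
    an LLM judge returning 1-10) and accept it only if every
    criterion clears the threshold."""
    scores = {c: score_fn(hypothesis, c) for c in criteria}
    return {"scores": scores, "accepted": min(scores.values()) >= threshold}

# Deterministic toy judge standing in for an actual LLM call.
rubric = {"feasibility": 8, "utility": 7, "novelty": 4}
result = judge_hypothesis("replace 20% cement with fly ash",
                          lambda h, c: rubric[c])
print(result["accepted"])  # → False
```

Requiring every criterion to pass, rather than averaging, is one plausible policy: it prevents a highly feasible but unoriginal hypothesis from slipping through on a strong mean score.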
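The triple-store pattern underlying ActiveScience can be illustrated in memory. The real system extracts triples with GPT-3.5 Turbo and stores them in Neo4j, queried via Cypher; this `TripleStore` class and its example triples are a toy stand-in for that pipeline.

```python
class TripleStore:
    """In-memory stand-in for a knowledge graph of extracted triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        """Match triples against an optional pattern, Cypher-style:
        any position left as None acts as a wildcard."""
        return [
            t for t in sorted(self.triples)
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)
        ]

kg = TripleStore()
kg.add("MOF-5", "has_band_gap", "3.4 eV")
kg.add("MOF-5", "contains_metal", "Zn")
print(kg.query(subject="MOF-5", relation="contains_metal"))
# → [('MOF-5', 'contains_metal', 'Zn')]
```

Grounding answers in retrieved triples like these, rather than in the model's parametric memory, is the hallucination-mitigation mechanism the project relies on.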
The hackathon itself served as an effective framework for rapid prototyping and for exploring these diverse applications, fostering a global, interdisciplinary community. While the projects demonstrate significant promise and leverage recent advances in LLM capabilities, practical challenges remain. These include concerns about reproducibility with proprietary models, the substantial computational resources required for training and inference on large models, and the need for continued refinement to improve reliability and interpretability and to handle complex, heterogeneous scientific data effectively. The findings suggest that integrating LLMs into scientific workflows requires ongoing research and collaborative effort to fully realize their potential for accelerating discovery.