Fine-Tuned Language Models Generate Stable Inorganic Materials as Text (2402.04379v1)

Published 6 Feb 2024 in cs.LG and cond-mat.mtrl-sci

Abstract: We propose fine-tuning LLMs for generation of stable materials. While unorthodox, fine-tuning LLMs on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49% vs 28%) of CDVAE, a competing diffusion model. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable material, infilling of partial structures and text-conditional generation. Finally, we show that LLMs' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data.

Summary

  • The paper demonstrates that fine-tuning LLaMA-2 on text-encoded crystallographic data generates stable inorganic materials with strong physical plausibility.
  • It employs parameter-efficient fine-tuning, with around 90% of sampled structures satisfying physical constraints and a metastability rate of 49%, roughly double that of the CDVAE diffusion baseline (28%).
  • Empirical evaluation using ML potentials and DFT validates material stability, highlighting the potential for AI-assisted discovery in materials science.

Introduction

LLMs have garnered significant attention for their versatile capabilities in handling many types of data. Their generality and sample efficiency make them potent tools for scientific problems where data are scarce and complex, as in materials science. Building on these strengths, we demonstrate that fine-tuning on text-encoded crystallographic data enables an LLM to generate stable inorganic material structures.

Methodology

Our methodology centers on the base LLM, LLaMA-2, fine-tuned in a parameter-efficient manner on crystallographic data encoded as plain text. The encoding turns lattice parameters and atom positions into strings, providing a convenient way to interact with materials data through text prompts. These prompts let the model perform a range of generative tasks, from unconditional generation of new materials to conditional generation based on desired properties and infilling of partial structures.
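To make the encoding concrete, below is a minimal sketch of the kind of string representation involved; the exact precision, ordering, and separators used in the paper may differ.

```python
# Illustrative sketch of serializing a crystal as plain text for LLM fine-tuning.
# The formatting details (decimal precision, ordering, separators) are assumptions
# for illustration and may differ from the paper's encoding.

def encode_crystal(lengths, angles, species, frac_coords):
    """Serialize lattice parameters and atom sites into a prompt-friendly string."""
    lines = []
    lines.append(" ".join(f"{x:.1f}" for x in lengths))   # a b c (Angstroms)
    lines.append(" ".join(f"{x:.0f}" for x in angles))    # alpha beta gamma (degrees)
    for sym, (x, y, z) in zip(species, frac_coords):
        lines.append(sym)                                 # element symbol
        lines.append(f"{x:.2f} {y:.2f} {z:.2f}")          # fractional coordinates
    return "\n".join(lines)

# Example: rock-salt NaCl primitive-cell fragment
print(encode_crystal(
    lengths=(5.6, 5.6, 5.6),
    angles=(90, 90, 90),
    species=["Na", "Cl"],
    frac_coords=[(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)],
))
```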

Our fine-tuned models capture key symmetries of crystal structures, an aspect critical to the physical plausibility of generated materials, and around 90% of sampled structures satisfy physical constraints on atom positions and charges. Compared with CDVAE, a state-of-the-art diffusion model, our best-performing model (fine-tuned LLaMA-2 70B) roughly doubles the rate of generating materials predicted to be metastable, 49% versus 28%.
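As a rough illustration of what such constraints amount to, the sketch below checks two simple conditions: no two atoms closer than a small cutoff (ignoring periodic images for brevity) and the existence of a charge-neutral assignment of oxidation states. The cutoff value and the exact validity criteria used in the paper may differ.

```python
# Hedged sketch of basic validity checks on generated structures; thresholds
# and procedure are illustrative assumptions, not the paper's exact pipeline.
import itertools
import numpy as np

def structurally_valid(cart_coords, min_dist=0.5):
    """Reject structures with any pair of atoms closer than min_dist (Angstroms)."""
    coords = np.asarray(cart_coords, dtype=float)
    for a, b in itertools.combinations(coords, 2):
        if np.linalg.norm(a - b) < min_dist:
            return False
    return True

def charge_neutral(counts, oxidation_states):
    """Return True if some combination of allowed oxidation states sums to zero."""
    elements = list(counts)
    for combo in itertools.product(*(oxidation_states[e] for e in elements)):
        if sum(q * counts[e] for q, e in zip(combo, elements)) == 0:
            return True
    return False

print(structurally_valid([[0.0, 0.0, 0.0], [2.8, 0.0, 0.0]]))           # True
print(charge_neutral({"Ti": 1, "O": 2}, {"Ti": [2, 3, 4], "O": [-2]}))  # True (Ti4+, 2x O2-)
```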

Empirical Evaluation

Our evaluation framework includes both learned ML potentials and density functional theory (DFT) calculations, the gold standard in materials science. Success is measured by the energy above the convex hull, where lower values indicate greater stability. The results are twofold: a large fraction of generated materials are predicted to be metastable by ML potentials, and follow-up DFT calculations corroborate that the method produces genuinely stable structures.
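For context, the energy above the hull measures how far a structure's energy per atom lies above the lowest-energy combination of competing phases at the same composition; zero means the structure sits on the hull. The sketch below uses pymatgen to make this concrete; the toy reference entries and the 0.1 eV/atom metastability threshold are illustrative assumptions rather than the paper's exact pipeline.

```python
# Hedged sketch of an energy-above-hull check with pymatgen.
from pymatgen.analysis.phase_diagram import PDEntry, PhaseDiagram
from pymatgen.core import Composition

# Toy reference entries; in practice these come from a database such as the
# Materials Project, with total energies in eV.
entries = [
    PDEntry(Composition("Na"), -1.3),
    PDEntry(Composition("Cl"), -1.8),
    PDEntry(Composition("NaCl"), -7.0),
]
phase_diagram = PhaseDiagram(entries)

# The candidate energy would come from an ML potential or a DFT relaxation.
candidate = PDEntry(Composition("NaCl"), -6.9)
e_hull = phase_diagram.get_e_above_hull(candidate)
print(f"E_hull = {e_hull:.3f} eV/atom, metastable: {e_hull < 0.1}")
```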

Furthermore, increasing the model size yields an observable improvement in the model's ability to capture the translational symmetry inherent to crystal structures, suggesting that larger pretrained LLMs more readily internalize the geometric regularities of atomistic data.
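Concretely, a crystal is unchanged if every fractional coordinate is shifted by the same vector modulo one, so two text encodings related by such a shift describe the same material, and a symmetry-aware model should treat them consistently. The snippet below sketches that transformation; the paper's actual evaluation protocol may differ.

```python
# Sketch of translational invariance in fractional coordinates: shifting all
# sites by one common vector (mod 1) yields an equivalent structure whose text
# encoding differs only superficially.
import numpy as np

rng = np.random.default_rng(0)
frac_coords = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])

shift = rng.random(3)                      # one random translation for the whole cell
translated = (frac_coords + shift) % 1.0   # same crystal, different encoding

print(frac_coords)
print(translated)
```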

Applications and Future Work

The implications of this research extend to broadening the horizons of generative modeling within the materials science domain. The model's text prompt flexibility offers a clear path toward generating materials with customizable characteristics and optimizing existing structures.
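As a purely hypothetical illustration of what property-conditioned prompting might look like (the exact prompt wording and the set of conditioning properties used in the paper may differ):

```python
# Hypothetical conditioning prompt; wording and conditioned properties are
# assumptions for illustration, not the paper's exact prompt format.
prompt = (
    "Below is a description of a bulk material. "
    "The chemical formula is LiFePO4. "
    "Generate a description of the lengths and angles of the lattice vectors, "
    "and then the element type and coordinates for each atom in the lattice:\n"
)
```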

Looking forward, conditioning on scientific literature, refining conditional generation techniques, and exploring sample-efficient design strategies are promising avenues for further work. Moreover, because results are sensitive to the numerical precision used in prompts and to the tokenization strategy, future research is needed to address these sensitivities and to reduce the remaining fraction of unphysical generated structures.

By showing that LLMs can generate not only well-formed text but physically viable crystal structures, this work stands to significantly streamline materials discovery and development, moving us toward AI-assisted scientific discovery.