Crystal Structure Generation with Autoregressive Large Language Modeling (2307.04340v3)

Published 10 Jul 2023 in cond-mat.mtrl-sci

Abstract: The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. Quickly generating and predicting inorganic crystal structures is important for the discovery of new materials, which can target applications such as energy or electronic devices. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. The integration with predictors of formation energy permits the use of a Monte Carlo Tree Search algorithm to improve the generation of meaningful structures. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective 'world models' of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.

Citations (27)

Summary

  • The paper demonstrates CrystaLLM’s ability to generate over 90% valid inorganic crystal structures using autoregressive language modeling on CIF data.
  • It integrates Monte Carlo Tree Search with energy predictors to steer the sampling towards low-energy, stable crystal structures.
  • The model outperforms methods like CDVAE and DiffCSP, underscoring its potential for accelerating high-throughput materials discovery.

Crystal Structure Generation with Autoregressive Language Modeling

The paper presents an approach to crystal structure generation that leverages autoregressive LLMs. The method, termed CrystaLLM, trains language models on the Crystallographic Information File (CIF) format to generate plausible inorganic crystal structures, addressing a pivotal challenge in materials science: the computational expense of crystal structure prediction.

CrystaLLM is trained on a corpus of millions of CIF files encapsulating the structural information of inorganic compounds. The model is distinct in that it treats crystal structures as sequences of text, enabling standard language-modeling techniques to generate new structures. This departs from conventional methods that rely on direct numeric or graph-based representations of crystals.
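The text-as-sequence idea can be made concrete with a minimal sketch: encode a CIF file as a token sequence and build the shifted (context, next-token) pairs used for autoregressive training. The CIF snippet, character-level tokenizer, and block size below are illustrative assumptions; the actual CrystaLLM uses a custom CIF-aware tokenizer and a GPT-style network.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# treat a CIF file as a token stream for next-token prediction.

cif_text = """data_NaCl
_cell_length_a 5.64
_cell_length_b 5.64
_cell_length_c 5.64
Na 0.0 0.0 0.0
Cl 0.5 0.5 0.5
"""

# Character-level vocabulary built from the training text.
vocab = sorted(set(cif_text))
stoi = {ch: i for i, ch in enumerate(vocab)}

# Encode the file as integer tokens.
tokens = [stoi[ch] for ch in cif_text]

# Autoregressive training pairs: predict token t+1 from tokens <= t.
block_size = 16
pairs = [
    (tokens[i : i + block_size], tokens[i + 1 : i + block_size + 1])
    for i in range(len(tokens) - block_size)
]
print(len(vocab), len(pairs))
```

Each target sequence is the input shifted by one position, which is what lets a single forward pass supervise every token in the block.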

Results and Analysis

Experiments with CrystaLLM show that it generates physically plausible crystal structures consistent with specified chemical compositions. The model can extrapolate from its training data to generate structures for compounds not included in the training set, and more than 90% of generated CIFs were valid in the reported experiments.
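One ingredient of such a validity rate is checking that a generated structure actually matches the requested composition. The helper below is a hypothetical sketch of that single check, on toy atom-site lines; the paper's full validation also covers sensible cell parameters and space-group consistency, which are omitted here.

```python
# Hypothetical composition check for a generated CIF (sketch only).
from collections import Counter

def composition_of(cif_lines):
    """Count species from toy atom-site lines of the form 'El x y z'."""
    counts = Counter()
    for line in cif_lines:
        parts = line.split()
        if len(parts) == 4 and parts[0][0].isalpha():
            counts[parts[0]] += 1
    return counts

generated = ["Na 0.0 0.0 0.0", "Cl 0.5 0.5 0.5"]
target = Counter({"Na": 1, "Cl": 1})
print(composition_of(generated) == target)
```

In practice a CIF parser (e.g. from a crystallography library) would extract the sites; the point is only that composition agreement is a cheap, automatable filter on generated text.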

One of the paper's highlights is the integration of Monte Carlo Tree Search (MCTS) with formation-energy predictors, which steers sampling toward meaningful crystal structures. This guidance yields low-energy, stable structures and improves on unguided sampling.
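The search loop can be sketched as standard MCTS with a UCB selection rule, where the reward comes from an energy predictor (lower predicted formation energy means higher reward). Everything below is an assumed, simplified structure, with a random stand-in for the learned predictor, not the paper's exact algorithm.

```python
import math
import random

random.seed(0)

def mock_energy(sequence):
    # Stand-in for a learned formation-energy predictor (assumption).
    return random.uniform(-3.0, 0.0)

class Node:
    def __init__(self, tokens):
        self.tokens = tokens
        self.children = []
        self.visits = 0
        self.value = 0.0  # running mean reward

def ucb(child, parent_visits, c=1.4):
    # Unvisited children are explored first.
    if child.visits == 0:
        return float("inf")
    return child.value + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts_step(root, candidate_tokens):
    # Expansion: one child per candidate next token.
    if not root.children:
        root.children = [Node(root.tokens + [t]) for t in candidate_tokens]
    # Selection: child maximizing the UCB score.
    child = max(root.children, key=lambda ch: ucb(ch, root.visits + 1))
    # Simulation: score the (partial) sequence; negate so lower energy wins.
    reward = -mock_energy(child.tokens)
    # Backpropagation: update running mean and visit counts.
    child.visits += 1
    child.value += (reward - child.value) / child.visits
    root.visits += 1
    return child

root = Node(tokens=[])
for _ in range(50):
    mcts_step(root, candidate_tokens=[0, 1, 2])
best = max(root.children, key=lambda ch: ch.visits)
print(best.visits)
```

The design point is that the language model proposes candidates while the energy predictor provides a cheap reward signal, so the search concentrates samples on branches the predictor rates as stable.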

The evaluation against contemporary ML-based approaches such as CDVAE and DiffCSP underscores CrystaLLM's competitive performance. In several benchmark cases, CrystaLLM achieved superior match rates and lower root mean square error (RMSE) for predicted structures. CrystaLLM also offers distinctive advantages, such as the ability to condition generation on a target space group and the potential to be fine-tuned for material property prediction tasks.
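The RMSE metric in such comparisons reduces, once predicted and reference sites are matched, to a root mean square deviation over coordinates. The toy numbers below illustrate that computation only; real evaluations (e.g. with a structure matcher) first align sites under symmetry and periodic boundary conditions.

```python
import math

# Toy fractional coordinates for two matched sites (illustrative data).
predicted = [(0.00, 0.00, 0.00), (0.49, 0.51, 0.50)]
reference = [(0.00, 0.00, 0.00), (0.50, 0.50, 0.50)]

def coord_rmse(a, b):
    """RMSE over component-wise differences of matched sites."""
    sq = [
        (pa - pb) ** 2
        for site_a, site_b in zip(a, b)
        for pa, pb in zip(site_a, site_b)
    ]
    return math.sqrt(sum(sq) / len(sq))

print(round(coord_rmse(predicted, reference), 4))
```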

Implications and Future Directions

The successful demonstration of CrystaLLM opens several avenues for future research and practical application. The model's capacity to accurately generate unseen structures implies utility in high-throughput materials discovery, potentially accelerating the identification of new materials for various industrial applications.

The integration of MCTS for improving structure generation quality indicates a promising direction for future implementations. Moreover, fine-tuning CrystaLLM for specific property-prediction tasks, leveraging established machine learning practices in materials informatics, could further extend its applicability.

The robustness of CrystaLLM in generating a diverse array of plausible structures has substantial implications for theoretical models of crystal chemistry, suggesting that autoregressive models can learn and represent complex structural relationships in materials science.

Overall, CrystaLLM emerges as a vital tool that could significantly reduce the computational bottlenecks in crystal structure prediction. This work lays a strong foundation for further exploration into the use of LLMs in materials science, paving the way for enhanced methodologies in both generative and predictive modeling in the field of crystal chemistry.