LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions
Abstract: The prediction of crystal properties plays a crucial role in the crystal design process. Current methods for predicting crystal properties focus on modeling crystal structures using graph neural networks (GNNs). Although GNNs are powerful, accurately modeling the complex interactions between atoms and molecules within a crystal remains a challenge. Surprisingly, predicting crystal properties from crystal text descriptions is understudied, despite the rich information and expressiveness that text data offer. One of the main reasons is the lack of publicly available data for this task. In this paper, we develop and make public a benchmark dataset (called TextEdge) that contains text descriptions of crystal structures with their properties. We then propose LLM-Prop, a method that leverages the general-purpose learning capabilities of LLMs to predict the physical and electronic properties of crystals from their text descriptions. LLM-Prop outperforms the current state-of-the-art GNN-based crystal property predictor by about 4% in predicting band gap, 3% in classifying whether the band gap is direct or indirect, and 66% in predicting unit cell volume. LLM-Prop also outperforms a finetuned MatBERT, a domain-specific pre-trained BERT model, despite having 3 times fewer parameters. Our empirical results may highlight the current inability of GNNs to capture information pertaining to space group symmetry and Wyckoff sites for accurate crystal property prediction.
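The core pipeline the abstract describes — mapping a crystal's text description to a numeric property value — can be sketched in miniature. The sketch below is purely illustrative and is not the paper's method: LLM-Prop fine-tunes a pretrained large language model, whereas here a toy bag-of-words linear regressor stands in so the shape of the text-to-property task is runnable with no dependencies. The vocabulary, example descriptions, and approximate band-gap targets (Si ≈ 1.1 eV, NaCl ≈ 8.5 eV) are assumptions chosen for the demo.

```python
# Illustrative stand-in for text-based crystal property prediction.
# NOT the LLM-Prop architecture (which fine-tunes a pretrained LLM);
# a bag-of-words linear regressor trained with plain SGD.
from collections import Counter


def featurize(description: str, vocab: list[str]) -> list[float]:
    """Bag-of-words counts over a fixed vocabulary."""
    counts = Counter(description.lower().split())
    return [float(counts[word]) for word in vocab]


def train(texts, targets, vocab, lr=0.01, epochs=500):
    """Fit weights and bias by stochastic gradient descent on squared error."""
    w = [0.0] * len(vocab)
    b = 0.0
    feats = [featurize(t, vocab) for t in texts]
    for _ in range(epochs):
        for x, y in zip(feats, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(description, vocab, w, b):
    """Predict a property value (e.g., band gap in eV) from a description."""
    x = featurize(description, vocab)
    return sum(wi * xi for wi, xi in zip(w, x)) + b


if __name__ == "__main__":
    vocab = ["diamond", "cubic", "rock", "salt"]
    texts = ["Si crystallizes in the diamond cubic structure",
             "NaCl crystallizes in the rock salt structure"]
    targets = [1.1, 8.5]  # approximate band gaps in eV (illustrative)
    w, b = train(texts, targets, vocab)
    print(round(predict(texts[0], vocab, w, b), 2))
```

The design point the toy makes explicit is that the property predictor consumes only text; swapping the featurizer for a fine-tuned LLM encoder (as LLM-Prop does) changes the representation, not the task formulation.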