Quantum Many-Body Physics Calculations with Large Language Models (2403.03154v2)
Abstract: Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method, which requires an analytic multi-step calculation to derive the approximate Hamiltonian and the corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it correctly derives the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in 2 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, the requisite skill for doing these calculations is at the graduate level in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases. The strong performance is the first step toward developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.
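To make the central calculation concrete, the sketch below shows the generic Hartree-Fock decoupling that underlies each paper-specific derivation. This is the standard textbook form (cf. standard condensed matter field theory references): the index convention and the symbols $h$, $V_{\alpha\beta\gamma\delta}$, and $\rho$ are illustrative choices, not the notation of any particular paper in the evaluation set.

```latex
% Generic Hartree-Fock decoupling (textbook form; notation illustrative only).
% h_{\alpha\gamma} is the one-body (kinetic) part; V_{\alpha\beta\gamma\delta} = <\alpha\beta|V|\gamma\delta>.
\begin{align}
% Two-body interaction in second quantization:
H_{\mathrm{int}} &= \frac{1}{2}\sum_{\alpha\beta\gamma\delta}
    V_{\alpha\beta\gamma\delta}\,
    c^{\dagger}_{\alpha} c^{\dagger}_{\beta} c_{\delta} c_{\gamma} \\
% Mean-field replacement of the quartic term (Hartree + Fock contributions):
H_{\mathrm{HF}} &= \sum_{\alpha\gamma}
    \Big[\, h_{\alpha\gamma}
    + \sum_{\beta\delta}\big(V_{\alpha\beta\gamma\delta}
    - V_{\alpha\beta\delta\gamma}\big)\,\rho_{\delta\beta} \Big]
    c^{\dagger}_{\alpha} c_{\gamma} + \mathrm{const.} \\
% Self-consistency: the density matrix is evaluated in the ground state of H_HF itself.
\rho_{\delta\beta} &= \big\langle c^{\dagger}_{\beta} c_{\delta} \big\rangle_{H_{\mathrm{HF}}}
\end{align}
```

Inside the bracket, the term with $V_{\alpha\beta\gamma\delta}$ is the Hartree (direct) contribution and the term with $V_{\alpha\beta\delta\gamma}$ is the Fock (exchange) contribution; iterating the last line to convergence yields the self-consistent Hamiltonian that the prompt templates ask the model to derive step by step.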
- Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33, 1877–1901 (2020).
- Shazeer, N. et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
- Anil, R. et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
- Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Srivastava, A. et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023). URL https://openreview.net/forum?id=uyTL5Bvosj.
- Lewkowycz, A. et al. Solving quantitative reasoning problems with language models. In Oh, A. H., Agarwal, A., Belgrave, D. & Cho, K. (eds.) Advances in Neural Information Processing Systems (2022). URL https://openreview.net/forum?id=IFXTZERXdM7.
- Examining the potential and pitfalls of ChatGPT in science and engineering problem-solving. arXiv preprint arXiv:2310.08773 (2023).
- Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- AI-assisted coding: Experiments with GPT-4. arXiv preprint arXiv:2304.13187 (2023).
- Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). URL https://doi.org/10.1038/s41586-023-06291-2.
- Nori, H. et al. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023).
- Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332 (2023).
- Romera-Paredes, B. et al. Mathematical discoveries from program search with large language models. Nature 1–3 (2023).
- Kaplan, J. et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
- Hoffmann, J. et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
- Achiam, J. et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Gemini Team et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
- Hugging Face. LMSys Chatbot Arena Leaderboard. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard (2024).
- Hendrycks, D. et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
- Katz, D. M. et al. GPT-4 passes the bar exam. Available at SSRN 4389233 (2023).
- We focus solely on LLMs available via model APIs, not on LLMs or foundation models trained or tuned on domain-specific data.
- The impact of large language models on scientific discovery: a preliminary study using GPT-4. arXiv preprint arXiv:2311.07361 (2023).
- Lála, J. et al. PaperQA: Retrieval-augmented generative agent for scientific research. arXiv preprint arXiv:2312.07559 (2023).
- Hendrycks, D. et al. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021).
- Cobbe, K. et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
- Do large language models understand chemistry? A conversation with ChatGPT. Journal of Chemical Information and Modeling 63, 1649–1655 (2023).
- White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digital Discovery 2, 368–376 (2023).
- ChemCrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376 (2023).
- Lu, P. et al. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842 (2023).
- Shen, Y. et al. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580 (2023).
- Trinh, T. H. et al. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).
- Taylor, R. et al. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022).
- Altland, A. & Simons, B. D. Condensed matter field theory (Cambridge University Press, 2010).
- Over 6456 papers posted to the cond-mat arXiv preprint server over the last decade mention Hartree-Fock in their abstracts.
- See supplemental information for the paper.
- Topological phases in AB-stacked MoTe2/WSe2: Z2 topological insulators, Chern insulators, and topological charge density waves. Physical Review Letters 129, 056804 (2022). eprint 2111.01152.
- Evaluations were carried out using the checkpoints ‘gpt-4’ and ‘gpt-4-0613’ referenced at https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo. At the time the experiments in the paper were performed, ‘gpt-4’ pointed to ‘gpt-4-0613’. The abstract-to-execution experiment was performed using GPT-4 queried via the web interface; a minimal sketch of a templated API query is given after this reference list.
- Competing magnetic states in transition metal dichalcogenide moiré materials. Physical Review B 104, 214403 (2021). eprint 2108.02159.
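As a concrete illustration of the templated, multi-step queries described in the abstract and in the checkpoint note above, the sketch below shows how a single calculation step might be issued against the ‘gpt-4-0613’ checkpoint through the OpenAI chat completions API. The template wording, the placeholder names (`system`, `degrees_of_freedom`), the helper `run_step`, and the example inputs are hypothetical simplifications for illustration, not the paper's actual prompt templates; the temperature setting is likewise an assumption.

```python
# Minimal sketch of one templated Hartree-Fock prompt step (illustrative only).
# The template text and placeholders are hypothetical, not the paper's templates.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

STEP_TEMPLATE = (
    "You are a condensed-matter theorist carrying out a Hartree-Fock calculation.\n"
    "System: {system}\n"
    "Degrees of freedom: {degrees_of_freedom}\n"
    "Task: write down the second-quantized kinetic term of the Hamiltonian, "
    "defining every symbol you introduce."
)

def run_step(system: str, degrees_of_freedom: str, model: str = "gpt-4-0613") -> str:
    """Fill the placeholders for one calculation step and query the model."""
    prompt = STEP_TEMPLATE.format(system=system, degrees_of_freedom=degrees_of_freedom)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs as reproducible as possible for step-wise scoring
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Example inputs are illustrative; in practice they would be extracted from a paper.
    print(run_step("AB-stacked MoTe2/WSe2 moiré bilayer", "layer and spin"))
```

In this sketch, each standardized step of the calculation would reuse the same pattern with a different task description, so that problem-specific information extracted from a paper only has to be filled into the placeholders once.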