
Are queries and keys always relevant? A case study on Transformer wave functions (2405.18874v1)

Published 29 May 2024 in cond-mat.dis-nn, cs.CL, and physics.comp-ph

Abstract: The dot-product attention mechanism, originally designed for NLP tasks, is a cornerstone of modern Transformers. It adeptly captures semantic relationships between word pairs in sentences by computing a similarity overlap between queries and keys. In this work, we explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions to approximate ground states of quantum many-body spin Hamiltonians. Specifically, we perform numerical simulations on the two-dimensional $J_1$-$J_2$ Heisenberg model, a common benchmark in the field of quantum many-body systems on the lattice. By comparing the performance of standard attention mechanisms with a simplified version that excludes queries and keys, relying solely on positions, we achieve competitive results while reducing computational cost and parameter usage. Furthermore, through the analysis of the attention maps generated by standard attention mechanisms, we show that the attention weights become effectively input-independent at the end of the optimization. We support the numerical results with analytical calculations, providing physical insight into why queries and keys should, in principle, be omitted from the attention mechanism when studying large systems. Interestingly, the same arguments can be extended to the NLP domain, in the limit of long input sentences.
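To make the comparison in the abstract concrete, here is a minimal illustrative sketch (not taken from the paper's code) contrasting standard dot-product attention, whose weights depend on the input through queries and keys, with a simplified positional variant in which a learned position-position weight matrix replaces them. All names, shapes, and the random toy data below are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, Wq, Wk, Wv):
    """Standard attention: the weight matrix depends on the input X
    through the queries Q and keys K."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N, N), input-dependent
    return A @ V

def positional_attention(X, A_logits, Wv):
    """Simplified attention: a learned (N, N) matrix of positional logits
    replaces queries and keys, so the weights are input-independent."""
    A = softmax(A_logits)                        # (N, N), fixed after training
    return A @ (X @ Wv)

# Toy example: N lattice sites with embedding dimension d (illustrative sizes).
rng = np.random.default_rng(0)
N, d = 16, 8
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A_logits = rng.normal(size=(N, N))

print(dot_product_attention(X, Wq, Wk, Wv).shape)   # (16, 8)
print(positional_attention(X, A_logits, Wv).shape)  # (16, 8)
```

In this sketch the positional variant drops the query and key projections entirely, which is the source of the reduced parameter count and computational cost mentioned in the abstract, and it corresponds to the observation that the trained attention maps become effectively input-independent.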

Authors (2)
  1. Riccardo Rende (10 papers)
  2. Luciano Loris Viteritti (8 papers)
Citations (2)