Folded Context Condensation in Path Integral Formalism for Infinite Context Transformers (2405.04620v5)
Abstract: In this work, we present a generalized formulation of the Transformer algorithm by reinterpreting its core mechanisms within the framework of Path Integral formalism. In this perspective, the attention mechanism is recast as a process that integrates all possible transition paths leading to future token states, with temporal evolution governed by the Feed-Forward Network. By systematically mapping each component of the Transformer to its counterpart in the Path Integral formulation, we obtain a more compact and efficient representation, in which the contextual information of a sequence is condensed into memory-like segments. These segments are recurrently processed across Transformer layers, enabling more effective long-term information retention. We validate the effectiveness of this approach through the Passkey retrieval task and a summarization task, demonstrating that the proposed method preserves historical information while exhibiting memory usage that scales linearly with sequence length. This contrasts with the quadratic memory growth of standard attention mechanisms. We expect that this quantum-inspired generalization of the Transformer architecture will open new avenues for enhancing both the efficiency and expressiveness of future Transformer models.
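The abstract's central claim, that contextual information can be folded into fixed-size, memory-like segments processed recurrently so that memory grows only linearly with sequence length, can be illustrated with a minimal sketch. The code below is not the paper's formulation; it assumes an Infini-attention / linear-attention-style associative memory (the matrices `M` and `z`, the ReLU feature map, and the segment sizes are illustrative choices) and only shows why carrying a condensed memory between segments avoids the quadratic key/value growth of full attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def condensed_attention(segments, Wq, Wk, Wv):
    """Process a long sequence segment by segment.

    Past context is folded into a fixed-size associative memory (M, z)
    instead of an ever-growing key/value cache, so the state carried
    between segments is constant in size and total memory grows only
    linearly with sequence length (the current segment's activations).
    Hypothetical sketch: the update rule here is linear-attention-style,
    not the paper's path-integral-derived rule.
    """
    d_k = Wk.shape[1]
    M = np.zeros((d_k, Wv.shape[1]))  # condensed key -> value memory
    z = np.zeros((d_k, 1))            # normalization statistics
    outputs = []
    for x in segments:                # x: (seg_len, d_model)
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # Local causal attention over the current segment only.
        scores = q @ k.T / np.sqrt(d_k)
        scores += np.triu(np.full(scores.shape, -1e9), k=1)
        local = softmax(scores) @ v
        # Read from the condensed memory of all previous segments.
        phi_q = np.maximum(q, 0.0) + 1e-6          # simple positive feature map
        mem = (phi_q @ M) / (phi_q @ z + 1e-6)
        outputs.append(local + mem)
        # Fold the current segment into the memory for future segments.
        phi_k = np.maximum(k, 0.0) + 1e-6
        M += phi_k.T @ v
        z += phi_k.sum(axis=0, keepdims=True).T
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, seg_len, n_seg = 16, 8, 4
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
    segments = [rng.standard_normal((seg_len, d_model)) for _ in range(n_seg)]
    out = condensed_attention(segments, Wq, Wk, Wv)
    print(out.shape)  # (32, 16): full-sequence output, fixed-size state between segments
```

The point of the sketch is that the carried state (`M`, `z`) has a size independent of how many segments have already been processed, which is the property the Passkey retrieval and summarization experiments probe; the paper's actual update, derived from the path-integral mapping, will differ in its details.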