How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation (2310.04064v1)

Published 6 Oct 2023 in cs.DS, cs.CC, cs.CL, cs.LG, and stat.ML

Abstract: In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations:

$\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$, then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time.

$\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory).

We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries and the order of the tensor one may use for more expressive, efficient attention computation.
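To make the objects in the abstract concrete, here is a minimal NumPy sketch contrasting classical softmax attention $D^{-1}\exp(QK^\top)V$ with a naive cubic-time triple-wise ("tensor-type") variant. The pairwise key/value construction (entrywise products of key rows) and the function names `softmax_attention` and `naive_tensor_attention` are illustrative assumptions in the spirit of the Kronecker-style computation the paper studies, not the paper's exact definition, and the fast $n^{1+o(1)}$ bounded-entry approximation is not implemented here.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Classical attention: D^{-1} exp(Q K^T) V with D = diag(exp(Q K^T) 1_n)."""
    A = np.exp(Q @ K.T)                          # n x n attention matrix
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)   # row normalizers
    return D_inv * (A @ V)                       # n x d output

def naive_tensor_attention(Q, K1, K2, V1, V2):
    """Illustrative triple-wise attention (hypothetical construction).

    The score of query i against the ordered pair (j, k) is
    <q_i, k1_j * k2_k> (entrywise product), a degree-3 / Kronecker-style
    interaction, and the value attended to for that pair is v1_j * v2_k.
    Run naively, this costs Theta(n^3 d) time: the attention matrix is n x n^2.
    """
    n, d = Q.shape
    K_pairs = (K1[:, None, :] * K2[None, :, :]).reshape(n * n, d)  # (n^2, d)
    V_pairs = (V1[:, None, :] * V2[None, :, :]).reshape(n * n, d)  # (n^2, d)
    A = np.exp(Q @ K_pairs.T)                    # n x n^2 attention matrix
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)   # normalize over all pairs (j, k)
    return D_inv * (A @ V_pairs)                 # n x d output

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4
    # Small entries, loosely mimicking the bounded-entry regime the paper studies.
    Q, K, V = 0.1 * rng.standard_normal((3, n, d))
    print(softmax_attention(Q, K, V).shape)             # (8, 4)
    print(naive_tensor_attention(Q, K, K, V, V).shape)  # (8, 4)
```

Even in this toy sketch, materializing the $n \times n^2$ attention matrix is the cubic bottleneck the abstract refers to; the paper's positive result shows how to avoid it (approximately) when all entries are $o(\sqrt[3]{\log n})$.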

Authors (2)
  1. Josh Alman (36 papers)
  2. Zhao Song (253 papers)
Citations (23)