One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space (2311.14652v2)

Published 24 Nov 2023 in cs.LG, cs.CL, and stat.ML

Abstract: Attention computation takes both $O(n^2)$ time and $O(n^2)$ space simultaneously, which makes deploying LLMs in streaming applications that involve long contexts require substantial computational resources. At the recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that can support a 128K-long document; in our paper, we focus on the memory-efficiency issue when the context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite the computation of ${\sf Attn}(Q, K, V)$ within $n^{1+o(1)}$ time. Despite this, computing the approximated attention matrix $U_1 U_2^\top \in \mathbb{R}^{n \times n}$ still necessitates $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that reads the data in only one pass, in a streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need to store $K$ and $V$ exactly. Notably, our algorithm exhibits exceptional memory efficiency with super-long tokens: as the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.
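
To make the time/space tradeoff concrete, the following is a minimal Python sketch of the general idea described above, not the paper's actual algorithm: a truncated-Taylor (polynomial) feature map turns the softmax kernel $\exp(q^\top k/\sqrt{d})$ into an inner product of features, so the (key, value) stream can be folded into two small running accumulators in a single pass, with memory independent of $n$. The feature map, the two accumulators, and the function names (`poly_features`, `streaming_attention`) are illustrative assumptions; the paper's construction uses three sketch matrices and comes with formal error guarantees that this toy version does not provide.

```python
import numpy as np


def poly_features(x: np.ndarray, degree: int = 2) -> np.ndarray:
    """Truncated-Taylor feature map: phi(q) @ phi(k) = sum_{j<=degree} (q @ k)^j / j! ~= exp(q @ k)."""
    feats = [np.ones(1)]                 # degree-0 term
    cur = np.ones(1)
    fact = 1.0
    for j in range(1, degree + 1):
        cur = np.outer(cur, x).ravel()   # flattened j-fold tensor power of x
        fact *= j
        feats.append(cur / np.sqrt(fact))
    return np.concatenate(feats)


def streaming_attention(queries: np.ndarray, kv_stream, degree: int = 2) -> np.ndarray:
    """One pass over a (key, value) stream; memory depends on d and degree, not on the stream length n."""
    d = queries.shape[1]
    scale = d ** 0.25                     # phi(q/scale) @ phi(k/scale) ~= exp(q @ k / sqrt(d))
    t = sum(d ** j for j in range(degree + 1))
    S = np.zeros((t, d))                  # running accumulator of sum_i phi(k_i) v_i^T
    z = np.zeros(t)                       # running accumulator of sum_i phi(k_i)  (softmax normalizer)
    for k, v in kv_stream:                # single streaming pass; K and V are never stored
        fk = poly_features(k / scale, degree)
        S += np.outer(fk, v)
        z += fk
    out = np.empty_like(queries)
    for i, q in enumerate(queries):
        fq = poly_features(q / scale, degree)
        out[i] = (fq @ S) / max(fq @ z, 1e-9)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4096, 8                        # bounded-entry regime: small dot products, long sequence
    Q = rng.normal(size=(n, d)) / np.sqrt(d)
    K = rng.normal(size=(n, d)) / np.sqrt(d)
    V = rng.normal(size=(n, d))
    approx = streaming_attention(Q, zip(K, V), degree=2)
    A = np.exp(Q @ K.T / np.sqrt(d))      # exact softmax attention materializes the full n x n matrix
    exact = (A / A.sum(axis=1, keepdims=True)) @ V
    print("mean absolute error:", np.abs(approx - exact).mean())
```

The property this sketch shares with the abstract's claim is the memory profile: the accumulators have shape $(1 + d + d^2) \times d$ and $1 + d + d^2$, independent of the stream length $n$, whereas exact softmax attention requires the $n \times n$ attention matrix.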

Authors (4)
  1. Raghav Addanki (1 paper)
  2. Chenyang Li (71 papers)
  3. Zhao Song (253 papers)
  4. Chiwun Yang (14 papers)
Citations (3)