One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space (2311.14652v2)
Abstract: Attention computation requires both $O(n^2)$ time and $O(n^2)$ space, which makes deploying LLMs in streaming applications with long contexts demand substantial computational resources. At the recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that supports 128K-long documents. In this paper, we focus on the memory-efficiency issue when the context length $n$ is much greater than 128K ($n \gg 2^d$). Consider a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$. The polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$ by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ that expedite the computation of ${\sf Attn}(Q, K, V)$ to $n^{1+o(1)}$ time. Despite this, computing the approximated attention matrix $U_1 U_2^\top \in \mathbb{R}^{n \times n}$ still requires $O(n^2)$ space, leading to significant memory usage. To address this challenge, we introduce a new algorithm that reads the data in a single streaming pass. It uses sublinear space $o(n)$ to store three sketch matrices, eliminating the need to store $K$ and $V$ exactly. Notably, our algorithm is exceptionally memory-efficient for super-long token sequences: as the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This property underscores the potential of our technique for efficiently handling LLMs in streaming applications.
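To illustrate the general flavor of this approach (not the paper's exact construction), the following is a minimal sketch of one-pass streaming attention approximation. It assumes a simple degree-2 polynomial feature map `phi` as a hypothetical stand-in for the paper's $U_1, U_2$ construction, and maintains only small aggregates whose size is independent of the token length $n$.

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map so that phi(a) . phi(b) equals
    1 + <a,b> + <a,b>^2 / 2, the second-order Taylor expansion of exp(<a,b>).
    Hypothetical stand-in for the paper's U_1, U_2 construction."""
    x = np.asarray(x, dtype=np.float64)
    quad = np.outer(x, x).reshape(-1) / np.sqrt(2.0)
    return np.concatenate(([1.0], x, quad))        # dimension t = 1 + d + d^2

def stream_attention(Q, key_value_stream, d):
    """One pass over (k_j, v_j) pairs; only O(t * d) sketch storage is used,
    never the full K or V matrices."""
    t = 1 + d + d * d
    S = np.zeros((t, d))      # sketch of sum_j phi(k_j) v_j^T
    z = np.zeros(t)           # sketch of sum_j phi(k_j)
    for k_j, v_j in key_value_stream:              # single streaming pass
        f = phi(k_j / d ** 0.25)                   # split the 1/sqrt(d) scaling over q and k
        S += np.outer(f, v_j)
        z += f
    out = np.zeros((Q.shape[0], d))
    for i, q_i in enumerate(Q):                    # answer all query rows afterwards
        g = phi(q_i / d ** 0.25)
        out[i] = (g @ S) / (g @ z)                 # approximate softmax attention row
    return out

# Example: compare against exact softmax attention on small random data.
n, d = 256, 8
rng = np.random.default_rng(0)
Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
approx = stream_attention(Q, ((K[j], V[j]) for j in range(n)), d)
A = np.exp(Q @ K.T / np.sqrt(d))
exact = (A / A.sum(axis=1, keepdims=True)) @ V
print("max abs error:", np.abs(approx - exact).max())
```

In this sketch the stored state occupies $O(t \cdot d)$ space independent of $n$, mirroring the abstract's claim that memory stays nearly constant as $n$ grows; the paper's actual three sketch matrices and error guarantees differ from this simplified stand-in.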
Authors: Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang