Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs (2403.08845v2)

Published 13 Mar 2024 in cs.LG and cs.AI

Abstract: This study introduces bifurcated attention, a method designed to enhance LLM inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths. Bifurcated attention achieves this by strategically dividing the attention mechanism during incremental decoding into two separate GEMM operations: one focusing on the KV cache from prefill, and another on the decoding process itself. While maintaining the computational load (FLOPs) of standard attention mechanisms, bifurcated attention ensures precise computation with significantly reduced memory IO. Our empirical results show over 2.1× speedup when sampling 16 output sequences and more than 6.2× speedup when sampling 32 sequences at context lengths exceeding 8k tokens on a 7B model that uses multi-head attention. The efficiency gains from bifurcated attention translate into lower latency, making it particularly suitable for real-time applications. For instance, it enables massively parallel answer generation without substantially increasing latency, thus enhancing performance when integrated with post-processing techniques such as re-ranking.
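To make the two-GEMM split concrete, below is a minimal single-head, single-decoding-step sketch of the bifurcated attention idea in NumPy. It is an illustration of the mechanism described in the abstract, not the authors' implementation: all names (bifurcated_attention, k_prefix, k_dec, etc.) are hypothetical, and details such as multi-head layout, causal masking, and fused/streaming softmax are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bifurcated_attention(q, k_prefix, v_prefix, k_dec, v_dec):
    """Illustrative sketch (not the paper's code) of attention over a shared prefix.

    q        : (n, d)        one query per sampled sequence at this decoding step
    k_prefix : (m_p, d)      shared prefill keys, stored once for all n sequences
    v_prefix : (m_p, d)      shared prefill values
    k_dec    : (n, m_d, d)   per-sequence keys generated during decoding
    v_dec    : (n, m_d, d)   per-sequence values generated during decoding
    """
    d = q.shape[-1]
    # GEMM 1: every query attends to the single shared prefix KV cache.
    scores_prefix = q @ k_prefix.T / np.sqrt(d)                    # (n, m_p)
    # GEMM 2: each query attends to its own decoded KV cache.
    scores_dec = np.einsum('nd,nmd->nm', q, k_dec) / np.sqrt(d)    # (n, m_d)
    # Softmax over the concatenated context, then split the weighted sum the same way.
    w = softmax(np.concatenate([scores_prefix, scores_dec], axis=-1))
    w_p, w_d = w[:, :k_prefix.shape[0]], w[:, k_prefix.shape[0]:]
    return w_p @ v_prefix + np.einsum('nm,nmd->nd', w_d, v_dec)    # (n, d)

# Toy usage: 32 sampled continuations sharing an 8k-token prefix, head dim 128.
n, m_p, m_d, d = 32, 8192, 5, 128
rng = np.random.default_rng(0)
out = bifurcated_attention(
    rng.standard_normal((n, d)),
    rng.standard_normal((m_p, d)), rng.standard_normal((m_p, d)),
    rng.standard_normal((n, m_d, d)), rng.standard_normal((n, m_d, d)),
)
assert out.shape == (n, d)
```

The point of the split is that k_prefix and v_prefix are read from memory once per decoding step regardless of how many sequences are sampled, whereas naive batched attention would replicate or re-read the shared prefix cache for each of the n sequences, making memory IO the dominant cost at large batch sizes and long contexts.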
