You Only Cache Once: Decoder-Decoder Architectures for Language Models (2405.05254v2)

Published 8 May 2024 in cs.CL

Abstract: We introduce a decoder-decoder architecture, YOCO, for LLMs, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available at https://aka.ms/YOCO.

Authors (9)
  1. Yutao Sun (18 papers)
  2. Li Dong (154 papers)
  3. Yi Zhu (233 papers)
  4. Shaohan Huang (79 papers)
  5. Wenhui Wang (47 papers)
  6. Shuming Ma (83 papers)
  7. Quanlu Zhang (14 papers)
  8. Jianyong Wang (38 papers)
  9. Furu Wei (291 papers)
Citations (30)

Summary

Exploring YOCO: A Novel Approach to Efficient Language Modeling

Introduction to YOCO

The quest for more efficient and powerful LLMs has driven significant architectural innovation in recent years. Among these efforts, YOCO (You Only Cache Once) takes a distinctive approach to managing memory usage and inference speed. YOCO employs a decoder-decoder architecture that caches key-value (KV) pairs only once for the entire model, unlike conventional decoder-only Transformers, which maintain a separate KV cache in every layer.

Architecture of YOCO

YOCO splits its architecture into two main components: the self-decoder and the cross-decoder:

  • Self-decoder: The first half of the stack processes the input sequence and builds the global KV cache. It uses efficient self-attention variants (such as sliding-window attention or gated retention) whose own cache stays constant in size, keeping memory demands modest even for very long inputs.
  • Cross-decoder: The second half of the stack attends to that shared, precomputed cache via cross-attention. Because every cross-decoder layer reuses the same KV pairs, no additional per-layer caches need to be stored or recomputed.

This split not only improves efficiency but also preserves the behavior of a decoder-only model: the stack remains causal end to end, so it supports autoregressive generation in the usual token-by-token fashion. A minimal sketch of the layout is given below.
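To make the data flow concrete, here is a minimal, illustrative PyTorch sketch of the decoder-decoder layout. It is not the authors' implementation: the class and method names (SelfDecoderLayer, CrossDecoderLayer, YOCO, prefill) are hypothetical, plain causal attention stands in for the paper's efficient self-attention (sliding-window attention or gated retention), and details such as grouped-query attention, RMSNorm, and rotary embeddings are omitted.

```python
import torch
import torch.nn as nn


class SelfDecoderLayer(nn.Module):
    """Causal self-attention block. The paper uses efficient variants
    (sliding-window attention or gated retention) so this cache stays
    constant-size; plain causal attention is used here for brevity."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, attn_mask):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


class CrossDecoderLayer(nn.Module):
    """Cross-attention block: queries come from the running hidden states,
    keys/values come from the single shared global KV cache."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, kv, attn_mask=None):
        h = self.ln1(x)
        x = x + self.attn(h, kv, kv, attn_mask=attn_mask, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


class YOCO(nn.Module):
    """Decoder-decoder stack: L/2 self-decoder layers followed by
    L/2 cross-decoder layers that all reuse one global KV cache."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 12):
        super().__init__()
        half = n_layers // 2
        self.self_decoder = nn.ModuleList([SelfDecoderLayer(d_model, n_heads) for _ in range(half)])
        self.cross_decoder = nn.ModuleList([CrossDecoderLayer(d_model, n_heads) for _ in range(half)])
        self.to_kv = nn.Linear(d_model, d_model)  # the global KV cache is produced ONCE

    def forward(self, x):
        # x: (batch, seq_len, d_model); mask entries set to True are disallowed.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        for layer in self.self_decoder:
            x = layer(x, causal)
        kv = self.to_kv(x)                     # cached once, shared by every cross-decoder layer
        for layer in self.cross_decoder:
            x = layer(x, kv, causal)
        return x

    @torch.no_grad()
    def prefill(self, prompt):
        # "Early exit" during prefilling: only the self-decoder runs over the whole
        # prompt to build the global KV cache; just the last position is pushed
        # through the cross-decoder to predict the first generated token.
        T = prompt.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=prompt.device), diagonal=1)
        x = prompt
        for layer in self.self_decoder:
            x = layer(x, causal)
        kv = self.to_kv(x)
        last = x[:, -1:, :]
        for layer in self.cross_decoder:
            last = layer(last, kv)            # the last query may attend to all cached positions
        return last, kv
```

Running `YOCO()(torch.randn(1, 16, 512))` exercises the full path; the point to notice is that `kv` is materialized a single time and every cross-decoder layer reads from it, which is the property the name "You Only Cache Once" refers to.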

Efficiency Gains

The innovative structure of YOCO allows it to boost efficiency across several fronts:

  1. Memory Usage: Because the global KV pairs are cached once and shared by all cross-decoder layers, the KV-cache footprint shrinks roughly in proportion to the number of layers compared with a standard Transformer, which keeps a separate cache in every layer (see the back-of-the-envelope estimate after this list).
  2. Prefilling Speed: For long prompts, the computation flow allows the prefill stage to exit early: only the self-decoder has to process the full prompt before decoding can start. At a 512K-token context, this cuts prefill latency from roughly 180 seconds for a Transformer optimized with Flash-Decoding and kernel fusion to under 6 seconds.
  3. Throughput and Serving Capacity: With a smaller cache and faster prefill, YOCO can serve larger batch sizes and longer sequences on the same hardware, improving end-to-end throughput.
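As a rough illustration of point 1, the snippet below compares the KV-cache footprint of a conventional per-layer cache with YOCO's single global cache. The configuration (32 layers, 8 KV heads, head dimension 128, fp16, 512K-token context) is a hypothetical example, not a setup reported in the paper.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: keys + values for every cached layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30


L, H_KV, D, N = 32, 8, 128, 512_000   # hypothetical model config, 512K-token context

per_layer_cache = kv_cache_gib(L, H_KV, D, N)  # standard Transformer: one cache per layer
global_cache = kv_cache_gib(1, H_KV, D, N)     # YOCO: the global cache is stored once
                                               # (plus a small constant-size self-decoder cache)

print(f"Transformer KV cache: {per_layer_cache:.1f} GiB")  # ~62.5 GiB
print(f"YOCO global KV cache: {global_cache:.1f} GiB")     # ~2.0 GiB, roughly an L-fold reduction
```

The early-exit behaviour from point 2 corresponds to the `prefill` method in the architecture sketch above: the cross-decoder never processes the full prompt, which is where the reported drop from about 180 seconds to under 6 seconds at a 512K context comes from.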

Empirical Performance

In evaluations, YOCO performed comparably to Transformer baselines and scaled favorably with both model size and the number of training tokens. It was also extended to a context length of 1 million tokens, achieving near-perfect needle-in-a-haystack retrieval accuracy, a setting that remains challenging for many current models.

Theoretical and Practical Implications

The YOCO architecture offers a compelling alternative for building LLMs in scenarios where memory and inference speed are the bottlenecks. It fits the growing need for cost-effective models in practical applications, from real-time language understanding to more complex multimodal tasks where latency matters.

Speculations on Future Developments

The deployment of YOCO in settings beyond pure text-based models, such as in tasks involving multi-modal data (images, text, and audio), seems a promising avenue. Additionally, the underlying principles of YOCO could inspire further research into even more memory-efficient designs or specialized hardware implementations that could leverage its unique caching strategy.

As AI research continues to push the boundaries of what's possible with machine learning models, YOCO stands out as a valuable step towards more sustainable and scalable AI technologies. Its innovations in model architecture offer a glimpse into the future directions of AI systems, highlighting an ongoing shift towards optimization and efficiency without compromising on performance.
