Retentive Network: A Successor to Transformer for LLMs
The paper "Retentive Network: A Successor to Transformer for LLMs" introduces Retentive Network (RetNet), a novel architecture designed to address the limitations of the Transformer architecture in terms of training parallelism, inference cost, and performance, particularly for LLMs. RetNet is presented as a direct contender to Transformer, offering significant improvements in efficiency and scalability while maintaining, and sometimes exceeding, the performance of its predecessor.
Key Contributions
- Retention Mechanism: RetNet introduces a retention mechanism that supports three distinct computation paradigms: parallel, recurrent, and chunkwise recurrent representations. This flexibility allows the architecture to leverage parallelism for efficient training while utilizing recurrent mechanisms for inference, thereby reducing memory and computational costs.
- Theoretical Foundation: The paper establishes a theoretical connection between recurrence and attention, paving the way for the retention mechanism, which combines the strengths of recurrent neural networks (RNNs) and attention to get the best of both worlds; the dual forms are written out explicitly after this list.
- Three Computation Paradigms (concrete sketches of each form follow this list):
- Parallel Representation: Used during training so that, like standard attention, the whole sequence is processed at once and GPU parallelism is fully exploited.
- Recurrent Representation: Maintains a fixed-size state that is updated once per token, giving O(1) per-step inference cost in time and memory and thereby reducing inference cost significantly.
- Chunkwise Recurrent Representation: Processes each chunk in parallel while passing a recurrent state between chunks, enabling efficient long-sequence modeling with complexity linear in sequence length.
- Experimental Validation: Extensive experiments demonstrate that RetNet achieves favorable scaling results, efficient training parallelism, and low-cost inference. The results indicate that RetNet is a strong competitor to Transformer in terms of both performance and efficiency.
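For reference, the core retention formulas from the paper can be written in a simplified single-head form. The version below omits the relative-position rotation the paper applies to the queries and keys as well as the per-head decay schedule, keeping only a scalar decay γ:

```latex
% Simplified: the paper additionally applies an xPos-style rotation to Q and K
% and uses a different decay gamma per head.

% Parallel form: a causally masked, decay-weighted analogue of attention.
\mathrm{Retention}(X) = \left(QK^\top \odot D\right)V,
\qquad
D_{nm} =
\begin{cases}
\gamma^{\,n-m}, & n \ge m \\
0, & n < m
\end{cases}

% Recurrent form: a fixed-size state updated once per token.
S_n = \gamma S_{n-1} + K_n^\top V_n,
\qquad
\mathrm{Retention}(X_n) = Q_n S_n
```

Unrolling the recurrence shows that the two forms compute the same outputs, which is exactly what lets RetNet train in the parallel form and decode in the recurrent one.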
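As a concrete illustration of the parallel/recurrent duality, here is a minimal single-head sketch in NumPy. It assumes a scalar decay `gamma` and omits the multi-scale decay across heads, the query/key rotation, group normalization, and output gating used in the paper; the function and variable names are illustrative, not taken from the released code.

```python
import numpy as np

def parallel_retention(q, k, v, gamma):
    """Parallel form: (Q K^T elementwise-times D) V, with D[n, m] = gamma**(n - m) for n >= m, else 0."""
    n = np.arange(q.shape[0])
    decay = np.where(n[:, None] >= n[None, :], gamma ** (n[:, None] - n[None, :]), 0.0)
    return (q @ k.T * decay) @ v

def recurrent_retention(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, output o_n = q_n S_n."""
    state = np.zeros((k.shape[1], v.shape[1]))       # fixed-size state, independent of sequence length
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = gamma * state + np.outer(k_n, v_n)   # O(1) work and memory per token
        outputs.append(q_n @ state)
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d_k, d_v, gamma = 16, 8, 8, 0.9
    q = rng.standard_normal((seq_len, d_k))
    k = rng.standard_normal((seq_len, d_k))
    v = rng.standard_normal((seq_len, d_v))
    # Both forms produce the same outputs, up to floating-point error.
    assert np.allclose(parallel_retention(q, k, v, gamma),
                       recurrent_retention(q, k, v, gamma))
```

The recurrent function carries only a d_k × d_v state between steps, which is the source of the O(1) per-token inference cost noted above.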
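The chunkwise recurrent form can be sketched in the same style: tokens inside a chunk are handled with the parallel form, while a single carried state summarizes all earlier chunks. Again, this is a simplified, hypothetical illustration (it assumes the sequence length is a multiple of `chunk_size`), not the authors' implementation.

```python
import numpy as np

def chunkwise_retention(q, k, v, gamma, chunk_size):
    """Chunkwise recurrent form: parallel inside each chunk, recurrent state across chunks."""
    seq_len, d_k, d_v = q.shape[0], k.shape[1], v.shape[1]
    j = np.arange(chunk_size)
    # Causal decay matrix for positions inside a chunk.
    decay = np.where(j[:, None] >= j[None, :], gamma ** (j[:, None] - j[None, :]), 0.0)
    state = np.zeros((d_k, d_v))                 # summary of all previous chunks
    outputs = []
    for start in range(0, seq_len, chunk_size):
        q_c = q[start:start + chunk_size]
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        # Intra-chunk: ordinary parallel retention restricted to this chunk.
        inner = (q_c @ k_c.T * decay) @ v_c
        # Cross-chunk: read the carried state, decayed by each position's offset in the chunk.
        cross = (gamma ** (j + 1))[:, None] * (q_c @ state)
        outputs.append(inner + cross)
        # Carry the state forward: decay it over the chunk length, add a decay-weighted chunk summary.
        state = (gamma ** chunk_size) * state + k_c.T @ ((gamma ** (chunk_size - 1 - j))[:, None] * v_c)
    return np.concatenate(outputs)
```

For a length-L sequence split into chunks of size B, each chunk costs roughly O(B²) for the intra-chunk part plus a fixed-size state update, so total cost grows linearly in L rather than quadratically as in the fully parallel form; the output matches the other two forms up to floating-point error.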
Numerical Results and Claims
RetNet shows a substantial reduction in GPU memory usage during inference, saving 70% of memory compared to Transformer with key-value (KV) caches when processing sequences of 8k tokens. Additionally, RetNet achieves an 8.4× improvement in decoding speed. During training, RetNet demonstrates a 25-50% reduction in memory consumption and a 7× boost in speed compared to standard Transformer models. Moreover, even when compared with FlashAttention-optimized Transformers, RetNet exhibits competitive or superior throughput and memory efficiency.
Implications and Future Directions
The implications of this research are significant both theoretically and practically:
- Theoretical Implications:
The dual-form representation reinforces the connection between recurrent models and attention mechanisms. This alignment could inspire future advances in hybrid architectures that leverage these principles to further optimize performance and efficiency.
- Practical Implications:
RetNet’s efficient training and inference paradigms make it highly suitable for deployment in real-world applications where resource constraints are a critical consideration. This could lead to more widespread adoption of LLMs in industry, particularly in scenarios requiring scalable, low-latency inference.
Speculative Outlook on AI Developments
Looking ahead, the introduction of RetNet could catalyze several developments within the AI field:
- Scalability Enhancements:
Further optimizations in RetNet could facilitate even larger models with billions to trillions of parameters, driving advancements in model capability and performance.
- Multimodal Integration:
Since RetNet retains the advantageous properties of the Transformer architecture, it is well-positioned for integration into multimodal models that process and generate data across multiple formats, including text, images, and audio.
- Edge Deployment:
The efficiency gains in RetNet could enable the deployment of powerful LLMs on edge devices, expanding the possibilities for AI applications in mobile and remote contexts.
In conclusion, the Retentive Network represents a promising advance for LLMs, combining the training parallelism and performance of Transformers with the inference efficiency of recurrent models. Its strong empirical results and substantial efficiency gains support its positioning as a successor to the Transformer and set the stage for further progress in AI technology.