Introduction to Cached Transformers
The field of AI has seen significant developments with the introduction of the Transformer model, which revolutionized fields like natural language processing and computer vision by stacking layers built on the self-attention mechanism. This architecture has been particularly effective because it allows each element, be it a word token or an image patch, to interact directly with every other element, providing a global receptive field and context-aware processing. However, this effectiveness comes with a steep computational cost that typically grows with the square of the sequence length, making long sequences expensive to process and thereby hampering the modeling of long-range dependencies. A solution has emerged that overcomes this challenge while retaining the benefits of the Transformer architecture: the Cached Transformer with a Gated Recurrent Cache (GRC).
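To make the cost concrete, here is a minimal PyTorch sketch of standard scaled dot-product self-attention (single head, no batching). The function name, shapes, and the toy sizes at the end are illustrative assumptions; the point is that the score matrix has shape (n, n) for n tokens, so doubling the sequence length roughly quadruples the attention cost.

```python
# Minimal sketch of standard scaled dot-product self-attention (single head,
# no batching), illustrating the quadratic cost: `scores` has shape (n, n).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (n, d) token embeddings; w_q, w_k, w_v: (d, d) projection weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # each (n, d)
    scores = q @ k.T / (q.shape[-1] ** 0.5)    # (n, n): quadratic in n
    attn = F.softmax(scores, dim=-1)           # every token attends to all others
    return attn @ v                            # (n, d) context-aware outputs

n, d = 512, 64
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # doubling n quadruples the size of `scores`
```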
Gated Recurrent Cache (GRC) Mechanism
The GRC mechanism is the cornerstone of the Cached Transformer: it stores historical token representations in a compact, differentiable memory cache. This gives the Transformer extended and dynamic receptive fields, allowing it to account for long-term dependencies by continuously updating the cache while retaining critical past information. The innovation hinges on a recurrent gating unit resembling those found in gated recurrent neural networks, but tailored for Transformers. The mechanism has been shown to yield substantial performance improvements across a spectrum of applications, including language modeling, machine translation, image classification, object detection, and instance segmentation.
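A minimal, hedged sketch of how such a gated recurrent cache could look in PyTorch follows. The pooling-based compression of the current tokens, the gate parameterization, and all class and function names are assumptions made for illustration; only the core idea reflects the description above: a fixed-size, differentiable memory that is blended with new token information through a GRU-like gate and then attended to alongside the current tokens.

```python
# Hedged sketch of a gated recurrent cache (GRC). The exact update rule,
# the pooling-based compression, and the gate parameterization are
# illustrative assumptions, not a definitive implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedRecurrentCache(nn.Module):
    def __init__(self, dim, cache_len):
        super().__init__()
        self.cache_len = cache_len
        self.gate_proj = nn.Linear(2 * dim, dim)              # assumed gating layer
        self.register_buffer("cache", torch.zeros(cache_len, dim))

    def update(self, tokens):
        """tokens: (n, dim). Compress to the cache length and blend into the cache."""
        # Compress the token sequence to the fixed cache length (assumption).
        compressed = F.adaptive_avg_pool1d(
            tokens.T.unsqueeze(0), self.cache_len
        ).squeeze(0).T                                        # (cache_len, dim)
        # GRU-like gate decides how much old memory to keep vs. overwrite.
        gate = torch.sigmoid(
            self.gate_proj(torch.cat([self.cache, compressed], dim=-1))
        )
        self.cache = (1.0 - gate) * self.cache + gate * compressed
        return self.cache

def attend_with_cache(tokens, cache, w_q, w_k, w_v):
    """One way to use the cache: keys/values span both the cache and current tokens."""
    kv = torch.cat([cache, tokens], dim=0)                    # (cache_len + n, dim)
    q, k, v = tokens @ w_q, kv @ w_k, kv @ w_v
    scores = q @ k.T / (q.shape[-1] ** 0.5)                   # (n, cache_len + n)
    return F.softmax(scores, dim=-1) @ v
```

The convex combination `(1 - gate) * cache + gate * compressed` mirrors a GRU's update gate, which is what keeps the cache both compact and trainable end to end.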
Versatility Across Tasks and Models
The versatility of GRC is evident from its compatibility and improved performance across diverse Transformer models and tasks. Integration with models such as Transformer-XL, ViT, PVT, Swin, Bigbird, and Reformer showcases not only the plug-and-play nature of GRC but also its broadly beneficial impact. This adaptability positions Cached Transformers as a highly promising avenue for improving Transformer efficiency and the ability to process long sequences of text or large images.
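To illustrate the plug-and-play idea, the sketch below (building on the GatedRecurrentCache sketch above) wraps a stock nn.MultiheadAttention layer so that its keys and values also span the cached memory. The wrapping strategy, the batch-summary update, and the class name are assumptions, not the integration path used by any of the models listed.

```python
# Hedged sketch of "plug-and-play" integration: wrap an off-the-shelf attention
# layer so that its keys/values also cover the GRC memory. Builds on the
# GatedRecurrentCache sketch above; the wrapping strategy is an assumption.
import torch
import torch.nn as nn

class CachedAttentionBlock(nn.Module):
    def __init__(self, attn: nn.MultiheadAttention, grc: GatedRecurrentCache):
        super().__init__()
        self.attn = attn    # any existing attention layer
        self.grc = grc      # shared gated recurrent cache

    def forward(self, x):
        """x: (seq_len, batch, dim), the default layout for nn.MultiheadAttention."""
        # Broadcast the cache over the batch and prepend it to keys/values.
        cache = self.grc.cache.unsqueeze(1).expand(-1, x.size(1), -1)
        kv = torch.cat([cache, x], dim=0)
        out, _ = self.attn(x, kv, kv)      # queries: current tokens; keys/values: cache + tokens
        self.grc.update(x.mean(dim=1))     # fold a batch summary into the cache (assumption)
        return out
```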
Enhancements and Empirical Validation
Empirically, the GRC mechanism has been validated across multiple language and vision benchmarks, reliably outperforming existing models and techniques. For example, when incorporated into vision transformers, it effectively captures instance-invariant features and boosts classification accuracy through cross-sample regularization. In language tasks, it surpasses memory-based methods and remains effective across a variety of Transformer modifications and settings. Moreover, experiments in machine translation highlight GRC's capacity to improve translation quality across different language pairs. These results collectively demonstrate the ability of GRC to enrich Transformer models, making them more adept at complex, long-range tasks without excessive computation or memory demands.
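As a rough illustration of the cross-sample effect mentioned above, the usage sketch below (continuing the earlier sketches) keeps the cache alive across training batches, so each new batch of image tokens is attended against features accumulated from earlier samples. The loop, shapes, and the detach step are assumptions for illustration only.

```python
# Hedged usage sketch: the cache persists across batches, so attention mixes in
# features accumulated from earlier samples (the cross-sample effect described
# above). Shapes, hyperparameters, and the detach step are assumptions.
grc = GatedRecurrentCache(dim=64, cache_len=32)
block = CachedAttentionBlock(nn.MultiheadAttention(embed_dim=64, num_heads=4), grc)

for step in range(3):
    patches = torch.randn(196, 8, 64)   # (tokens, batch, dim), e.g. 14x14 ViT-style patches
    out = block(patches)                # attends to memory built from *previous* batches
    grc.cache = grc.cache.detach()      # keep the cache but stop gradients across steps
```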
In conclusion, the introduction of the Cached Transformer with GRC offers a robust solution for the Transformer model's limitations, enhancing its ability to model long-term dependencies. Its compatibility with various Transformer architectures and tasks, coupled with its demonstrated performance benefits, presents a significant step forward in the ongoing evolution of deep learning models.