Cached Transformers: Improving Transformers with Differentiable Memory Cache (2312.12742v1)

Published 20 Dec 2023 in cs.CV

Abstract: This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and displays the ability to be applied to a broader range of situations.

Introduction to Cached Transformers

The field of AI has seen significant developments with the introduction of the Transformer, which reshaped language processing and computer vision by stacking layers built on the self-attention mechanism. The architecture is effective because every element, whether a word or an image patch, can interact directly with every other element, giving each layer a global receptive field and context-aware processing. That effectiveness comes at a computational cost that typically grows with the square of the sequence length, which makes attending over long contexts expensive and in practice limits how far long-range dependencies can be modeled. The Cached Transformer with a Gated Recurrent Cache (GRC) is proposed as a way to address this limitation while retaining the benefits of the Transformer architecture.
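
For reference, the quadratic cost comes from the standard scaled dot-product attention used throughout this line of work; the formulation below is the generic one from the Transformer literature, not something specific to this paper.

```latex
% Standard scaled dot-product attention for N tokens of dimension d,
% with Q, K, V \in \mathbb{R}^{N \times d}:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
% Forming the N-by-N score matrix Q K^T costs O(N^2 d) time and O(N^2) memory,
% which is what motivates keeping a compact cache of past context instead of
% attending over the full history directly.
```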

Gated Recurrent Cache (GRC) Mechanism

The GRC mechanism serves as the cornerstone of the Cached Transformer, storing historical token representations in a compact, differentiable memory cache. This extends the Transformer's receptive field dynamically, allowing it to account for long-term dependencies by continuously updating the cache while retaining critical past information. The innovation hinges on a recurrent gating unit resembling those found in gated recurrent neural networks but tailored for Transformers. The mechanism has been demonstrated to yield substantial performance improvements across a spectrum of applications, including language modeling, machine translation, image classification, object detection, and instance segmentation.
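
To make the mechanism concrete, below is a minimal PyTorch-style sketch of how a gated recurrent cache could sit alongside ordinary self-attention. It is an illustration under stated assumptions, not the authors' implementation: the class name, the separate attention over the cache, the pooled token summary used for the update, and the decision to detach the cache between segments are all choices made for brevity.

```python
import torch
import torch.nn as nn

class GRCAttention(nn.Module):
    """Minimal sketch of attention with a Gated Recurrent Cache (GRC).

    Hypothetical reconstruction for illustration; layer sizes, the gating
    parameterization, and the cache summary are assumptions, not the
    paper's reference code.
    """

    def __init__(self, dim: int, num_heads: int = 8, cache_len: int = 64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate deciding how much of the new token summary overwrites the old cache.
        self.gate_proj = nn.Linear(2 * dim, dim)
        self.cache_len = cache_len
        self.register_buffer("cache", torch.zeros(1, cache_len, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        cache = self.cache.expand(x.size(0), -1, -1)

        # Attend to the current tokens and, separately, to the cached tokens;
        # summing the two streams is an assumed fusion, the paper may differ.
        self_out, _ = self.self_attn(x, x, x)
        cache_out, _ = self.cache_attn(x, cache, cache)
        out = self_out + cache_out

        # Recurrent gated update: interpolate the old cache with a fixed-length
        # summary of the new tokens (here a simple adaptive average pool).
        new_info = nn.functional.adaptive_avg_pool1d(
            x.transpose(1, 2), self.cache_len
        ).transpose(1, 2)
        gate = torch.sigmoid(self.gate_proj(torch.cat([cache, new_info], dim=-1)))
        updated = (1.0 - gate) * cache + gate * new_info

        # Detaching between calls keeps the example simple; within a training
        # step the gated update above is fully differentiable.
        self.cache = updated.mean(dim=0, keepdim=True).detach()
        return out
```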

Versatility Across Tasks and Models

The versatility of GRC is evident from its compatibility with, and improved performance across, diverse Transformer models and tasks. Integration with models such as Transformer-XL, ViT, PVT, Swin, Bigbird, and Reformer showcases not only the plug-and-play nature of GRC but also its consistently beneficial impact. This adaptability marks Cached Transformers as a promising avenue for improving Transformers' efficiency and their ability to process long sequences and large images; a rough sketch of how such integration might look follows below.
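
As a rough picture of that plug-and-play claim, a GRC-style layer can be wired into an otherwise standard pre-norm Transformer block. The block below reuses the GRCAttention sketch from the previous section and is purely illustrative; the dimensions and layer layout are assumptions, not taken from any of the models listed above.

```python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Standard pre-norm Transformer block using the GRCAttention sketch above."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = GRCAttention(dim)  # defined in the earlier sketch
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))   # attention over current tokens plus cache
        return x + self.mlp(self.norm2(x))

tokens = torch.randn(2, 128, 256)          # (batch, sequence length, dim)
print(CachedBlock(256)(tokens).shape)      # torch.Size([2, 128, 256])
```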

Enhancements and Empirical Validation

Empirically, the GRC mechanism has been validated across multiple language and vision benchmarks, where it consistently outperforms the baselines it is compared against. For example, when incorporated into vision transformers, it captures instance-invariant features and boosts classification accuracy through cross-sample regularization. In language tasks, it surpasses earlier memory-based methods and remains effective across a variety of Transformer variants and settings. Experiments in machine translation further show that GRC improves translation models across different language pairs. Collectively, these results demonstrate that GRC enriches Transformer models, making them more adept at complex, long-range tasks without excessive computation or memory demands.

In conclusion, the Cached Transformer with GRC offers a robust solution to a key limitation of the Transformer, enhancing its ability to model long-term dependencies. Its compatibility with various Transformer architectures and tasks, coupled with its demonstrated performance benefits, presents a significant step forward in the ongoing evolution of deep learning models.

References (58)
  1. ETC: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483.
  2. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  3. Brahma, S. 2018. Improved language modeling by decoding the past. arXiv preprint arXiv:1808.05908.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
  5. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35: 11079–11091.
  6. Memory transformer. arXiv preprint arXiv:2006.11527.
  7. End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
  8. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33: 9912–9924.
  9. The IWSLT 2015 Evaluation Campaign. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, 2–14. Da Nang, Vietnam.
  10. Report on the 11th IWSLT evaluation campaign. In Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign, 2–17. Lake Tahoe, California.
  11. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  12. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
  13. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  14. Fine-Grained Classification via Categorical Memory Networks. IEEE Transactions on Image Processing, 31: 4186–4196.
  15. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  16. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
  17. Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, 2286–2296. PMLR.
  18. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426.
  19. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2961–2969.
  20. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  21. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172.
  22. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
  23. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
  24. A cache-based natural language model for speech recognition. IEEE transactions on pattern analysis and machine intelligence, 12(6): 570–583.
  25. Kupiec, J. 1989. Probabilistic models of short and long distance word dependencies in running text. In Speech and Natural Language: Proceedings of a Workshop Held at Philadelphia, Pennsylvania, February 21-23, 1989.
  26. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988.
  27. Microsoft coco: Common objects in context. In European conference on computer vision, 740–755. Springer.
  28. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.
  29. Retrieval augmented classification for long-tail visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6959–6969.
  30. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
  31. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028.
  32. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In NAACL-HLT (Demonstrations).
  33. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL.
  34. Improving language understanding by generative pre-training.
  35. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.
  36. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
  37. Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.
  38. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9: 53–68.
  39. Dynamic Token Normalization improves Vision Transformers. In International Conference on Learning Representations.
  40. Not all memories are created equal: Learning to forget by expiring. In International Conference on Machine Learning, 9902–9912. PMLR.
  41. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, 843–852.
  42. Long Range Arena: A Benchmark for Efficient Transformers. In International Conference on Learning Representations.
  43. Omninet: Omnidirectional representations from transformers. In International Conference on Machine Learning, 10193–10202. PMLR.
  44. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR.
  45. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 32–42.
  46. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6: 407–420.
  47. Attention is all you need. Advances in neural information processing systems, 30.
  48. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787.
  49. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
  50. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 568–578.
  51. PVT v2: Improved baselines with Pyramid Vision Transformer. Computational Visual Media, 1–10.
  52. Cross-batch memory for embedding learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6388–6397.
  53. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31.
  54. Memorizing transformers. arXiv preprint arXiv:2203.08913.
  55. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 558–567.
  56. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33: 17283–17297.
  57. Invariance matters: Exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 598–607.
  58. Long-short transformer: Efficient transformers for language and vision. Advances in Neural Information Processing Systems, 34.
Authors (6)
  1. Zhaoyang Zhang
  2. Wenqi Shao
  3. Yixiao Ge
  4. Xiaogang Wang
  5. Jinwei Gu
  6. Ping Luo