
Neural Attention Memory (2302.09422v2)

Published 18 Feb 2023 in cs.LG

Abstract: We propose a novel perspective on the attention mechanism by reinventing it as a memory architecture for neural networks, namely Neural Attention Memory (NAM). NAM is a memory structure that is both readable and writable via differentiable linear algebra operations. We explore three use cases of NAM: memory-augmented neural networks (MANN), few-shot learning, and efficient long-range attention. First, we design two NAM-based MANNs, Long Short-term Attention Memory (LSAM) and the NAM Turing Machine (NAM-TM), which show greater computational power on algorithmic zero-shot generalization tasks than baselines such as the differentiable neural computer (DNC). Next, we apply NAM to the N-way K-shot learning task and show that it is more effective at reducing false positives than the baseline cosine classifier. Finally, we implement an efficient Transformer with NAM and evaluate it on long-range arena tasks, showing that NAM can be an efficient and effective alternative to scaled dot-product attention.
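
The abstract describes NAM as a memory that is read and written purely through differentiable linear algebra. Since this page gives no implementation details, the sketch below illustrates the general idea with a rank-1 outer-product write and a matrix-vector read, similar in spirit to linear-attention-style key-value memories. The dimensions, normalization, and erase/decay term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

d_k, d_v = 8, 8                      # key / value dimensions (assumed)
M = np.zeros((d_v, d_k))             # memory matrix, initially empty

def write(M, k, v, erase=0.0):
    """Store value v under key k via a differentiable rank-1 update."""
    k = k / (np.linalg.norm(k) + 1e-8)       # normalize the key
    return (1.0 - erase) * M + np.outer(v, k)  # optional decay, then add v k^T

def read(M, q):
    """Read by matrix-vector product; a query close to a stored key
    retrieves (approximately) its associated value."""
    q = q / (np.linalg.norm(q) + 1e-8)
    return M @ q

# Write one key-value pair, then query with the same key.
k, v = np.random.randn(d_k), np.random.randn(d_v)
M = write(M, k, v)
print(np.allclose(read(M, k), v, atol=1e-5))   # near-exact recall of v
```

Because both operations are plain matrix algebra, gradients flow through reads and writes, which is what lets such a memory be trained end to end and reused across the MANN, few-shot, and long-range attention settings mentioned above.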

Authors (2)
  1. Hyoungwook Nam (5 papers)
  2. Seung Byum Seo (2 papers)
