Uniform Memory Retrieval with Larger Capacity for Modern Hopfield Models (2404.03827v3)
Abstract: We propose a two-stage memory retrieval dynamics for modern Hopfield models, termed $\mathtt{U\text{-}Hop}$, with enhanced memory capacity. Our key contribution is a learnable feature map $\Phi$ that transforms the Hopfield energy function into kernel space. This transformation ensures a correspondence between the local minima of the energy and the fixed points of the retrieval dynamics within the kernel space. Consequently, the kernel norm induced by $\Phi$ serves as a novel similarity measure: it utilizes the stored memory patterns as learning data to enhance memory capacity across all modern Hopfield models. Specifically, we accomplish this by constructing a separation loss $\mathcal{L}_\Phi$ that separates the local minima of the kernelized energy by separating the stored memory patterns in kernel space. Methodologically, the $\mathtt{U\text{-}Hop}$ memory retrieval process consists of: (Stage I) minimizing the separation loss for a more uniform memory (local minimum) distribution, followed by (Stage II) standard Hopfield energy minimization for memory retrieval. This significantly reduces the number of possible metastable states in the Hopfield energy function, thus enhancing memory capacity by preventing memory confusion. Empirically, on real-world datasets, we demonstrate that $\mathtt{U\text{-}Hop}$ outperforms all existing modern Hopfield models and state-of-the-art similarity measures, achieving substantial improvements in both associative memory retrieval and deep learning tasks. Code is available at https://github.com/MAGICS-LAB/UHop; future updates appear at arXiv:2404.03827.
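To make the two-stage scheme concrete, below is a minimal PyTorch sketch of the retrieval pipeline the abstract describes, not the authors' implementation (see the linked repository for that). The MLP feature map, the pairwise-cosine form of the separation loss, and all names (`FeatureMap`, `separation_loss`, `u_hop_retrieve`, `beta`, `stage1_steps`) are illustrative assumptions; Stage II uses the standard softmax-based modern Hopfield update applied to the kernelized patterns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMap(nn.Module):
    """Hypothetical learnable feature map Phi; a small MLP is an assumption,
    the paper leaves the architecture of Phi as a design choice."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def separation_loss(phi_memories: torch.Tensor) -> torch.Tensor:
    """Stand-in for L_Phi: penalize pairwise similarity of the stored patterns
    in kernel space so their induced local minima spread out more uniformly."""
    z = F.normalize(phi_memories, dim=-1)          # (M, d) unit vectors
    sim = z @ z.t()                                # pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))   # drop self-similarity terms
    return off_diag.pow(2).mean()

def u_hop_retrieve(query, memories, beta=4.0, stage1_steps=200, lr=1e-2, retrieval_steps=3):
    """Two-stage retrieval sketch: (Stage I) fit Phi by minimizing the separation
    loss, (Stage II) run softmax Hopfield retrieval updates in kernel space."""
    phi = FeatureMap(memories.shape[-1])
    opt = torch.optim.Adam(phi.parameters(), lr=lr)

    # Stage I: spread the stored memories apart in kernel space.
    for _ in range(stage1_steps):
        opt.zero_grad()
        loss = separation_loss(phi(memories))
        loss.backward()
        opt.step()

    # Stage II: standard modern Hopfield retrieval dynamics on kernelized patterns.
    with torch.no_grad():
        xi = phi(memories)                                # (M, d) kernelized memories
        z = phi(query)                                    # (B, d) kernelized queries
        for _ in range(retrieval_steps):
            attn = F.softmax(beta * z @ xi.t(), dim=-1)   # (B, M) retrieval weights
            z = attn @ xi                                 # one-step update
        # Read out the stored pattern associated with the dominant memory.
        return memories[attn.argmax(dim=-1)]

# Usage: store 10 random 32-dim patterns and retrieve from a noisy query.
if __name__ == "__main__":
    torch.manual_seed(0)
    memories = torch.randn(10, 32)
    query = memories[:1] + 0.1 * torch.randn(1, 32)
    print(u_hop_retrieve(query, memories).shape)          # torch.Size([1, 32])
```

The key design point mirrored here is the separation of concerns: Stage I touches only the stored patterns (no queries needed), so the feature map can be fit once per memory set and reused across retrievals in Stage II.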