Outlier-Efficient Hopfield Layers for Large Transformer-Based Models (2404.03828v2)
Abstract: We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathrm{OutEffHop}$) and use it to address the outlier inefficiency problem of training gigantic transformer-based models. Our main contribution is a novel associative memory model facilitating \textit{outlier-efficient} associative memory retrievals. Interestingly, this memory model manifests a model-based interpretation of an outlier-efficient attention mechanism (${\rm Softmax}_1$): it is an approximation of the memory retrieval process of $\mathrm{OutEffHop}$. Methodologically, this allows us to introduce novel outlier-efficient Hopfield layers as powerful alternatives to traditional attention mechanisms, with superior post-quantization performance. Theoretically, the Outlier-Efficient Modern Hopfield Model retains and improves the desirable properties of standard modern Hopfield models, including fixed-point convergence and exponential storage capacity. Empirically, we demonstrate the efficacy of the proposed model across large-scale transformer-based and Hopfield-based models (including BERT, OPT, ViT, and STanHop-Net), benchmarking against state-of-the-art methods like $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$. Notably, $\mathrm{OutEffHop}$ achieves average reductions of 22+\% in average kurtosis and 26+\% in the maximum infinity norm of model outputs across four models. Code is available at \href{https://github.com/MAGICS-LAB/OutEffHop}{GitHub}; models are on \href{https://huggingface.co/collections/magicslabnu/outeffhop-6610fcede8d2cda23009a98f}{Hugging Face Hub}; future updates are on \href{https://arxiv.org/abs/2404.03828}{arXiv}.
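For concreteness, the ${\rm Softmax}_1$ mechanism mentioned above replaces the standard softmax denominator $\sum_j \exp(x_j)$ with $1 + \sum_j \exp(x_j)$, so an attention head can place residual weight on an implicit zero logit and output (near-)zero attention everywhere. Below is a minimal PyTorch sketch of this operation, written for illustration only; it is not taken from the released OutEffHop codebase, and the max-shift stabilization is our own assumption.

```python
import torch

def softmax_1(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Illustrative Softmax_1: Softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Unlike the standard softmax, the outputs can sum to less than 1,
    letting an attention head assign (near-)zero weight to every position.
    """
    # Shift by a non-negative maximum for numerical stability; the implicit
    # zero logit in the denominator must be shifted by the same amount.
    shift = logits.max(dim=dim, keepdim=True).values.clamp(min=0)
    exp_shifted = torch.exp(logits - shift)
    return exp_shifted / (torch.exp(-shift) + exp_shifted.sum(dim=dim, keepdim=True))


if __name__ == "__main__":
    scores = torch.tensor([[-4.0, -5.0, -6.0], [2.0, 1.0, 0.5]])
    print(softmax_1(scores).sum(dim=-1))              # rows sum to < 1; the first is close to 0
    print(torch.softmax(scores, dim=-1).sum(dim=-1))  # standard softmax rows always sum to 1
```

In an attention layer, such a function would be dropped in where `torch.softmax` is applied to the query-key scores; per the abstract, this operation can be read as an approximation of $\mathrm{OutEffHop}$'s memory retrieval.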
- Conformal prediction for time series with modern Hopfield networks. Advances in Neural Information Processing Systems, 36, 2024. URL https://arxiv.org/abs/2303.12783.
- Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532, 2019. URL https://arxiv.org/abs/1906.00532.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.
- Understanding and overcoming the challenges of efficient transformer quantization, 2021. URL https://arxiv.org/abs/2109.12948.
- Quantizable transformers: Removing outliers by helping attention heads do nothing. arXiv preprint arXiv:2306.12929, 2023. URL https://arxiv.org/abs/2306.12929.
- Johannes Brandstetter. Blog post: Hopfield networks is all you need, 2021. URL https://ml-jku.github.io/hopfield-layers/. Accessed: April 4, 2023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. URL https://arxiv.org/abs/2005.14165.
- What does BERT look at? An analysis of BERT’s attention. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4828. URL https://aclanthology.org/W19-4828.
- On a model of associative memory with huge storage capacity. Journal of Statistical Physics, 168:288–299, 2017. URL https://arxiv.org/abs/1702.01929.
- GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30318–30332. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c3ba4962c05c49636d4c6206a97e9c8a-Paper-Conference.pdf.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL https://aclanthology.org/N19-1423.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. URL https://arxiv.org/abs/2010.11929.
- Richard M Dudley. Central limit theorems for empirical measures. The Annals of Probability, pages 899–929, 1978. URL https://projecteuclid.org/journals/annals-of-probability/volume-6/issue-6/Central-Limit-Theorems-for-Empirical-Measures/10.1214/aop/1176995384.full.
- Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pages 5793–5831. PMLR, 2022. URL https://arxiv.org/abs/2110.10090.
- GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020. URL https://link.springer.com/article/10.1007/s11023-020-09548-1.
- CLOOB: Modern Hopfield networks with InfoLOOB outperform CLIP. Advances in neural information processing systems, 35:20450–20468, 2022. URL https://arxiv.org/abs/2110.11316.
- Wiki-40B: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440–2452, 2020. URL https://aclanthology.org/2020.lrec-1.297/.
- Energy transformer. arXiv preprint arXiv:2302.07253, 2023. URL https://arxiv.org/abs/2302.07253.
- John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. URL https://www.pnas.org/doi/10.1073/pnas.79.8.2554.
- John J Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984. URL https://www.pnas.org/doi/10.1073/pnas.81.10.3088.
- Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE international solid-state circuits conference digest of technical papers (ISSCC), pages 10–14. IEEE, 2014. URL https://ieeexplore.ieee.org/document/6757323.
- On sparse modern Hopfield model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2309.12673.
- Outlier-efficient Hopfield layers for large transformer-based models. 2024a.
- Nonparametric modern Hopfield models. 2024b.
- On computational limits of modern Hopfield models: A fine-grained complexity analysis. arXiv preprint arXiv:2402.04520, 2024c.
- DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021. URL https://academic.oup.com/bioinformatics/article/37/15/2112/6128680.
- johnowhitaker. Blog post: Exploring softmax1, or “community research for the win!”, 2023. URL https://datasciencecastnet.home.blog/2023/08/04/exploring-softmax1-or-community-research-for-the-win/. Accessed: August 4, 2023.
- Marian: Cost-effective high-quality neural machine translation in C++. arXiv preprint arXiv:1805.12096, 2018. URL https://arxiv.org/abs/1805.12096.
- Attention is not only a weight: Analyzing transformers with vector norms, 2020. URL https://aclanthology.org/2020.emnlp-main.574/.
- Revealing the dark secrets of BERT, 2019. URL https://arxiv.org/abs/1908.08593.
- Building transformers from neurons and astrocytes. bioRxiv, 2022. URL https://www.pnas.org/doi/10.1073/pnas.2219150120.
- Learning multiple layers of features from tiny images. 2009. URL https://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf.
- Dense associative memory for pattern recognition. arXiv preprint arXiv:1606.01164, 2016. URL https://arxiv.org/abs/1606.01164.
- Large associative memory problem in neurobiology and machine learning. In International Conference on Learning Representations, 2021. URL https://arxiv.org/abs/2008.06996.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. URL https://ieeexplore.ieee.org/document/726791.
- Percy Liang. Cs229t/stat231: Statistical learning theory (winter 2016), 2016. URL https://web.stanford.edu/class/cs229t/notes.pdf.
- Fast neural networks without multipliers. IEEE Transactions on Neural Networks, 4(1):53–62, 1993. doi: 10.1109/72.182695. URL https://ieeexplore.ieee.org/document/182695.
- Evan Miller. Blog post: Attention is off by one, 2023. URL https://www.evanmiller.org/attention-is-off-by-one.html. Accessed: August 4, 2023.
- NIST Handbook of Mathematical Functions Hardback and CD-ROM. Cambridge University Press, 2010. URL https://www.amazon.com/Handbook-Mathematical-Functions-Hardback-CD-ROM/dp/0521192250.
- History compression via language models in reinforcement learning. In International Conference on Machine Learning, pages 17156–17185. PMLR, 2022. URL https://arxiv.org/abs/2205.12258.
- Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020. URL https://arxiv.org/abs/2008.02217.
- Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. URL https://arxiv.org/abs/1409.0575.
- Context-enriched molecule representations improve few-shot drug discovery. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XrMWUuEevr.
- Improving few- and zero-shot reaction template prediction using modern Hopfield networks. Journal of Chemical Information and Modeling, 62(9):2111–2120, 2022. URL https://pubs.acs.org/doi/10.1021/acs.jcim.1c01065.
- Robust quantization: One model to rule them all, 2020. URL https://arxiv.org/abs/2002.07686.
- On the convergence of the concave-convex procedure. In Advances in neural information processing systems, pages 1759–1767, 2009. URL https://papers.nips.cc/paper_files/paper/2009/file/8b5040a8a5baf3e0e67386c2e3a9b903-Paper.pdf.
- Multilayer feedforward neural networks with single powers-of-two weights. IEEE Transactions on Signal Processing, 41(8):2724–2727, 1993. doi: 10.1109/78.229903. URL https://ieeexplore.ieee.org/document/229903.
- Attention is all you need. Advances in neural information processing systems, 30, 2017. URL https://arxiv.org/abs/1706.03762.
- Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022. URL https://arxiv.org/abs/2209.13325.
- Modern Hopfield networks and attention for immune repertoire classification. Advances in Neural Information Processing Systems, 33:18832–18845, 2020. URL https://arxiv.org/abs/2007.13505.
- STanHop: Sparse tandem Hopfield model for memory-enhanced time series prediction. arXiv preprint arXiv:2312.17346, 2023a. URL https://arxiv.org/abs/2312.17346.
- Uniform memory retrieval with larger capacity for modern Hopfield models. 2024.
- BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023b. URL https://arxiv.org/abs/2303.17564.
- BiSHop: Bi-directional cellular learning for tabular data with generalized sparse modern Hopfield model. 2024.
- The Concave-Convex Procedure. Neural Computation, 15(4):915–936, 04 2003. URL https://doi.org/10.1162/08997660360581958.
- Q8BERT: Quantized 8bit BERT. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pages 36–39. IEEE, 2019. URL https://arxiv.org/abs/1910.06188.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. URL https://arxiv.org/abs/2205.01068.
- Tong Zhang. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023. URL https://tongzhang-ml.org/lt-book/lt-book.pdf.
- Informer: Beyond efficient transformer for long sequence time-series forecasting, 2021. URL https://arxiv.org/abs/2012.07436.
- DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023. URL https://arxiv.org/abs/2306.15006.
- Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015. URL https://arxiv.org/abs/1506.06724.