- The paper establishes a theoretical connection by showing that, under L2 normalization of queries and keys and an appropriately chosen scaling factor β, Transformer Attention's softmax operation approximates SDM's weighted summation over stored patterns.
- Empirical analyses of trained models such as GPT2 reveal learned β coefficients that naturally converge toward values reminiscent of optimal SDM configurations.
- The paper suggests that components such as Feed Forward layers and LayerNorm in Transformers may function analogously to memory storage, linking deep learning with neuroscience.
The paper under review presents an intriguing theoretical exploration that relates Transformer Attention to Kanerva's Sparse Distributed Memory (SDM), a well-established model in associative memory research. This work postulates that under certain conditions, the operations performed by the Attention mechanism in Transformers approximate those of SDM. The authors mathematically formalize this approximation and substantiate it with empirical evidence, notably using the GPT2 model.
At its core, the paper seeks to demystify the effectiveness of Transformer models by linking the Attention mechanism to a biologically inspired associative memory model. SDM, developed to solve the "Best Match Problem," stores and retrieves memories with a biologically plausible mechanism operating over high-dimensional binary vector spaces. In essence, the paper shows that the softmax over Attention scores, which gives prominence to the largest values, closely emulates the approximately exponential weighting that emerges from SDM's circle intersection calculation.
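As a minimal numerical sketch of that point (illustrative only, not the paper's code), the softmax over similarity scores is exactly a normalized exponential weighting, and increasing the scaling factor β concentrates the weights on the largest similarities:

```python
import numpy as np

def softmax(scores, beta=1.0):
    """Numerically stable softmax over a vector of similarity scores."""
    z = beta * scores
    z = z - z.max()          # stability shift; does not change the result
    w = np.exp(z)
    return w / w.sum()

# Similarity scores between a query and a few stored patterns (illustrative values).
scores = np.array([0.9, 0.5, 0.1, -0.3])

# softmax(beta * s) is proportional to exp(beta * s): an exponential weighting
# that concentrates mass on the largest similarities as beta grows.
for beta in (1.0, 5.0, 20.0):
    print(beta, softmax(scores, beta).round(3))
```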
Mathematical Insights
The authors demonstrate that with suitable modifications, specifically L2 normalization of queries and keys and an appropriate scaling factor β, the softmax operation in Attention approximates the weighted summation SDM performs at read time. This involves translating SDM's operations from a binary vector space to a continuous one while retaining their functional form. The insight is significant: it offers a theoretical account of why Attention works as well as it does, and does so in terms comparable to a model grounded in biological principles. Mapping SDM's circle intersection calculation onto L2-normalized vectors further aligns the two models conceptually.
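The following sketch illustrates the approximation under assumed toy parameters (n = 64 bits and read/write radius d = 22; neither value is taken from the paper). It computes the exact Hamming-ball intersection size as a function of query-address distance, maps that distance to the cosine similarity of the corresponding L2-normalized bipolar vectors, and fits the exponential form that softmax Attention implements:

```python
import numpy as np
from math import comb

def hamming_ball_intersection(n, d, a):
    """Exact size of the intersection of two Hamming balls of radius d whose
    centers differ in a of the n bit positions. A point in the intersection
    flips k of the a differing bits and j of the n - a agreeing bits, with
    k + j <= d (within radius of the first center) and
    (a - k) + j <= d (within radius of the second center)."""
    total = 0
    for k in range(a + 1):
        for j in range(n - a + 1):
            if k + j <= d and (a - k) + j <= d:
                total += comb(a, k) * comb(n - a, j)
    return total

n, d = 64, 22                       # toy dimension and SDM read/write radius
dists = np.arange(0, 2 * d + 1)     # intersections are empty beyond distance 2d

inter = np.array([hamming_ball_intersection(n, d, a) for a in dists], dtype=float)

# Map Hamming distance a to the cosine similarity of the corresponding
# L2-normalized bipolar vectors: cos = 1 - 2a/n.
cos = 1.0 - 2.0 * dists / n

# Fit  intersection(a) ~ c * exp(beta * cos)  by least squares in log space.
mask = inter > 0
beta, log_c = np.polyfit(cos[mask], np.log(inter[mask]), 1)

sdm_w = inter[mask] / inter[mask].sum()                      # normalized SDM read weights
attn_w = np.exp(beta * cos[mask]); attn_w /= attn_w.sum()    # softmax-style weights
print(f"fitted beta ~ {beta:.1f}; max weight difference {np.abs(sdm_w - attn_w).max():.4f}")
```

The point of the sketch is only that the normalized circle-intersection weights and the fitted exponential (softmax) weights track each other closely, which is the sense in which Attention approximates the SDM read operation.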
Empirical Validation
Empirically, the paper examines the β coefficients learned by Attention across different Transformer models and finds that they naturally converge to values reminiscent of optimal SDM configurations. Trained models, including the Query-Key Normalization architecture and the more widely used GPT2, exhibit effective β values that interpolate between the optimal SDM variants. This observation underscores a notable overlap in how the two models behave on real-world data, even though Transformers are engineered for general relational tasks while SDM's analysis assumes random patterns.
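How an "effective β" can be read off a trained head is sketched below, under the simplifying assumption (not necessarily the paper's exact procedure) that for standard dot-product Attention the leftover scale, once queries and keys are L2 normalized, is the product of their norms divided by √d_k. The activations here are random stand-ins for real model activations:

```python
import numpy as np

def effective_beta(queries, keys, d_k):
    """For standard dot-product attention, q.k / sqrt(d_k) equals
    (|q| |k| / sqrt(d_k)) * cos(q, k), so the scale left over once q and k
    are L2-normalized plays the role of the softmax temperature beta."""
    q_norms = np.linalg.norm(queries, axis=-1)
    k_norms = np.linalg.norm(keys, axis=-1)
    return np.outer(q_norms, k_norms) / np.sqrt(d_k)

# Hypothetical query/key activations from one attention head (random stand-ins).
rng = np.random.default_rng(0)
d_k = 64
queries = rng.normal(size=(16, d_k))   # 16 query positions
keys = rng.normal(size=(16, d_k))      # 16 key positions

betas = effective_beta(queries, keys, d_k)
print(f"median effective beta ~ {np.median(betas):.1f}")
```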
By establishing this connection, the paper offers interpretations for several Transformer components. It theorizes, for example, that the Feed Forward layers, which account for a substantial share of Transformer parameters, might function analogously to SDM's memory storage. Additionally, the work notes that the LayerNorm used in Transformers implicitly enforces a condition akin to L2 normalization, a requisite for the Attention-SDM approximation.
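The LayerNorm point can be checked numerically; the sketch below (not from the paper) drops LayerNorm's learned affine parameters and verifies that its output is the mean-centered input, L2 normalized and rescaled to a fixed norm of √d:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm without the learned gain and bias parameters."""
    mu = x.mean()
    var = x.var()
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
d = 512
x = rng.normal(size=d) * 3.0 + 1.5   # arbitrary scale and shift

y = layer_norm(x)
centered = x - x.mean()
l2_normalized = centered / np.linalg.norm(centered)

# LayerNorm output is the mean-centered, L2-normalized vector scaled by sqrt(d):
print(np.allclose(y, np.sqrt(d) * l2_normalized, atol=1e-3))
print(f"||LayerNorm(x)|| = {np.linalg.norm(y):.2f}, sqrt(d) = {np.sqrt(d):.2f}")
```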
Moreover, the work critiques the common reading of Attention weights by pointing out the influence of the varying L2 norms of the value vectors, a nuance that matters for accurately judging which information Attention is prioritizing.
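A hypothetical analysis helper in that spirit (illustrative, not the paper's method) rescales each Attention weight by the L2 norm of its value vector before comparing token contributions:

```python
import numpy as np

def norm_adjusted_attention(attn_weights, values):
    """Rescale attention weights by the L2 norm of each value vector,
    so rows compare the magnitude each token actually contributes."""
    v_norms = np.linalg.norm(values, axis=-1)          # (num_keys,)
    contrib = attn_weights * v_norms                   # (num_queries, num_keys)
    return contrib / contrib.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
attn = rng.dirichlet(np.ones(5), size=3)   # 3 query positions over 5 keys
values = rng.normal(size=(5, 8))           # value vectors with differing norms

print(norm_adjusted_attention(attn, values).round(3))
```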
Biological Considerations
The derived Attention-SDM relationship posits a plausible neural correlate in the cerebellum—a region already speculated to perform functions closely related to SDM. This biological equivalence is noteworthy because it bridges deep learning architectures with theories of human cognition, potentially illuminating how the brain might implement similar learning and memory processes.
Future Directions
The paper encourages exploration into optimizing Transformers further by borrowing concepts from SDM and affiliated frameworks like Vector Symbolic Architectures. These could provide pathways for incorporating more symbolic reasoning within deep learning frameworks, potentially enhancing models' interpretability and robustness.
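For readers unfamiliar with Vector Symbolic Architectures, the sketch below shows the basic ingredients in their simplest bipolar form (binding by element-wise multiplication, bundling by majority vote); it is a generic illustration, not a construction from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000                                    # hypervector dimensionality

def hv():
    return rng.choice([-1, 1], size=D)        # random bipolar hypervector

def bind(a, b):
    return a * b                              # element-wise multiply; self-inverse

def bundle(*vs):
    return np.sign(np.sum(vs, axis=0))        # majority-vote superposition

def sim(a, b):
    return float(a @ b) / D                   # normalized dot-product similarity

# Encode a tiny record {color: red, shape: square} as one hypervector.
color, shape, red, square = hv(), hv(), hv(), hv()
record = bundle(bind(color, red), bind(shape, square))

# Unbinding the "color" role from the record recovers something close to "red".
query = bind(record, color)
print(f"sim(query, red)    = {sim(query, red):.2f}")     # clearly positive
print(f"sim(query, square) = {sim(query, square):.2f}")  # near zero
```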
More broadly, this work contributes to the continued integration of deep learning with neuroscience, proposing paradigms in which complex neural architectures can be analyzed through biologically plausible models. Bridging the gap between deep neural networks and associative memory models in neuroscience holds promise for advancing our understanding of both artificial and natural intelligence.
In summary, this paper provides a compelling union of mathematical rigor, empirical validation, and theoretical insight into the Transformer model's design, thereby contributing significantly to the understanding of why Transformers, powered by Attention, have demonstrated exceptional capability across a gamut of machine learning tasks.