Understanding Factual Recall in Transformers via Associative Memories
The paper "Understanding Factual Recall in Transformers via Associative Memories" by Nichani, Lee, and Bietti presents an in-depth paper of how shallow transformers, specifically those with a single layer of self-attention followed by an MLP, perform factual recall tasks. Through theoretical analysis and synthetic tasks, the paper addresses the question of how transformers can achieve near-optimal storage capacity, efficiently utilizing associative memories inherent in their architectures.
Key Contributions and Findings
- Associative Memory Models:
- The researchers examined both linear and MLP associative memory models, demonstrating that their storage capacities scale proportionally with their parameter counts. This is significant because it suggests that transformers can use such memories to store large numbers of associations, matching the heavy storage demands of factual recall.
- The construction for the linear associative memory stores a number of associations on the order of d² (up to logarithmic factors), where d is the embedding dimension, by summing outer products of output and input embeddings. The paper shows that such linear mappings store information efficiently even when the embeddings are random (see the sketch after this list).
- For MLP associative memories, capacity is improved by using polynomial transformations, and the authors prove that these models can likewise store a number of associations scaling linearly with their parameter count.
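As a concrete illustration of the outer-product idea, the sketch below builds a linear associative memory over random embeddings and checks recall with an argmax over candidate outputs. The dimension, number of associations, and normalization are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 256, 1000  # embedding dimension and number of stored associations (assumed values)

# Random input embeddings e_{x_i} and output unembeddings u_{y_i}, roughly unit norm.
E = rng.standard_normal((N, d)) / np.sqrt(d)
U = rng.standard_normal((N, d)) / np.sqrt(d)

# Outer-product construction: W = sum_i u_{y_i} e_{x_i}^T, a d x d matrix with d^2 parameters.
W = U.T @ E

# Recall: score every candidate output against W e_{x_j} and take the argmax.
scores = U @ (W @ E.T)          # scores[k, j] = <u_k, W e_{x_j}>
recalled = scores.argmax(axis=0)
print(f"recall accuracy with N={N}, d={d}: {(recalled == np.arange(N)).mean():.3f}")
```

Pushing N toward d² makes the cross-term noise comparable to the retrieval signal, which is the regime where the up-to-log-factor capacity analysis becomes relevant.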
- Synthetic Factual Recall Task:
- The paper introduces a synthetic factual recall task in which a model must map a subject token and a relation token, presented among irrelevant noise tokens, to the correct answer token (a minimal data-generation sketch appears after this list). The task provides a controlled setting for understanding how transformers perform these memory-intensive lookups.
- The paper establishes that the self-attention value matrices and the MLP act as associative memories, enabling a small one-layer transformer to achieve 100% accuracy on this task whenever either component has enough parameters, up to log factors, to store the facts.
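To make the task concrete, here is a minimal sketch of how such a dataset could be generated: every (subject, relation) pair is assigned a fixed answer token, and each input sequence places the subject and relation tokens at random positions among noise tokens. The vocabulary sizes, sequence length, and token layout below are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions).
n_subjects, n_relations, n_noise, n_answers = 100, 20, 50, 200
seq_len = 8

# The "facts": each (subject, relation) pair maps to one fixed answer token.
facts = {(s, r): int(rng.integers(n_answers))
         for s in range(n_subjects) for r in range(n_relations)}

def sample_example():
    """One example: subject and relation tokens at random positions among noise tokens."""
    s = int(rng.integers(n_subjects))
    r = int(rng.integers(n_relations))
    # Token ids live in disjoint ranges: subjects, then relations, then noise.
    tokens = n_subjects + n_relations + rng.integers(n_noise, size=seq_len)
    pos_s, pos_r = rng.choice(seq_len, size=2, replace=False)
    tokens[pos_s], tokens[pos_r] = s, n_subjects + r
    return tokens, facts[(s, r)]

x, y = sample_example()
print(x, "->", y)
```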
- Optimization Dynamics:
- Analysis of gradient flow on a simplified linear attention model provides insight into the learning dynamics. The model exhibits sequential learning, including a "hallucination" phase in which the output depends primarily on the relation token, producing a plausible but generally incorrect answer, before further training yields correct, subject-dependent recall (a toy training sketch follows this list).
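The sketch below trains a toy one-head attention model on the synthetic task with ordinary gradient-based updates and logs a "subject-sensitivity" diagnostic: if predictions do not change when the subject is swapped while the relation is held fixed, the model is still in the relation-only ("hallucination") regime. This is not the exact simplified linear-attention model analyzed in the paper; the architecture, initialization, optimizer, and sizes are assumptions chosen only to illustrate the diagnostic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_s, n_r, n_a = 64, 50, 5, 64              # illustrative sizes (assumptions)

# Frozen random embeddings/unembeddings; one arbitrary answer per (subject, relation).
E_s = torch.randn(n_s, d) / d**0.5
E_r = torch.randn(n_r, d) / d**0.5
U = torch.randn(n_a, d) / d**0.5
facts = torch.randint(n_a, (n_s, n_r))

# A single simplified attention head: the relation token queries {subject, relation};
# the value matrix maps the attended embedding to output logits through U.
W_kq = (0.01 * torch.randn(d, d)).requires_grad_()
W_v = (0.01 * torch.randn(d, d)).requires_grad_()
opt = torch.optim.Adam([W_kq, W_v], lr=1e-2)

def forward(s, r):
    es, er = E_s[s], E_r[r]
    scores = torch.stack([(er @ W_kq * es).sum(-1),      # attention to subject
                          (er @ W_kq * er).sum(-1)],     # attention to relation
                         dim=-1)
    attn = scores.softmax(-1)
    ctx = attn[:, :1] * es + attn[:, 1:] * er
    return (ctx @ W_v) @ U.T                              # logits over answer tokens

for step in range(3001):
    s, r = torch.randint(n_s, (512,)), torch.randint(n_r, (512,))
    loss = F.cross_entropy(forward(s, r), facts[s, r])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            pred = forward(s, r).argmax(-1)
            acc = (pred == facts[s, r]).float().mean()
            # Hallucination diagnostic: swap the subject, keep the relation.
            s2 = torch.randint(n_s, (512,))
            sens = (pred != forward(s2, r).argmax(-1)).float().mean()
        print(f"step {step:4d}  loss {loss.item():.3f}  acc {acc:.2f}  subject-sensitivity {sens:.2f}")
```

Low subject-sensitivity together with low accuracy corresponds to relation-only (hallucinated) outputs; both rising together indicates the transition to genuine factual recall.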
- Trade-off Between Self-Attention and MLP:
- The paper highlights a key trade-off between storing factual associations in the self-attention layer versus the MLP: facts can reside in either component, so limited capacity in one can be compensated by allocating more capacity to the other.
- Lower Bounds and Empirical Validation:
- Through an information-theoretic analysis, the authors prove a lower bound showing that their constructions are near-optimal: up to log factors, the number of facts a model can recall is limited by its overall parameter count. Empirical validations reinforce the theoretical predictions and demonstrate that total parameter count plays a critical role in factual recall capacity.
Implications and Future Directions
The research combines theoretical and empirical evidence to sharpen our understanding of transformers' storage capacities and factual recall capabilities. Practically, it suggests that treating attention and MLP weights as complementary associative memories can lead to more memory-efficient architectures for factual recall, potentially improving current models' efficiency and accuracy.
Future research could expand on these findings by exploring dynamic embedding strategies or adjusting priors on embeddings to optimize memory and retrieval efficiency. Furthermore, understanding how these principles can be applied to larger, more complex models could advance the development of transformers in real-world applications such as knowledge databases and LLMs.
In conclusion, the strategic use of associative memories within transformers presents a viable pathway to achieving optimal factual recall efficiency, marking a significant step towards understanding the memorization capabilities of these models.