Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

Published 11 May 2026 in stat.ML, cond-mat.dis-nn, cond-mat.stat-mech, and cs.LG | (2605.10795v1)

Abstract: LLMs demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input-output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding $d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces $p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.


Authors (3)

Summary

The paper establishes a sharp capacity threshold (α_c = 1/2) for factual recall, exceeding classical Hebbian limits.
It introduces a decoupled formulation with Gaussian embeddings to enable precise statistical physics and replica method analysis.
Findings reveal that optimal memory strategies rely on boosting target scores with low variance, guiding future neural architecture design.

Factual Memory Limits and Mechanisms in Linear Associative Memories

Problem Setting and Motivation

This paper rigorously investigates the fundamental limits governing factual recall in associative memory models constructed via linear maps. In the context of LLMs, the ability to memorize factual associations—mapping input tokens (e.g., city names) to output tokens (e.g., countries)—has become a critical feature. Theoretical understanding of such limits, particularly as the number of associations ( $p$ ) and embedding dimension ( $d$ ) scale, is lacking. The authors study the optimal storage capacity and mechanistic principles underlying memorization in linear associative memories with Gaussian input and output embeddings, focusing on both full-rank and rank-constrained (two-layer) models.

Decoupled Problem Formulation and Analytical Tractability

In the original associative memory problem, every input competes against the same shared set of targets, so the separation constraints are strongly correlated and a direct analytical treatment is difficult. To circumvent this, the authors introduce a "decoupled" variant, where each input vector is paired with its own independent set of candidate outputs. This removes the correlations that come from sharing one target set across inputs, making the problem amenable to analysis using statistical physics techniques. Crucially, numerical and analytical evidence indicates that the decoupled model matches the original in storage capacity, weight spectra, and storage mechanism, so results derived for it carry over.

Figure 1: Task illustration (left), showing input-output factual associations (cities to countries), and empirical accuracy versus load parameter $\alpha = p \log p / d^2$ (right) for both original and decoupled problems, confirming identical capacity thresholds.
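
The difference between the two formulations can be made concrete with a minimal NumPy sketch. It assumes that the score of output $y_j$ for input $x_i$ is $y_j^\top W x_i$ and that recall succeeds when the target score exceeds every competing score. The function names and array layouts (`X`, `Y`, `Z`) are illustrative choices, not the authors' code.

```python
import numpy as np

def original_margins(W, X, Y):
    """Original problem: every input competes against the shared target set.

    X, Y: (p, d) arrays whose rows are the inputs x_i and their targets y_i.
    Returns, per input, the target score minus the largest competing score.
    """
    S = (X @ W.T) @ Y.T                  # S[i, j] = y_j . (W x_i)
    target = np.diag(S).copy()
    np.fill_diagonal(S, -np.inf)         # exclude the correct target
    return target - S.max(axis=1)

def decoupled_margins(W, X, Y, Z):
    """Decoupled problem: input i competes only against its own set Z[i].

    Z: (p, p - 1, d) array of competitors drawn independently for each input.
    """
    WX = X @ W.T                         # row i is W x_i
    target = np.einsum("id,id->i", WX, Y)
    competing = np.einsum("id,ijd->ij", WX, Z)
    return target - competing.max(axis=1)

def recall_accuracy(margins):
    """Fraction of associations whose target score beats all competitors."""
    return float(np.mean(margins > 0))
```

With Gaussian embeddings, `Z` could be drawn as `rng.standard_normal((p, p - 1, d))`. That takes memory of order $p^2 d$, so this layout is only practical for small $p$.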

Sharp Capacity Thresholds and Mechanistic Insights

Through both analytic derivation and extensive numerical validation, the paper establishes a sharp capacity threshold for reliable factual recall. In terms of the load parameter $\alpha = p \log p / d^2$, the critical value is $\alpha_c = p_c \log p_c / d^2 = 1/2$, which fixes the maximum number of associations $p_c$ that can be memorized for a given embedding dimension $d$. Notably, this threshold significantly exceeds that of classical Hebbian constructions, which saturate at $\alpha \approx 1/8$.
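
To get a feel for the numbers, the relation $p \log p = \alpha d^2$ can be inverted numerically for $p$. The sketch below is illustrative only: `critical_p` is a hypothetical helper, it uses the natural logarithm, and it applies the asymptotic thresholds quoted above without any finite-size correction.

```python
import math

def critical_p(d, alpha_c=0.5):
    """Solve p * log(p) = alpha_c * d**2 for p > 1 by bisection."""
    target = alpha_c * d ** 2
    lo, hi = 1.0, target + math.e        # p*log(p) is 0 at lo and >= target at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for d in (64, 256, 1024):
    p_opt = critical_p(d)                 # asymptotic optimum, alpha_c = 1/2
    p_heb = critical_p(d, alpha_c=1 / 8)  # Hebbian saturation load quoted above
    print(f"d={d}: p_c ~ {p_opt:.0f}, Hebbian ~ {p_heb:.0f}, "
          f"ratio {p_opt / p_heb:.2f}")
```

Because of the $\log p$ factor, a fourfold gain in $\alpha$ does not translate into exactly four times as many stored associations.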

Using the replica method, the authors characterize the free entropy of the solution space and demonstrate its abrupt collapse at $\alpha_c$ . As the load approaches capacity, the set of admissible weights shrinks, leading to concentration phenomena in the distribution of association scores.

Empirical Confirmation and Structural Equivalence

The paper provides three lines of evidence supporting the equivalence of the original and decoupled formulations:

Retrieval accuracy curves for both models are indistinguishable for a range of dimensions and rank fractions, confirming a common capacity threshold ( $\alpha_c$ ).
The singular value spectra of optimal weight matrices coincide between the models and match theoretical predictions.
The score distributions for correct and incorrect associations show that memorization is achieved not by suppressing non-target scores, but through lifting target scores just above the deterministic threshold set by the maximum competing score (approximately $\sqrt{2\log p}$ for Gaussian embeddings).

Figure 2: Empirical accuracy versus load parameter for various rank fractions $\kappa$ , solidifying threshold equivalence and revealing performance gains beyond Hebbian limits.

Figure 3: Singular value distributions for original (top) and decoupled (bottom) problems at capacity, overlaid with theoretical predictions, highlighting equivalence and deviation from initial spectra.

Figure 4: Aggregated histograms of target versus non-target scores for varying load levels; learning concentrates target scores at the tail of the non-target distribution, in both models.
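
The extreme-value threshold $\sqrt{2\log p}$ can be checked with a short simulation. This sketch assumes the competing scores for one input behave like $p - 1$ independent standard Gaussians, which is a simplification of the trained model. `max_competing_scores` is a hypothetical helper name.

```python
import numpy as np

def max_competing_scores(p, trials=200, seed=0):
    """Largest of p - 1 standard Gaussian scores, over independent trials."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((trials, p - 1)).max(axis=1)

for p in (100, 1_000, 10_000):
    m = max_competing_scores(p)
    print(f"p={p}: mean max {m.mean():.2f}, std {m.std():.2f}, "
          f"sqrt(2 log p) = {np.sqrt(2 * np.log(p)):.2f}")
```

The maximum fluctuates far less than an individual score, which is why it acts as an almost deterministic bar that target scores must clear. At moderate $p$ the empirical maximum typically sits somewhat below $\sqrt{2\log p}$, since the correction to the leading-order value vanishes only slowly for Gaussian extremes.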

Rank Constraints and Two-Layer Linear Networks

Extending the analysis to rank-constrained weight matrices ( $W = UV^\top$ , with the rank set by the rank fraction $\kappa$ ), the authors derive the capacity threshold $\alpha_c(\kappa)$ as a function of $\kappa$. The resulting expression is written in terms of the quarter-circle law and its associated quantile function. This generalizes the full-rank result, yielding explicit relations between network architecture and memory capacity.
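
For readers unfamiliar with the two ingredients named here, the sketch below evaluates the quarter-circle law and its quantile function. It does not reproduce the paper's capacity formula. It assumes the standard normalization: density $\sqrt{4 - \lambda^2}/\pi$ on $[0, 2]$, the limiting singular-value distribution of a $d \times d$ Gaussian matrix with entries of variance $1/d$. Function names are illustrative.

```python
import numpy as np

def quarter_circle_cdf(x):
    """CDF of the quarter-circle law with density sqrt(4 - x^2) / pi on [0, 2]."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 2.0)
    return (x * np.sqrt(4 - x ** 2) / 2 + 2 * np.arcsin(x / 2)) / np.pi

def quarter_circle_quantile(q, iters=60):
    """Invert the CDF by bisection on [0, 2]."""
    lo, hi = 0.0, 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if quarter_circle_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def empirical_singular_values(d, seed=0):
    """Singular values of a d x d Gaussian matrix with entries of variance 1/d."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, d)) / np.sqrt(d)
    return np.linalg.svd(G, compute_uv=False)
```

As a sanity check, the empirical median of `empirical_singular_values(d)` should approach `quarter_circle_quantile(0.5)` as `d` grows.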

Finite-Size Effects and Slow Convergence

A salient feature is the slow finite-size convergence to the asymptotic threshold. Accuracy improves with dimension, but only logarithmically: the finite-size corrections to the capacity threshold decay slowly, so the empirical transition point drifts noticeably at moderate sizes. This is illustrated via empirical scaling analyses.

Figure 5: Finite-size scaling of empirical critical load, demonstrating logarithmic convergence toward the $\alpha_c = 1/2$ asymptote.

Theoretical Implications and Mechanistic Comparison

The mechanistic analysis shows that optimal memorization does not work like the Hebbian rule, which separates the mean target score from the non-target scores but leaves the target scores with large variance. Instead, the optimal solution boosts target scores precisely and with suppressed variance, so that each one clears the maximum competing score even when the number of competitors is large. This is seen both analytically and in the narrowness of target score histograms near capacity.

Figure 6: Score distributions for Hebbian ansatz, illustrating large overlap and variance compared to optimal solutions, confirming suboptimal storage mechanism.
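
The Hebbian baseline is easy to simulate. The following is a minimal sketch, not a reproduction of Figure 6. It assumes Gaussian embeddings scaled by $1/\sqrt{d}$ (so rows have norm close to 1) and the outer-product rule $W = \sum_i y_i x_i^\top$. The helper names `hebbian_scores` and `summarize` are hypothetical.

```python
import numpy as np

def hebbian_scores(p, d, seed=0):
    """Target and non-target scores under the Hebbian outer-product rule."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((p, d)) / np.sqrt(d)   # assumed scaling: |x_i| ~ 1
    Y = rng.standard_normal((p, d)) / np.sqrt(d)
    W = Y.T @ X                                    # sum_i y_i x_i^T
    S = (X @ W.T) @ Y.T                            # S[i, j] = y_j . (W x_i)
    target = np.diag(S).copy()
    off = S[~np.eye(p, dtype=bool)].reshape(p, p - 1)
    return target, off

def summarize(p, d):
    """Recall accuracy plus spread of target and non-target scores."""
    target, off = hebbian_scores(p, d)
    accuracy = float(np.mean(target > off.max(axis=1)))
    return accuracy, float(target.mean()), float(target.std()), float(off.std())

for p in (50, 500, 2000):
    acc, t_mean, t_std, off_std = summarize(p, d=64)
    print(f"p={p}: accuracy {acc:.2f}, target {t_mean:.2f} +/- {t_std:.2f}, "
          f"non-target std {off_std:.2f}")
```

In this sketch the mean target score stays near 1 while the spread of both target and non-target scores grows with $p$. Recall fails once the largest competitor overtakes a typical target score.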

Practical and Theoretical Implications

By quantifying the maximal factual recall of linear associative memories, the paper informs the design of neural architectures for memorization tasks, including key-value retrieval, LLMs, and memory-augmented neural networks. The identified scaling law ( $p_c \log p_c / d^2 = 1/2$ ) sets an asymptotic limit for recall based on a single linear map, which can inform choices of training protocol and network scaling. Mechanistically, the work highlights the importance of deterministic score concentration over statistical separation, guiding future architecture and loss function designs.

Theoretically, the equivalence of the decoupled and original problems, coupled with replica-symmetric validity even for the non-convex rank-constrained case, points toward universality and tractability in high-dimensional learning systems. Extension to nonlinear multi-layer architectures and more realistic embedding distributions remains an open direction.

Conclusion

This paper provides the first precise, analytical characterization of factual memory limits in linear associative memories, via both decoupled and original formulations. The sharp transition in capacity, superior to classical Hebbian constructions, is substantiated through rigorous statistical physics techniques and numerical experiments. The results furnish mechanistic insights into optimal memory strategies and establish critical scaling laws relevant to practical neural architectures. Future research may leverage these techniques to investigate universality, extend to nonlinear settings, and adapt to real-world embedding statistics.
