Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Published 6 Mar 2026 in cs.LG and q-fin.CP | (2603.06875v1)

Abstract: Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention: a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is required. We validate on four domains (64 to 4,096 dimensions). At generation temperature, stochastic attention is 2.6 times more novel and 2.0 times more diverse than the best learned baseline (a variational autoencoder trained on the same patterns), while matching a Metropolis-corrected gold standard. A simple signal-to-noise rule selects the operating temperature for any dimension. The approach requires no architectural changes and extends naturally to retrieval-augmented generation and in-context learning.
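To make the mechanism concrete, here is a minimal sketch of the sampler the abstract describes. It assumes the standard modern Hopfield energy E(xi) = -(1/beta) log sum_i exp(beta * x_i . xi) + (1/2)||xi||^2, whose negative gradient step is exactly the softmax attention readout, and uses an unadjusted Langevin update (no Metropolis correction). The function names, hyperparameter values, and toy data below are illustrative choices, not taken from the paper.

```python
import numpy as np

def hopfield_energy_grad(xi, X, beta):
    """Gradient of the modern Hopfield energy
    E(xi) = -(1/beta) * logsumexp(beta * X @ xi) + 0.5 * ||xi||^2.
    The softmax weights p are exactly the attention map, so one
    gradient-descent step (xi -> X.T @ p) recovers standard attention."""
    scores = beta * (X @ xi)         # similarity of the query state to each stored pattern
    scores -= scores.max()           # shift for numerical stability
    p = np.exp(scores)
    p /= p.sum()                     # softmax weights == attention map
    return xi - X.T @ p              # gradient of E at xi

def stochastic_attention(xi0, X, beta=1.0, temp=0.1, step=0.05,
                         n_steps=500, rng=None):
    """Unadjusted Langevin dynamics targeting p(xi) ~ exp(-E(xi) / temp).
    temp -> 0: near-deterministic retrieval of a stored pattern;
    larger temp: broader, more generative samples around the patterns."""
    if rng is None:
        rng = np.random.default_rng(0)
    xi = xi0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(xi.shape)
        xi = (xi - step * hopfield_energy_grad(xi, X, beta)
              + np.sqrt(2.0 * step * temp) * noise)
    return xi

# Toy usage: 8 stored 64-dimensional patterns, query near pattern 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 64))     # rows are the stored patterns (keys = values)
query = X[0] + 0.1 * rng.standard_normal(64)
retrieved = stochastic_attention(query, X, beta=4.0, temp=0.01)  # retrieval regime
sampled = stochastic_attention(query, X, beta=4.0, temp=0.5)     # generation regime
```

This sketch omits the accept/reject step; the "Metropolis-corrected gold standard" the abstract benchmarks against would add a Metropolis-Hastings correction to each Langevin proposal to remove discretization bias.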
