Statistical Advantage of Softmax Attention: Insights from Single-Location Regression (2509.21936v1)
Abstract: Large language models (LLMs) rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear activations) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
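To make the setup concrete, the following is a minimal sketch of the single-location regression task and the softmax-versus-linear attention comparison described in the abstract. It assumes a simplified formalization (label equal to a linear readout of the token at a uniformly random position, and a rank-one attention head with query and value vectors); the dimensions, variable names, and the choice of aligning the query and value with the ground-truth direction are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 64                                   # sequence length, token dimension (illustrative)
theta = rng.standard_normal(d) / np.sqrt(d)    # hypothetical ground-truth direction

def sample_task():
    """Single-location regression (assumed formalization):
    the label depends only on the token at a uniformly random position J."""
    X = rng.standard_normal((L, d))            # L tokens of dimension d
    J = rng.integers(L)                        # random relevant location
    y = X[J] @ theta                           # linear transformation of that single token
    return X, y

def softmax_attention_predict(X, q, v):
    """Attention head with softmax scores over tokens, then a linear readout."""
    scores = X @ q
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return (weights @ X) @ v

def linear_attention_predict(X, q, v):
    """Same head with the softmax removed (identity activation on the scores)."""
    scores = X @ q
    return (scores[:, None] * X).sum(axis=0) @ v

X, y = sample_task()
q = v = theta                                  # hypothetical choice of query/value vectors
print("label:", y)
print("softmax attention:", softmax_attention_predict(X, q, v))
print("linear attention: ", linear_attention_predict(X, q, v))
```

The sketch only illustrates the functional forms being compared: the softmax head can concentrate its weights on the relevant token, whereas the linear head averages contributions from all tokens, which is the qualitative gap the paper quantifies via its high-dimensional analysis.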