Recurrent Formulation of Softmax Attention
- Softmax attention is reformulated as sequential state updates, reducing quadratic memory costs and enabling fixed-size sequence representations.
- Recurrent approaches employ techniques like Taylor expansion and feature factorization to approximate higher-order interactions with reduced computational overhead.
- This formulation bridges RNN and Transformer architectures, supporting online inference, dynamic normalization, and scalability in diverse applications.
Softmax attention is a fundamental mechanism that enables neural models—most notably Transformers and recurrent neural networks—to perform content-based selection over variable-length sequences. The canonical formulation computes a weighted sum of values according to normalized exponential similarities between a query and a sequence of keys, but this comes with notable computational and representational challenges. A recurrent formulation of softmax attention seeks to reinterpret or reimplement this mechanism via stateful, sequential updates, connecting attention both conceptually and operationally to recurrent neural networks (RNNs). This article surveys the recurrent formulation of softmax attention and its theoretical, algorithmic, and practical consequences across diverse architectures.
1. Foundations of Softmax Attention and the Need for Recurrent Reformulations
Softmax attention assigns, for a query $q_t$ and key-value pairs $\{(k_j, v_j)\}_{j=1}^{N}$, an output

$$\mathrm{Attn}(q_t, K, V) \;=\; \sum_{j=1}^{N} \frac{\exp(q_t^\top k_j)}{\sum_{l=1}^{N} \exp(q_t^\top k_l)}\, v_j,$$

where $q_t, k_j \in \mathbb{R}^{d}$, $v_j \in \mathbb{R}^{d_v}$, and $N$ is the sequence length.
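The formula above translates directly into a few lines of code. The following is a minimal NumPy sketch (the function name and shapes are illustrative, not taken from any cited work); it mirrors the formula exactly, adding only the standard max-subtraction trick for numerical stability and omitting the usual $1/\sqrt{d}$ temperature.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Canonical (parallel, non-recurrent) softmax attention.

    Q: (T_q, d) queries, K: (T_k, d) keys, V: (T_k, d_v) values.
    Each output row is a weighted sum of the values, with weights given by
    normalized exponentials of query-key similarities, as in the formula above.
    """
    scores = Q @ K.T                                    # (T_q, T_k) similarity logits
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)           # row-wise softmax normalization
    return weights @ V                                  # (T_q, d_v) weighted sums of values
```

Note that both the score matrix and the stored keys and values grow with the sequence length, which is precisely the scaling limitation discussed next.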
While expressive, this formulation presents two main limitations:
- Quadratic Complexity and Memory Scaling: At inference or in online settings, softmax attention requires $O(N)$ computation per query (hence $O(N^2)$ over a length-$N$ sequence) and $O(N)$ memory to store all past hidden states, which does not scale to long sequences or massive retrieval workloads (Brébisson et al., 2016, Sieber et al., 24 May 2024, Mongaras et al., 31 Jul 2025).
- No Fixed-Size Sequence Representation: Standard softmax attention requires retention of all past hidden states for query-dependent lookups; no fixed-size encoding is produced (Brébisson et al., 2016).
A recurrent reformulation recasts the attention mechanism as a sequence of state updates, either by direct recurrence (as in RNNs), by surrogate iterative optimization, or by explicit expansion of the exponential function using, for example, a Taylor series. This enables approximation, theoretical analysis, and computational benefits.
2. Mathematical Structures: Recurrence, Taylor Expansion, and State-Space Connections
Several approaches make softmax attention recurrently tractable:
- Taylor Expansion and Feature Factorization: The exponential in softmax attention can be expanded as

$$\exp(q^\top k) \;=\; \sum_{n=0}^{\infty} \frac{(q^\top k)^n}{n!},$$

and each power can be written as $(q^\top k)^n = \langle q^{\otimes n}, k^{\otimes n} \rangle$ (Kronecker power), where the hidden "state" accumulates the corresponding moments $\sum_j k_j^{\otimes n} v_j^\top$ across time (Mongaras et al., 31 Jul 2025). Truncated versions (low-order Taylor) yield approximations: linear attention (first order) or higher-order linear attention (Mercat, 2020). A minimal recurrent sketch of this construction appears after this list.
- State-Space Formulation: The Dynamical Systems Framework expresses softmax attention, SSMs, and RNNs as generalized state recurrences of the form

$$h_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t\, h_t,$$

with (possibly input-dependent) transition and readout maps. For softmax attention, this requires an (in principle) infinite-dimensional latent state $h_t$ corresponding to all orders in the Taylor expansion; practical approximations use a finite-order truncation (Sieber et al., 24 May 2024).
- Recurrent Kernel Approximation: Linear attention and its magnitude-aware variant (MALA) remove the softmax nonlinearity, allowing constant-time lookups via

$$o_t \;=\; \frac{\phi(q_t)^\top \left( \sum_{j \le t} \phi(k_j)\, v_j^\top \right)}{\phi(q_t)^\top \sum_{j \le t} \phi(k_j)},$$

where both sums are running states that can be updated in $O(1)$ per step. Plain linear attention, however, neglects the magnitude sensitivity found in softmax; MALA compensates for this by restoring query-dependent scaling, improving the fidelity of the recurrent approximation (Brébisson et al., 2016, Fan et al., 1 Jul 2025).
- Recurrent Softmax Regression and Optimization: Viewing attention as "softmax regression," iterative Newton or gradient-descent updates yield a recurrent process for softmax-based loss minimization, with provable convergence and controllable step-wise error propagation (Deng et al., 2023, Li et al., 2023, Gao et al., 2023). These updates mimic the recurrent refinement seen in iterative self-attention; a minimal gradient-descent sketch is included below.
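To make the Taylor-expansion view concrete, the sketch below (a simplified construction in the spirit of the works cited above; the function names, per-order normalizers, and fixed truncation order are our own choices) maintains one fixed-size state per Taylor order and produces causal attention outputs without storing past keys or values. Setting `order=1` recovers a form of linear attention, and `order=0` reduces to a running average of values.

```python
import math
from functools import reduce

import numpy as np

def kron_power(x, n):
    """n-fold Kronecker power of a vector; order 0 gives the scalar [1.0]."""
    return reduce(np.kron, [x] * n, np.array([1.0]))

def taylor_recurrent_attention(Q, K, V, order=2):
    """Causal attention with exp(q.k) replaced by its Taylor truncation.

    For each order n <= `order`, a fixed-size state S[n] = sum_j k_j^{(x)n} v_j^T / n!
    and normalizer z[n] = sum_j k_j^{(x)n} / n! are updated recurrently, so no past
    keys or values need to be stored.  Even truncation orders keep the truncated
    exponential strictly positive; odd orders (including order == 1, i.e. a form
    of linear attention) do not.
    """
    T, d = Q.shape
    d_v = V.shape[1]
    S = [np.zeros((d ** n, d_v)) for n in range(order + 1)]
    z = [np.zeros(d ** n) for n in range(order + 1)]
    out = np.zeros((T, d_v))
    for t in range(T):
        for n in range(order + 1):                       # state update: absorb (k_t, v_t)
            phi_k = kron_power(K[t], n) / math.factorial(n)
            S[n] += np.outer(phi_k, V[t])
            z[n] += phi_k
        num = sum(kron_power(Q[t], n) @ S[n] for n in range(order + 1))
        den = sum(kron_power(Q[t], n) @ z[n] for n in range(order + 1))
        out[t] = num / den                               # query touches only the states
    return out
```

Because each query interacts only with the accumulated states, the per-step cost depends on $d^{\text{order}}$ but not on the sequence length, which is the efficiency argument for recurrent reformulations.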
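The "softmax regression" view can likewise be exercised directly. The toy below uses our own loss, step size, and problem sizes (the cited papers analyze Newton-type schemes with formal guarantees); it fits a vector x so that softmax(A x) approaches a target attention distribution via plain gradient descent, making the step-by-step, recurrent character of the refinement explicit.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def softmax_regression_gd(A, b, steps=1000, lr=0.1):
    """Minimize 0.5 * ||softmax(A x) - b||^2 by gradient descent.

    The softmax Jacobian is diag(p) - p p^T; chaining it through the squared
    loss gives the gradient in x.  Each iteration is one "recurrent" refinement
    of the attention distribution softmax(A x).
    """
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        p = softmax(A @ x)
        grad_u = (np.diag(p) - np.outer(p, p)) @ (p - b)   # dL/d(Ax)
        x -= lr * (A.T @ grad_u)                           # dL/dx via the chain rule
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))                 # keys acting as regression features
b = softmax(A @ rng.standard_normal(4))         # a reachable target distribution
x = softmax_regression_gd(A, b)
print(np.round(softmax(A @ x), 3))              # approaches b as iterations proceed
print(np.round(b, 3))
```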
3. Expressiveness, Information Compression, and Performance–Efficiency Trade-offs
The recurrent formulation of softmax attention touches on deep questions about the expressiveness and efficiency of attention:
- Expressiveness and Higher-order Interactions: Linear attention (which keeps only the first-order term of the expansion) fails to capture the high-order multiplicative interactions fundamental to the full softmax (Mongaras et al., 31 Jul 2025). The infinite sum in the Taylor expansion encodes complex combinatorial dependencies. Empirically, approximations using higher-order terms approach softmax performance, but only the full expansion matches its accuracy; a small numerical comparison of truncation orders follows the table below.
- Fixed-Size Representation and Fast Lookup: By summarizing a sequence into a fixed-size matrix (e.g., the running state $\sum_j \phi(k_j)\, v_j^\top$ in linear attention), one achieves fixed-size memory and constant-time lookup, at the expense of information loss, which is especially detrimental for tasks demanding detailed token-level weighting (Brébisson et al., 2016, Fan et al., 1 Jul 2025).
- Dynamic and Differentiable Normalization: Softmax attention inherently normalizes across variable input lengths, which allows generalization and flexible adaptation. Linear attention with hard normalization cannot generalize to variable-length inputs, but softmax-based methods or appropriately scaled surrogates (e.g., scalable-softmax) retain this property (Nakanishi, 31 Jan 2025, He et al., 17 Mar 2025).
| Formulation | Computational Cost | Memory | Expressiveness |
|---|---|---|---|
| Softmax Attention | $O(N^2 d)$ over a length-$N$ sequence | $O(Nd)$ (grows with length) | High (all orders, normalization) |
| Linear Attention | $O(N d^2)$ | $O(d^2)$ (fixed-size state) | Lower (first-order, normalization loss) |
| Higher-Order Linear (order $p$) | $O(N d^{\,p+1})$ | $O(d^{\,p+1})$ (fixed-size state) | Intermediate (limited order) |
| MALA | $O(N d^2)$ | $O(d^2)$ (fixed-size state) | Closer to softmax |
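As a quick illustration of the expressiveness gap summarized in the table, the snippet below (random logits and the error metric are arbitrary choices for illustration) normalizes Taylor truncations of the exponential and compares them with the exact softmax weights for a single query; the error generally shrinks as more orders are retained.

```python
import math

import numpy as np

def truncated_exp(x, order):
    """Partial sum of the Taylor series of exp(x) up to the given order."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

rng = np.random.default_rng(1)
scores = rng.standard_normal(8)                  # one query's similarity logits q.k_j

exact = np.exp(scores)
exact /= exact.sum()                             # exact softmax weights
for order in (1, 2, 4, 8):
    approx = truncated_exp(scores, order)        # low orders may even go negative
    approx = approx / approx.sum()               # renormalize the truncated weights
    print(f"order {order}: L1 error vs softmax = {np.abs(approx - exact).sum():.4f}")
```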
4. Practical Implementation: Algorithmic Strategies and Empirical Performance
- Online and Streaming Attention: Recurrent formulations enable online inference by processing one input at a time and updating a state (e.g., Gated Recurrent Context for speech recognition), often eliminating the need for context window hyperparameters (Lee et al., 2020, Lin et al., 3 Mar 2025).
- Downsampling and State Compression: Some models introduce downsampling networks or other spatial/channel reduction steps to compress and recurrently integrate input features before attention, improving both efficiency and convergence in, e.g., vision settings (Ren, 2017).
- Gating and Forgetting Mechanisms: The Forgetting Transformer explicitly integrates a forget gate into softmax attention, modulating the attention scores with cumulative multiplicative factors that decay past context, a mechanism directly inspired by LSTM forget gates but retaining the global mixing capability of softmax (Lin et al., 3 Mar 2025). A sketch of this gating appears after this list.
- Scalable and Self-adjusting Variants: Scalable-softmax and self-adjusting softmax modify the normalization or weighting to address failures of classical softmax in long contexts (e.g., attention fading, gradient vanishing), with empirical gains in length generalization and training stability (Nakanishi, 31 Jan 2025, Zheng et al., 25 Feb 2025).
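A compact way to see the forget-gate idea is to add a cumulative log-forget bias to the attention logits before the softmax, so that older tokens are smoothly down-weighted while global mixing is preserved. The sketch below follows that reading of the Forgetting Transformer; the gate values f are taken as given, and the exact gate parameterization and scaling used in the cited paper may differ.

```python
import numpy as np

def forgetting_softmax_attention(Q, K, V, f):
    """Causal softmax attention with a scalar forget gate f_t in (0, 1) per step.

    Each logit q_i.k_j is shifted by sum_{l=j+1..i} log f_l, i.e. by how much
    "forgetting" has accumulated between positions j and i.  Setting f == 1
    everywhere recovers ordinary causal softmax attention.
    """
    T, d = Q.shape
    c = np.cumsum(np.log(f))                       # c_i = sum_{l<=i} log f_l
    bias = c[:, None] - c[None, :]                 # bias[i, j] = sum_{j<l<=i} log f_l
    logits = Q @ K.T / np.sqrt(d) + bias
    causal = np.tril(np.ones((T, T), dtype=bool))
    logits = np.where(causal, logits, -np.inf)     # mask out future positions
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V
```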
5. Theoretical Insights: Universal Approximation, Sparsity, and Structured Penalties
- Universal Approximation: Even shallow, purely attention-based (softmax-only) architectures can universally approximate continuous sequence-to-sequence functions, provided suitable parameterization and scaling (Hu et al., 22 Apr 2025). Interpolation-based constructions show that softmax attention, via anchor selection and generalized ReLU, can recover a wide class of functions—including those previously thought to require feed-forward sublayers.
- Sparsity and Structured Optimization: Regularized and structured attention mechanisms generalize softmax by integrating sparsity (sparsemax) and group penalties (fusedmax, oscarmax), implemented by iterative optimization over the probability simplex. These recurrent, differentiable updates retain key theoretical properties (convexity, group-wise Jacobian structure) and augment interpretability while maintaining competitive accuracy (Niculae et al., 2017). A sparsemax sketch appears after this list.
- Max-Margin Selection and Recurrent Optimization Dynamics: The optimization trajectory of softmax attention under gradient descent is itself recurrent, converging in direction to max-margin SVM solutions that select "optimal" tokens. This implicit bias toward sparse, margin-maximizing selection explains the natural emergence of a "token selection" regime as training proceeds (Tarzanagh et al., 2023).
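For reference, sparsemax itself has a simple closed-form solution: sort the scores, identify the support, and threshold; its structured relatives (fusedmax, oscarmax) replace this projection with more elaborate proximal steps. A minimal, unbatched NumPy version (variable names are our own) is shown below.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, which is the source of
    the sparsity and interpretability discussed above.
    """
    z_sorted = np.sort(z)[::-1]                       # scores in decreasing order
    cssv = np.cumsum(z_sorted)                        # cumulative sums of sorted scores
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cssv                 # entries that remain positive
    k_z = k[support][-1]
    tau = (cssv[support][-1] - 1.0) / k_z             # threshold shared by the support
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([1.2, 0.8, -1.0])))          # -> [0.7, 0.3, 0.0]
```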
6. Current Limitations and Open Directions
- Quadratic Bottleneck vs. Expressivity: The recurrent perspective clarifies why approximations that improve efficiency (linear attention, kernelization, partial expansion) universally underperform softmax in terms of accuracy (Mongaras et al., 31 Jul 2025). Each low-order truncation sacrifices access to higher-degree interactions.
- Magnitude Awareness and Generalization: Linear attention neglects query magnitude and, as a result, loses the dynamic adaptation inherent in softmax attention—the basis for its strong generalization and spiky/selective behavior. Magnitude-aware designs (MALA) and similar modifications partially restore this adaptability without reverting to full quadratic cost (Fan et al., 1 Jul 2025).
- Hybrid and Hardware-Compatible Architectures: Variants such as softpick (rectified, non-sum-to-one attention) eliminate pathological behaviors (e.g., attention sinks, massive activations), improving quantization and hardware utilization, and are naturally compatible with blockwise or online attention algorithms such as FlashAttention (Zuhri et al., 29 Apr 2025). A rectified-weight sketch appears after this list.
- Theoretical Unification: The Dynamical Systems Framework bridges recurrent, state-space, and attention-based models, quantifying the trade-off between state size (the truncation order of the expansion) and expressivity, and suggesting potential for more systematic exploration of recurrent approximations of attention and vice versa (Sieber et al., 24 May 2024).
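As one concrete reading of "rectified, non-sum-to-one attention," the sketch below zeroes out the contribution of negative-logit tokens while still letting them enter the normalizer, so a row's weights may sum to less than one and no token is forced to absorb leftover probability mass. This is a schematic variant written for illustration; the exact softpick formula and its numerically stable, FlashAttention-compatible form are given in the cited paper.

```python
import numpy as np

def rectified_attention_weights(scores, eps=1e-6):
    """Rectified, non-sum-to-one alternative to softmax attention weights.

    Tokens with negative scores get exactly zero weight but still contribute
    to the denominator, so rows need not sum to one (no attention sink).
    Not numerically stabilized; illustrative only.
    """
    shifted = np.exp(scores) - 1.0                     # a zero score contributes nothing
    num = np.maximum(shifted, 0.0)
    den = np.abs(shifted).sum(-1, keepdims=True) + eps
    return num / den

row = np.array([[2.0, 0.5, -1.0, -3.0]])
print(rectified_attention_weights(row))               # weights sum to less than one
```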
7. Broader Implications and Outlook
The recurrent formulation of softmax attention not only provides an effective approximation strategy for efficient sequence modeling; it also elucidates the core theoretical properties that distinguish softmax attention from its linear approximations. The recurrent view exposes the source of softmax's global context sensitivity, dynamic normalization, and higher-order dependency modeling, properties central to the empirical success of Transformers. As model deployments demand both longer contexts and lower inference cost, research continues to explore the trade-offs embodied in recurrent approximations, magnitude-aware scaling, gradient-preserving substitutions, and hardware-amenable modifications. For practitioners, these developments offer practical pathways to balance accuracy, efficiency, and scalability in real-world sequence modeling applications.