Second-Order Linearized Cross-Attention
- Second-order linearized cross-attention is an approximation method that uses a second-order Taylor expansion of the softmax exponential to capture more of its nonlinearity while keeping computational complexity sub-quadratic.
- The approach leverages layer normalization, scaling, and efficient key–value summary structures to integrate first- and second-order interactions for large-scale inference.
- It offers a promising balance between accuracy and efficiency for long sequences, though further empirical validation is needed to assess its practical trade-offs.
Second-order linearized cross-attention is an approximation technique for the cross-attention mechanism in neural networks, designed to achieve sub-quadratic memory and computational complexity while more closely mirroring the curvature of the standard softmax attention. By leveraging a Taylor expansion of the exponential function in the softmax, it introduces additional terms beyond the first-order (linear) kernel feature map used by standard linear transformers. This approach organizes higher-order interactions between queries and keys into summary structures that are tractable for large-scale inference when the key/query dimension is small relative to the sequence length (Mercat, 2020).
1. Mathematical Foundation
The standard softmax cross-attention computes, for queries $q_i$, keys $k_j$, and values $v_j$:

$$\mathrm{Att}(Q, K, V)_i = \frac{\sum_j \exp(q_i^\top k_j)\, v_j}{\sum_j \exp(q_i^\top k_j)}.$$
Linearized attention replaces the exponential kernel with an inner product in a feature space defined by a map $\phi$, yielding:

$$\mathrm{LinAtt}(Q, K, V)_i = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)},$$

so that the key–value summary $\sum_j \phi(k_j)\, v_j^\top$ is computed once and reused for every query.
Second-order linearized cross-attention, as introduced by Mercat, applies a second-order Taylor expansion to the exponential:

$$\exp(q_i^\top k_j) \;\approx\; 1 + q_i^\top k_j + \tfrac{1}{2}\,(q_i^\top k_j)^2,$$

where $q_i$ and $k_j$ are layer-normalized queries and keys, and a scalar $s$ controls the scale of $q_i^\top k_j$ to maintain approximation accuracy. Both the numerator and the denominator of the attention are approximated in this manner.
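As a quick sanity check on the truncation (our own arithmetic, not from the paper): for a dot product of $x = 0.5$ the expansion gives

$$1 + 0.5 + \tfrac{0.5^2}{2} = 1.625 \quad\text{vs.}\quad e^{0.5} \approx 1.649,$$

a relative error of roughly 1.4%, whereas for $x = 2$ it gives $5$ against $e^{2} \approx 7.389$, which illustrates why the dot products must be kept small.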
The second-order term is efficiently formulated per value dimension $c$:

$$\sum_j (q_i^\top k_j)^2\, v_{jc} \;=\; q_i^\top \Big(\sum_j v_{jc}\, k_j k_j^\top\Big)\, q_i \;=\; q_i^\top S^{(2)}_c\, q_i,$$

defining the second-order key summary $S^{(2)}_c = \sum_j v_{jc}\, k_j k_j^\top$, a $d \times d$ matrix for each of the $d_v$ value dimensions.
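To make the identity concrete, the following NumPy sketch (an illustration under our own naming, not Mercat's reference code) checks that the per-channel key summary reproduces the second-order term for a single query without materializing the attention scores against every key:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_v = 128, 8, 4                      # number of keys, key/query dim, value dim
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Direct computation of sum_j (q . k_j)^2 * v_j.
scores = K @ q                              # (n,)
direct = (scores ** 2) @ V                  # (d_v,)

# Summary-based computation: S2 has shape (d_v, d, d) and does not depend on q.
S2 = np.einsum('jc,ja,jb->cab', V, K, K)    # second-order key summary
via_summary = np.einsum('a,cab,b->c', q, S2, q)

assert np.allclose(direct, via_summary)
```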
2. Algorithmic Implementation
The computation proceeds without forming the explicit $n_q \times n$ attention matrix, relying on intermediate key–value summary matrices and tensors. The steps are:
- Layer Normalization and Scaling: $\hat q_i = \mathrm{LN}(q_i)$, $\hat k_j = \mathrm{LN}(k_j)$ (affine-free), scaled by the factor $s$.
- First-Order Summary: $S^{(1)} = \sum_j \hat k_j\, v_j^\top \in \mathbb{R}^{d \times d_v}$.
- First-Order Numerator: $N^{(1)}_i = \hat q_i^\top S^{(1)}$.
- Second-Order Summary: for each value dimension $c$, $S^{(2)}_c = \sum_j v_{jc}\, \hat k_j \hat k_j^\top$.
- Second-Order Numerator: for each $c$, $N^{(2)}_{i,c} = \tfrac{1}{2}\, \hat q_i^\top S^{(2)}_c\, \hat q_i$.
- Zeroth-Order Numerator: $N^{(0)} = \sum_j v_j$ (identical for every query).
- Combine Numerators: $N_i = N^{(0)} + N^{(1)}_i + N^{(2)}_i$.
- Compute Denominator: repeat steps 2–7 with $v_j$ set to $1$ for all $j$, yielding $D_i$.
- Final Attention Output: for each query $i$, $y_i = N_i / D_i$.
Both $S^{(1)}$ and $S^{(2)}$ can be constructed in a single streaming pass over the keys; chunking is possible under memory constraints. A minimal end-to-end sketch of the procedure is given below.
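The sketch below is our own illustrative implementation of the steps above (not Mercat's reference code); the function name, the default scale `s`, and the epsilon in the layer norm are assumptions.

```python
import numpy as np

def second_order_linear_cross_attention(Q, K, V, s=0.5):
    """Illustrative sketch. Q: (n_q, d) queries, K: (n, d) keys,
    V: (n, d_v) values, s: empirically chosen scale factor (assumed)."""
    def layer_norm(X):
        # Affine-free layer normalization over the feature dimension.
        mu = X.mean(axis=-1, keepdims=True)
        sigma = X.std(axis=-1, keepdims=True) + 1e-6
        return (X - mu) / sigma

    Qh = s * layer_norm(Q)                                 # step 1: normalize and scale
    Kh = s * layer_norm(K)

    def numerator(values):
        S1 = Kh.T @ values                                 # step 2: first-order summary (d, d_v)
        N1 = Qh @ S1                                       # step 3: first-order numerator
        S2 = np.einsum('jc,ja,jb->cab', values, Kh, Kh)    # step 4: second-order summary
        N2 = 0.5 * np.einsum('ia,cab,ib->ic', Qh, S2, Qh)  # step 5: second-order numerator
        N0 = values.sum(axis=0)                            # step 6: zeroth-order numerator
        return N0 + N1 + N2                                # step 7: combine

    num = numerator(V)                                     # steps 2-7 with the true values
    den = numerator(np.ones((K.shape[0], 1)))              # step 8: same pass with v_j = 1
    return num / den                                       # step 9: final output (n_q, d_v)

# Usage example on random data.
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 8))
K = rng.standard_normal((256, 8))
V = rng.standard_normal((256, 4))
print(second_order_linear_cross_attention(Q, K, V).shape)  # (16, 4)
```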
3. Computational Complexity and Efficiency
| Approximation | Time Complexity | Memory Complexity |
|---|---|---|
| First-order | $O(n\, d\, d_v)$ | $O(d\, d_v)$ |
| Second-order | $O(n\, d^2\, d_v)$ | $O(d^2\, d_v)$ |
| Softmax | $O(n^2\, d)$ | $O(n^2)$ |
Second-order linearized attention incurs an extra factor of $d$ in both time and memory over the first-order scheme, due to the $d \times d \times d_v$ second-order summary tensor, but remains more efficient than softmax when $d^2 \ll n$.
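As a rough illustration (our own example sizes, not from the paper): with $d = d_v = 64$ and $n = 65{,}536$ keys, the second-order summary holds $d^2 d_v = 262{,}144$ entries, whereas a materialized softmax attention matrix over $65{,}536$ queries and keys would hold $n^2 \approx 4.3 \times 10^9$ entries; here $d^2 = 4096 \ll n$, so the second-order scheme is the cheaper option.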
4. Kernel Feature Maps
Mercat’s formulation for the first-order mapping is $\phi(x) = (1, x)$, covering both the constant and the linear contributions in one structure. The second-order term builds on all pairwise products of query-vector components multiplied by the corresponding pairwise key–value summaries.
This feature map approach allows for tractable computation of higher-order approximations. A plausible implication is that additional orders could be implemented, but with quickly rising cost.
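A minimal sketch of the first-order feature map (our own illustration): the inner product $\phi(q)^\top \phi(k)$ recovers the zeroth- and first-order Taylor terms $1 + q^\top k$ in a single structure.

```python
import numpy as np

def phi(x):
    # First-order feature map: prepend a constant 1 to the vector.
    return np.concatenate(([1.0], x))

rng = np.random.default_rng(1)
q, k = rng.standard_normal(4), rng.standard_normal(4)
assert np.isclose(phi(q) @ phi(k), 1.0 + q @ k)
```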
5. Stability, Normalization, and Limitations
Stability requires careful normalization: both queries and keys are layer-normalized (affine-free) and scaled by an empirically chosen factor $s$ to ensure that the dot products $q_i^\top k_j$ remain in the $O(1)$ regime where the Taylor expansion is accurate. If $|q_i^\top k_j|$ grows well beyond this regime, the Taylor approximation degrades, and reverting to softmax or re-tuning $s$ is advised. Higher-order expansions are theoretically possible but become computationally intractable beyond order two unless $d$ is very small compared to $n$.
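A simple diagnostic along these lines (our own helper, not from the paper; the threshold of 1 is an assumption) is to track the largest normalized dot product and fall back if it leaves the accurate regime. In practice the check would be run on a subsample or in chunks rather than over the full score matrix.

```python
import numpy as np

def check_taylor_regime(Qh, Kh, threshold=1.0):
    """Return the largest |q_i . k_j| after normalization/scaling and whether
    it stays within the assumed accuracy threshold of the Taylor expansion."""
    max_dot = np.abs(Qh @ Kh.T).max()
    return max_dot, max_dot <= threshold
```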
6. Empirical Evaluation and Recommendations
No empirical evaluation on real-world data sets (such as machine translation or image–text tasks) is reported; existing tests use random data, and no performance benchmarks are provided against either softmax or first-order linear attention. A plausible implication is that further research is required to assess the impact on accuracy, perplexity, and runtime/memory usage in practical settings (Mercat, 2020).
Recommended usage scenarios are those in which $d$ is small to moderate and the number of keys $n$ is very large, so that $d^2$ remains much less than $n$. Under strict memory requirements, or where only a coarse approximation to the softmax is tolerable, first-order schemes are preferred. Monitoring the scale of $q_i^\top k_j$ is necessary to avoid breakdown of the Taylor approximation.
7. Context and Significance
Second-order linearized cross-attention extends the "linear transformer" line of work (Katharopoulos et al.; Shen et al.), which sought to make attention efficient via kernel tricks and low-rank approximations (Mercat, 2020). The addition of second-order terms recovers more of the nonlinearity of softmax, potentially yielding more faithful approximations in long-sequence contexts. The approach holds particular promise when system constraints make quadratic complexity in sequence length infeasible, though the ultimate trade-off between accuracy and resource consumption on real-world tasks remains to be fully characterized.