
Second-Order Linearized Cross-Attention

Updated 2 April 2026
  • Second-order linearized cross-attention is an approximation method that uses a second-order Taylor expansion of the softmax exponential to capture part of its nonlinearity at reduced computational complexity.
  • The approach leverages layer normalization, scaling, and efficient key–value summary structures to integrate first- and second-order interactions for large-scale inference.
  • It offers a promising balance between accuracy and efficiency for long sequences, though further empirical validation is needed to assess its practical trade-offs.

Second-order linearized cross-attention is an approximation technique for the cross-attention mechanism in neural networks, designed to achieve sub-quadratic memory and computational complexity while more closely mirroring the curvature of the standard softmax attention. By leveraging a Taylor expansion of the exponential function in the softmax, it introduces additional terms beyond the first-order (linear) kernel feature map used by standard linear transformers. This approach organizes higher-order interactions between queries and keys into summary structures that are tractable for large-scale inference when the key/query dimension is small relative to the sequence length (Mercat, 2020).

1. Mathematical Foundation

The standard softmax cross-attention computes, for queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{m \times d_k}$, and values $V \in \mathbb{R}^{m \times d_v}$:

$$ A = \mathrm{softmax}(Q K^T)\, V $$

Linearized attention replaces the exponential kernel $\exp(q \cdot k)$ with an inner product in a feature space $\varphi$, yielding:

$$ A \approx \varphi(Q)\,\big(\varphi(K)^T V\big) $$
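To make the linearization concrete, here is a toy NumPy sketch (an illustration, not code from the paper) comparing exact softmax cross-attention with a first-order approximation using the feature map $\varphi(x) = (1, x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_k, d_v = 8, 16, 4, 4
Q = rng.normal(size=(n, d_k)) / np.sqrt(d_k)   # keep dot products small
K = rng.normal(size=(m, d_k)) / np.sqrt(d_k)
V = rng.normal(size=(m, d_v))

# Exact softmax cross-attention (materializes the n x m matrix).
logits = Q @ K.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
A_exact = (P / P.sum(axis=1, keepdims=True)) @ V

# First-order linearization: exp(q.k) ~ 1 + q.k, i.e. phi(x) = [1, x].
phi_Q = np.concatenate([np.ones((n, 1)), Q], axis=1)
phi_K = np.concatenate([np.ones((m, 1)), K], axis=1)
num = phi_Q @ (phi_K.T @ V)              # (d_k+1, d_v) summary, never n x m
den = phi_Q @ phi_K.sum(axis=0)          # normalizer per query
A_lin1 = num / den[:, None]

print("max |error|:", np.abs(A_exact - A_lin1).max())
```

Because the feature map appends a constant component, the zeroth-order term of the Taylor expansion is handled by the same matrix products as the linear term.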

Second-order linearized cross-attention, as introduced by Mercat, applies a second-order Taylor expansion to the exponential:

$$ \exp(x) \approx 1 + x + \frac{x^2}{2} $$

where $x = (\widetilde{Q} \widetilde{K}^T) / (\alpha \sqrt{d_k})$, $\widetilde{Q}$ and $\widetilde{K}$ are layer-normalized queries and keys, and $\alpha$ is a scalar controlling the scale of $x$ to maintain approximation accuracy. Both numerator and denominator in attention are approximated in this manner.

The second-order term is efficiently formulated so that the quadratic query–key interaction factors through a key–value summary:

$$ \sum_{j} \frac{x_{ij}^{2}}{2}\, v_j \;=\; \frac{1}{2\,\alpha^{2} d_k} \sum_{a,b} \widetilde{q}_{ia}\, \widetilde{q}_{ib} \Big( \sum_{j} \widetilde{k}_{ja}\, \widetilde{k}_{jb}\, v_j \Big), $$

defining the second-order key summary $S_2 \in \mathbb{R}^{d_k \times d_k \times d_v}$ with entries $S_2[a,b,:] = \sum_{j} \widetilde{k}_{ja}\, \widetilde{k}_{jb}\, v_j$.
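A minimal NumPy sketch of this factorization (variable names such as S2 and alpha are illustrative, not from the paper) builds the summary tensor once and then contracts it with a query, matching the direct $O(m)$ computation:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_k, d_v, alpha = 32, 4, 4, 1.0
K = rng.normal(size=(m, d_k))            # stand-in for layer-normalized keys
V = rng.normal(size=(m, d_v))
q = rng.normal(size=(d_k,))              # one layer-normalized query
scale = alpha * np.sqrt(d_k)

# Second-order key summary: S2[a, b, c] = sum_j K[j, a] * K[j, b] * V[j, c]
S2 = np.einsum('ja,jb,jc->abc', K, K, V)             # (d_k, d_k, d_v)

# Second-order contribution for this query, via the summary tensor.
via_summary = 0.5 * np.einsum('a,b,abc->c', q, q, S2) / scale**2

# Same quantity computed directly in O(m), for verification.
x = (K @ q) / scale
direct = ((x**2) / 2) @ V
print(np.allclose(via_summary, direct))              # True
```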

2. Algorithmic Implementation

The computation proceeds without forming the explicit $n \times m$ attention matrix, relying on intermediate key–value summary matrices and tensors. The steps are:

  1. Layer Normalization and Scaling: $\widetilde{Q} = \mathrm{LN}(Q)$, $\widetilde{K} = \mathrm{LN}(K)$; scale the dot products by $1/(\alpha\sqrt{d_k})$.
  2. First-Order Summary: $S_1 = \widetilde{K}^T V \in \mathbb{R}^{d_k \times d_v}$.
  3. First-Order Numerator: $N_1 = \widetilde{Q} S_1 / (\alpha\sqrt{d_k})$.
  4. Second-Order Summary: For each index pair $(a, b)$, $S_2[a, b, :] = \sum_j \widetilde{k}_{ja}\, \widetilde{k}_{jb}\, v_j$.
  5. Second-Order Numerator: For each query $\widetilde{q}_i$, $N_2[i, :] = \frac{1}{2\,\alpha^2 d_k} \sum_{a,b} \widetilde{q}_{ia}\, \widetilde{q}_{ib}\, S_2[a, b, :]$.
  6. Zeroth-Order Numerator: $N_0 = \sum_j v_j$ (identical for every query).
  7. Combine Numerators: $N = N_0 + N_1 + N_2$.
  8. Compute Denominator: Repeat steps 2–7 with $v_j$ set to $1$ for all $j$, giving $D$.
  9. Final Attention Output: For each query $i$, $A_i = N_i / D_i$.

Both the first-order summary $S_1$ and the second-order summary $S_2$ can be constructed in a single streaming pass over the keys; chunking is possible under memory constraints.
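Putting the steps together, the following NumPy sketch is one plausible reading of the algorithm (function and variable names are my own; the exact-softmax comparison applies the same layer normalization and $\alpha\sqrt{d_k}$ scaling so the two are comparable). It never materializes the $n \times m$ attention matrix:

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # Affine-free layer normalization over the feature dimension.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def second_order_linear_cross_attention(Q, K, V, alpha=1.0):
    """Second-order Taylor approximation of softmax cross-attention.

    Memory is O(d_k^2 * d_v) for the summaries, never O(n * m).
    """
    n, d_k = Q.shape
    m, d_v = V.shape
    Qn, Kn = layer_norm(Q), layer_norm(K)
    scale = alpha * np.sqrt(d_k)

    # Key-value summaries (steps 2 and 4), built in one pass over the keys.
    S1 = Kn.T @ V                                  # (d_k, d_v)
    S2 = np.einsum('ja,jb,jc->abc', Kn, Kn, V)     # (d_k, d_k, d_v)

    # Numerator: zeroth-, first-, and second-order contributions (steps 3, 5-7).
    N0 = V.sum(axis=0)                                            # (d_v,), same for every query
    N1 = (Qn @ S1) / scale                                        # (n, d_v)
    N2 = 0.5 * np.einsum('ia,ib,abc->ic', Qn, Qn, S2) / scale**2  # (n, d_v)
    N = N0[None, :] + N1 + N2

    # Denominator: same expansion with every value replaced by 1 (step 8).
    D0 = float(m)
    D1 = (Qn @ Kn.sum(axis=0)) / scale                            # (n,)
    D2 = 0.5 * np.einsum('ia,ib,ab->i', Qn, Qn, Kn.T @ Kn) / scale**2
    D = D0 + D1 + D2

    return N / D[:, None]                                         # (n, d_v)

# Usage: compare against exact softmax attention on random data.
rng = np.random.default_rng(2)
n, m, d_k, d_v = 4, 64, 8, 8
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(m, d_k)), rng.normal(size=(m, d_v))
A_approx = second_order_linear_cross_attention(Q, K, V, alpha=2.0)

Qn, Kn = layer_norm(Q), layer_norm(K)
P = np.exp((Qn @ Kn.T) / (2.0 * np.sqrt(d_k)))
A_exact = (P / P.sum(axis=1, keepdims=True)) @ V
print("max |error|:", np.abs(A_exact - A_approx).max())
```

Note that with the second-order truncation $1 + x + x^2/2 \ge 0.5$ for all real $x$, each denominator term stays positive, which is one reason the quadratic term improves on the purely linear kernel.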

3. Computational Complexity and Efficiency

Approximation    Time Complexity             Memory Complexity
First-order      $O((n + m)\, d_k d_v)$      $O(d_k d_v)$
Second-order     $O((n + m)\, d_k^2 d_v)$    $O(d_k^2 d_v)$
Softmax          $O(n m\, (d_k + d_v))$      $O(n m)$

Second-order linearized attention induces an extra factor of $d_k$ in both time and memory over first-order, due to the $d_k \times d_k \times d_v$ summary tensor, but remains more efficient than softmax when $d_k^2 \ll m$.
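As a rough worked example with made-up dimensions, the dominant memory terms compare as follows:

```python
# Hypothetical sizes: a long cross-attention over m = 65536 keys.
n, m, d_k, d_v = 1024, 65536, 64, 64

softmax_mem = n * m                      # explicit n x m attention matrix
first_order_mem = d_k * d_v              # (d_k, d_v) key-value summary
second_order_mem = d_k * d_k * d_v       # (d_k, d_k, d_v) summary tensor

print(softmax_mem, first_order_mem, second_order_mem)
# 67108864, 4096, 262144 -> the second-order summary is d_k times the
# first-order one, but still ~256x smaller than the attention matrix here.
```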

4. Kernel Feature Maps

Mercat’s formulation for the first-order mapping is $\varphi(x) = (1,\, x)$, an augmented feature vector covering both constant and linear contributions in one structure. The second-order term builds on all pairwise products of query vector components multiplied by the corresponding pairwise key–value summaries.

This feature map approach allows for tractable computation of higher-order approximations. A plausible implication is that additional orders could be implemented, but with quickly rising cost.
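In the same spirit, the second-order truncation itself can be written as an inner product of explicit feature vectors. The sketch below (my own construction for illustration, not notation from the paper) verifies that the quadratic truncation of $\exp(u \cdot w)$ equals such an inner product:

```python
import numpy as np

def phi2(u):
    # Second-order feature map: 1 + u.w + (u.w)^2/2 == phi2(u) . phi2(w),
    # since (u.w)^2/2 = (outer(u,u)/sqrt(2)) . (outer(w,w)/sqrt(2)) flattened.
    return np.concatenate([np.ones(1), u, np.outer(u, u).ravel() / np.sqrt(2.0)])

rng = np.random.default_rng(3)
d_k, alpha = 6, 2.0
scale = alpha * np.sqrt(d_k)
q = rng.normal(size=d_k) / scale          # fold the 1/(alpha*sqrt(d_k)) scaling into the query
k = rng.normal(size=d_k)

x = q @ k
taylor = 1 + x + x**2 / 2
kernel = phi2(q) @ phi2(k)
print(np.isclose(taylor, kernel))         # True: the expansion is an inner product
```

The feature vector has dimension $1 + d_k + d_k^2$, which is where the extra factor of $d_k$ in the complexity table comes from.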

5. Stability, Normalization, and Limitations

Stability requires careful normalization: both queries and keys are layer-normalized (affine-free) and the dot products are scaled by $1/(\alpha\sqrt{d_k})$ (with $\alpha$ chosen empirically) to ensure that the scaled dot products $x$ remain in the small-magnitude regime where the Taylor expansion is accurate. If $|x|$ grows much beyond order one, the Taylor approximation degrades, and reverting to softmax or increasing $\alpha$ is advised. Higher-order expansions are theoretically possible but become computationally intractable beyond the second order unless $d_k$ is very small compared to the sequence length.
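A small sketch of the kind of check this suggests (the threshold and function name are illustrative choices, not values from the paper): monitor the scaled dot products on a sample of query–key pairs and rescale or fall back when they leave the trusted regime.

```python
import numpy as np

def taylor_ok(Qn, Kn, alpha, d_k, threshold=1.0, sample=1024, seed=0):
    # Sample query-key pairs and check that |x| stays within the trusted range.
    rng = np.random.default_rng(seed)
    iq = rng.integers(0, Qn.shape[0], size=sample)
    ik = rng.integers(0, Kn.shape[0], size=sample)
    x = np.einsum('sd,sd->s', Qn[iq], Kn[ik]) / (alpha * np.sqrt(d_k))
    return np.abs(x).max() <= threshold

# If this returns False, either increase alpha or revert to exact softmax.
```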

6. Empirical Evaluation and Recommendations

No empirical evaluation on real-world data sets (such as MT or image–text tasks) is reported; existing tests are on random data, and no performance benchmarks are provided versus either softmax or first-order linear attention. A plausible implication is that further research is required to assess impact on accuracy, perplexity, and runtime/memory usage in practical settings (Mercat, 2020).

Recommended usage scenarios are those in which $d_k$ is small to moderate and $m$ (the number of keys) is very large, so that $d_k^2$ remains much smaller than $m$. Under strict memory requirements, or where only a coarse approximation to the softmax is tolerable, first-order schemes are preferred. Monitoring the scale of $x$ is necessary to avoid breakdown of the Taylor approximation.

7. Context and Significance

Second-order linearized cross-attention extends the "linear transformer" line of work (Katharopoulos et al.; Shen et al.), which sought efficient attention via kernel tricks and low-rank approximations (Mercat, 2020). The addition of second-order terms recovers more of the nonlinearity of softmax, potentially yielding more faithful approximations in long-sequence contexts. This approach holds particular promise when system constraints make quadratic complexity in sequence length infeasible, though the trade-off between accuracy and resource consumption on real-world tasks remains to be fully characterized.

References (1)
