
TaylorShift: Linear Self-Attention via Taylor Expansion

Updated 20 April 2026
  • TaylorShift is a self-attention mechanism that leverages a Taylor expansion to achieve linear time complexity while maintaining full token-to-token interactions.
  • It replaces the standard Softmax with a polynomial-based normalization, thereby avoiding the quadratic bottleneck typical of traditional attention models.
  • Efficient algebraic factorization and stability optimizations make TaylorShift suitable for long-sequence tasks and high-resolution vision applications.

TaylorShift is a self-attention mechanism for Transformers that achieves linear time and memory complexity in sequence length while retaining full token-to-token interactions. Based on a low-order Taylor expansion of the exponential in Softmax, it replaces standard Softmax-based attention with a polynomial-based normalization. This reformulation avoids the quadratic bottleneck of classic attention and comes with a precise analysis of the crossover points between its quadratic and linear formulations, making it practically attractive for long-sequence tasks and high-resolution vision applications (Nauen et al., 2024, Nagaraju et al., 2024).

1. Mathematical Construction of TaylorShift

The canonical attention mechanism in Transformers operates via Softmax normalization:

$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

TaylorShift replaces the exponential with its $k$-th order Taylor expansion and normalizes the sum:

$$\operatorname{T\text{-}SM}^{(k)}(x) := \text{normalize}\left(\sum_{n=0}^{k} \frac{x^{\odot n}}{n!}\right)$$

where $x^{\odot n}$ is the elementwise $n$-th power and normalization is division by the $\ell_1$ norm. For $k=2$ (the default), the expansion is $1 + x + \tfrac{1}{2}x^{\odot 2}$. Attention then becomes

$$Y = \operatorname{T\text{-}SM}^{(k)}\!\left(d^{-\frac{1}{2}} Q K^\top\right) V$$

where $Q, K, V \in \mathbb{R}^{N \times d}$, $N$ is the sequence length, and $d$ is the embedding dimension. For even $k$, the expansion is guaranteed to be positive, so the normalized weights remain a valid attention distribution (Nauen et al., 2024, Nagaraju et al., 2024).
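The following is a minimal NumPy sketch of the direct (quadratic) form for illustration; function and variable names are ours and not taken from the reference implementation:

```python
import math
import numpy as np

def taylor_softmax(a, k=2):
    # k-th order Taylor-Softmax: sum_{n=0}^{k} a^{.n} / n!, then division by the row sum
    # (for even k all entries are positive, so this is the l1 normalization).
    num = sum(a ** n / math.factorial(n) for n in range(k + 1))
    return num / num.sum(axis=-1, keepdims=True)

def taylorshift_attention_direct(Q, K, V, k=2):
    # Direct (quadratic) TaylorShift: materializes the full N x N score matrix.
    d = Q.shape[-1]
    A = taylor_softmax(Q @ K.T / math.sqrt(d), k=k)   # (N, N) attention weights
    return A @ V
```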

2. Efficient Algorithmic Formulation

A direct materialization of the attention matrix $\operatorname{T\text{-}SM}^{(k)}(d^{-1/2} Q K^\top)$ scales quadratically, $O(N^2 d)$. However, TaylorShift leverages an algebraic factorization to reduce this to linear in $N$. The core identity for the quadratic term is

$$\left(QK^\top\right)^{\odot 2} V = \tilde{Q}\left(\tilde{K}^\top V\right),$$

where $\tilde{Q}, \tilde{K} \in \mathbb{R}^{N \times d^2}$ denote the flattened outer products of the rows of $Q$ and $K$ with themselves. This structure allows the numerator and denominator of the attention to be computed term by term:

  • Constant term: a single sum over the rows of $V$, costing $O(Nd)$
  • Linear term: $Q\,(K^\top V)$, costing $O(Nd^2)$
  • Quadratic term: $\tilde{Q}\,(\tilde{K}^\top V)$, costing $O(Nd^3)$

The efficient TaylorShift workflow (Nauen et al., 2024) therefore tensorizes the rows of $Q$ and $K$, accumulates $\tilde{K}^\top [V, \mathbf{1}]$ once, and applies $\tilde{Q}$ to obtain numerator and denominator in a single linear pass; a sketch is given below. Each main operation scales as $O(Nd^3)$ or less. The interface and normalization remain identical to the quadratic form of TaylorShift, allowing seamless switching between the two.
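A minimal NumPy sketch of this workflow for $k=2$ (variable names are illustrative and not taken from the reference implementation):

```python
import numpy as np

def taylorshift_attention_efficient(Q, K, V):
    # Efficient (linear-in-N) TaylorShift for k = 2, following the factorization above.
    # Q, K, V have shape (N, d).
    N, d = Q.shape
    Q = Q / d ** 0.25                                  # split the 1/sqrt(d) scaling between Q and K
    K = K / d ** 0.25
    Vo = np.concatenate([V, np.ones((N, 1))], axis=1)  # carry numerator and denominator jointly

    # Tensorized rows (flattened outer products), shape (N, d^2).
    Qt = np.einsum('ni,nj->nij', Q, Q).reshape(N, d * d)
    Kt = np.einsum('ni,nj->nij', K, K).reshape(N, d * d)

    # Constant, linear, and quadratic Taylor terms, each linear in N.
    const = Vo.sum(axis=0, keepdims=True)              # O(N d)
    lin   = Q @ (K.T @ Vo)                             # O(N d^2)
    quad  = 0.5 * (Qt @ (Kt.T @ Vo))                   # O(N d^3)

    out = const + lin + quad                           # [numerator | denominator], shape (N, d+1)
    return out[:, :-1] / out[:, -1:]                   # normalize by the denominator column
```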

3. Computational Complexity and Crossover Regimes

The TaylorShift mechanism rigorously characterizes its complexity trade-offs:

  • Direct (quadratic) TaylorShift:
    • FLOPs: $O(N^2 d)$
    • Memory: $O(N^2 + Nd)$
  • Efficient (linear) TaylorShift:
    • FLOPs: $O(N d^3)$
    • Memory: $O(N d^2)$

Key crossover points:

  • Compute crossover: equating the FLOP counts shows that the efficient variant becomes cheaper once $N$ exceeds a threshold $N_0$ on the order of $d^2$
  • Memory crossover: the efficient variant uses less memory beyond a threshold that is also on the order of $d^2$ but is reached at shorter sequences than the compute crossover in practice

Empirically, memory savings emerge at roughly $N \approx 800$ tokens and wall-clock speedups at roughly $N \approx 1{,}700$ tokens for typical per-head embedding dimensions (Nauen et al., 2024, Nagaraju et al., 2024).
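As a rough illustration (the exact analysis in the paper includes constant factors that these asymptotic counts ignore), the crossover for an assumed per-head dimension can be estimated as:

```python
def direct_flops(N, d):
    return N * N * d          # O(N^2 d): forming and applying the N x N score matrix

def efficient_flops(N, d):
    return N * d ** 3         # O(N d^3): dominated by the d^2-dimensional tensorized products

d = 32                        # assumed per-head embedding dimension
for N in (512, 1024, 2048, 4096):
    cheaper = "efficient" if efficient_flops(N, d) < direct_flops(N, d) else "direct"
    print(f"N={N}: {cheaper} variant is cheaper")  # crossover near N ~ d^2
```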

4. Empirical Evaluation and Benchmarks

TaylorShift has been validated in both natural language and vision domains:

  • Classification (Long-Sequence) Benchmarks: On CIFAR-pixel, IMDB-byte, Long ListOps, ImageNet-Tiny, and ImageNet-Small, TaylorShift matches or slightly exceeds vanilla attention and surpasses prior linear attention algorithms on most tasks.
| Model | CIFAR | IMDB | ListOps | ImageNet-Tiny | ImageNet-Small | Average |
|---|---|---|---|---|---|---|
| Linformer | 29.2 | 58.1 | --- | 64.3 | 76.3 | 57.0 |
| Performer* | 34.2 | 65.6 | 35.4 | 62.0 | 67.1 | 52.9 |
| Reformer | 44.8 | 63.9 | 47.6 | 73.6 | 76.2 | 61.2 |
| Nyströmformer | 49.4 | 65.6 | 44.5 | 75.0 | 78.3 | 62.6 |
| Transformer | 44.7 | 65.8 | 46.0 | 75.6 | 79.1 | 62.2 |
| TaylorShift (Ours) | 47.6 | 66.0 | 45.6 | 75.0 | 79.3 | 62.7 |
  • Vision Super-Resolution (SR): TaylorShift enables pixel-level (1×1 patch) attention with substantially larger windows in SwinIR, reducing VRAM by up to 60%. In the reported window configuration, VRAM drops from 78.50 GB (SwinIR) to 49.10 GB (TaylorSwinIR) (Nagaraju et al., 2024). Quality, as measured by PSNR and SSIM, remains at or above SOTA baselines:
| Method | Set5 PSNR/SSIM | Urban100 PSNR/SSIM | VRAM (GB) | Reduction |
|---|---|---|---|---|
| SwinIR | 38.35 / 0.9620 | 33.49 / 0.9393 | 78.50 | 0% |
| TaylorSwinIR (ours) | 38.46 / 0.9627 | 33.71 / 0.9417 | 49.10 | 37% |

Across five SR datasets, TaylorSwinIR consistently matches or surpasses SwinIR on both PSNR and SSIM, demonstrating the feasibility of full-range attention at linear cost.

5. Integration into Transformer Architectures

TaylorShift is a drop-in replacement for Softmax-based attention in both encoder-only and encoder-decoder Transformers, sharing the same normalization and interface as standard attention. In SwinIR-like architectures, TaylorShift allows the use of 1×1 patch embedding (previously impractical due to quadratic scaling) for pixel-level attention, substantially increasing the contextual receptive field. It integrates naturally with existing windowing, local-global, and positional encoding schemes without modification (Nauen et al., 2024, Nagaraju et al., 2024).

The attention block with TaylorShift executes as:

  • For $N < N_0$ (below the crossover): use the direct quadratic TaylorShift.
  • For $N \geq N_0$: use the efficient tensorized variant, as sketched below.
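A minimal sketch of this switching logic, reusing the routines sketched in Sections 1 and 2 (the threshold `n0` is an assumed hyperparameter, not a value prescribed by the papers):

```python
def taylorshift_attention(Q, K, V, n0=1024):
    # Both branches compute the same mathematical function; only the cost profile differs.
    N = Q.shape[0]
    if N < n0:
        return taylorshift_attention_direct(Q, K, V)      # quadratic: cheaper for short inputs
    return taylorshift_attention_efficient(Q, K, V)       # linear in N: cheaper for long inputs
```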

6. Practical Considerations and Stability

  • Taylor order: $k=2$ is empirically optimal, balancing computational overhead with attention expressivity and regularization.
  • Input normalization: per-token $\ell_2$ normalization with a learnable temperature is essential for numerical stability; improper scaling leads to overflow or non-convergence.
  • Multi-head scaling: with $h$ heads of per-head dimension $d/h$, the total cost of efficient TaylorShift scales as $O(N d^3 / h^2)$, so increasing the number of heads reduces rather than increases its overhead, and typical multi-head configurations are well supported.
  • Switching: Selection between direct and efficient mode is automatic based on sequence length; both methods implement identical mathematical forms and output interfaces, ensuring compatibility.
  • Error bounds: The polynomial expansion yields provable bounds on the approximation error of Softmax, and the attention weights remain strictly positive and normalized.
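A sketch of the input normalization described above, assuming row-wise $\ell_2$ normalization of the tokens and a scalar learnable temperature (the parameter name `temperature` is illustrative):

```python
import numpy as np

def normalize_tokens(X, temperature=1.0, eps=1e-6):
    # Per-token l2 normalization scaled by a (learnable) temperature; keeps the entries of
    # QK^T bounded so the polynomial expansion remains numerically stable.
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    return temperature * X / (norms + eps)
```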

7. Significance and Limitations

TaylorShift advances the landscape of efficient Transformer attention by enabling full-range, dense interactions at linear sequence scaling. Unlike sparse or kernel-based methods, it does not compromise token-to-token connectivity or introduce stateful recurrence, preserving model expressiveness in long-sequence regimes. Its use in pixel-level SR demonstrates capability in high-resolution, memory-constrained tasks, setting new practical baselines for efficiency and accuracy (Nauen et al., 2024, Nagaraju et al., 2024).

A plausible implication is that, for sufficiently large $N$, TaylorShift offers the best-available trade-off between resource consumption and modeling power without loss of accuracy. Remaining limitations include the bottleneck shifting to the embedding dimension $d$, especially in regimes where $d$ is large, and the necessity of careful normalization for stable training dynamics. For very short sequences, TaylorShift reverts to quadratic scaling, preserving compatibility with traditional attention settings.

References:

Nauen, T., et al. (2024). "TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax".

Nagaraju et al. (2024). "A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift".
