
TaylorShift: Linear Self-Attention via Taylor Expansion

Updated 20 April 2026
  • TaylorShift is a self-attention mechanism that leverages a Taylor expansion to achieve linear time complexity while maintaining full token-to-token interactions.
  • It replaces the standard Softmax with a polynomial-based normalization, thereby avoiding the quadratic bottleneck typical of traditional attention models.
  • Efficient algebraic factorization and stability optimizations make TaylorShift suitable for long-sequence tasks and high-resolution vision applications.

TaylorShift is a self-attention mechanism for Transformers that achieves linear time and memory complexity in sequence length while retaining full token-to-token interactions. Based on a low-order Taylor expansion of the exponential in Softmax, it replaces standard Softmax-based attention with a polynomial-based normalization. This reformulation avoids the quadratic bottleneck of classic attention and comes with a precise analysis of the crossover points between its quadratic and linear formulations, making it practically attractive for long-sequence tasks and high-resolution vision applications (Nauen et al., 2024, Nagaraju et al., 2024).

1. Mathematical Construction of TaylorShift

The canonical attention mechanism in Transformers operates via Softmax normalization:

$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

TaylorShift replaces the exponential with its $k$-th order Taylor expansion and normalizes the sum:

$$\operatorname{T\text{-}SM}^{(k)}(x) := \text{normalize}\left(\sum_{n=0}^{k} \frac{x^{\odot n}}{n!}\right)$$

where $x^{\odot n}$ is the elementwise $n$-th power and normalization is division by the $\ell_1$ norm. For $k=2$ (the default), the expansion is $1 + x + \tfrac{1}{2}x^{\odot 2}$. Attention then becomes

$$Y = \operatorname{T\text{-}SM}^{(k)}\!\left(d^{-\frac{1}{2}} Q K^\top\right) V$$

where $Q, K, V \in \mathbb{R}^{N \times d}$, $N$ is the sequence length, and $d$ is the embedding dimension. For even $k$, the expansion is guaranteed to be positive, so the normalized weights remain a valid attention distribution (Nauen et al., 2024, Nagaraju et al., 2024).
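The following is a minimal NumPy sketch of the direct (quadratic) form for illustration; function and variable names are ours and not taken from the reference implementation:

```python
import math
import numpy as np

def taylor_softmax(a, k=2):
    # k-th order Taylor-Softmax: sum_{n=0}^{k} a^{.n} / n!, then division by the row sum
    # (for even k all entries are positive, so this is the l1 normalization).
    num = sum(a ** n / math.factorial(n) for n in range(k + 1))
    return num / num.sum(axis=-1, keepdims=True)

def taylorshift_attention_direct(Q, K, V, k=2):
    # Direct (quadratic) TaylorShift: materializes the full N x N score matrix.
    d = Q.shape[-1]
    A = taylor_softmax(Q @ K.T / math.sqrt(d), k=k)   # (N, N) attention weights
    return A @ V
```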

2. Efficient Algorithmic Formulation

A direct materialization of the attention matrix $\operatorname{T\text{-}SM}^{(k)}(d^{-1/2} Q K^\top)$ scales quadratically, $O(N^2 d)$. However, TaylorShift leverages an algebraic factorization to reduce this to linear in $N$. The core identity for the quadratic term is

$$\left(QK^\top\right)^{\odot 2} V = \tilde{Q}\left(\tilde{K}^\top V\right),$$

where $\tilde{Q}, \tilde{K} \in \mathbb{R}^{N \times d^2}$ denote the flattened outer products of the rows of $Q$ and $K$ with themselves. This structure allows the numerator and denominator of the attention to be computed term by term:

  • Constant term: a single sum over the rows of $V$, costing $O(Nd)$
  • Linear term: $Q\,(K^\top V)$, costing $O(Nd^2)$
  • Quadratic term: $\tilde{Q}\,(\tilde{K}^\top V)$, costing $O(Nd^3)$

The efficient TaylorShift workflow (Nauen et al., 2024) therefore tensorizes the rows of $Q$ and $K$, accumulates $\tilde{K}^\top [V, \mathbf{1}]$ once, and applies $\tilde{Q}$ to obtain numerator and denominator in a single linear pass; a sketch is given below. Each main operation scales as $O(Nd^3)$ or less. The interface and normalization remain identical to the quadratic form of TaylorShift, allowing seamless switching between the two.
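A minimal NumPy sketch of this workflow for $k=2$ (variable names are illustrative and not taken from the reference implementation):

```python
import numpy as np

def taylorshift_attention_efficient(Q, K, V):
    # Efficient (linear-in-N) TaylorShift for k = 2, following the factorization above.
    # Q, K, V have shape (N, d).
    N, d = Q.shape
    Q = Q / d ** 0.25                                  # split the 1/sqrt(d) scaling between Q and K
    K = K / d ** 0.25
    Vo = np.concatenate([V, np.ones((N, 1))], axis=1)  # carry numerator and denominator jointly

    # Tensorized rows (flattened outer products), shape (N, d^2).
    Qt = np.einsum('ni,nj->nij', Q, Q).reshape(N, d * d)
    Kt = np.einsum('ni,nj->nij', K, K).reshape(N, d * d)

    # Constant, linear, and quadratic Taylor terms, each linear in N.
    const = Vo.sum(axis=0, keepdims=True)              # O(N d)
    lin   = Q @ (K.T @ Vo)                             # O(N d^2)
    quad  = 0.5 * (Qt @ (Kt.T @ Vo))                   # O(N d^3)

    out = const + lin + quad                           # [numerator | denominator], shape (N, d+1)
    return out[:, :-1] / out[:, -1:]                   # normalize by the denominator column
```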

3. Computational Complexity and Crossover Regimes

The TaylorShift mechanism rigorously characterizes its complexity trade-offs:

  • Direct (quadratic) TaylorShift:
    • FLOPs: $O(N^2 d)$
    • Memory: $O(N^2 + Nd)$
  • Efficient (linear) TaylorShift:
    • FLOPs: $O(N d^3)$
    • Memory: $O(N d^2)$

Key crossover points:

  • Compute crossover: equating the FLOP counts shows that the efficient variant becomes cheaper once $N$ exceeds a threshold $N_0$ on the order of $d^2$
  • Memory crossover: the efficient variant uses less memory beyond a threshold that is also on the order of $d^2$ but is reached at shorter sequences than the compute crossover in practice

Empirically, memory savings emerge at roughly $N \approx 800$ tokens and wall-clock speedups at roughly $N \approx 1{,}700$ tokens for typical per-head embedding dimensions (Nauen et al., 2024, Nagaraju et al., 2024).
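As a rough illustration (the exact analysis in the paper includes constant factors that these asymptotic counts ignore), the crossover for an assumed per-head dimension can be estimated as:

```python
def direct_flops(N, d):
    return N * N * d          # O(N^2 d): forming and applying the N x N score matrix

def efficient_flops(N, d):
    return N * d ** 3         # O(N d^3): dominated by the d^2-dimensional tensorized products

d = 32                        # assumed per-head embedding dimension
for N in (512, 1024, 2048, 4096):
    cheaper = "efficient" if efficient_flops(N, d) < direct_flops(N, d) else "direct"
    print(f"N={N}: {cheaper} variant is cheaper")  # crossover near N ~ d^2
```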

4. Empirical Evaluation and Benchmarks

TaylorShift has been validated in both natural language and vision domains:

  • Classification (Long-Sequence) Benchmarks: On CIFAR-pixel, IMDB-byte, Long ListOps, ImageNet-Tiny, and ImageNet-Small, TaylorShift matches or slightly exceeds vanilla attention and surpasses prior linear attention algorithms on most tasks.
| Model | CIFAR | IMDB | ListOps | ImageNet-Tiny | ImageNet-Small | Average |
|---|---|---|---|---|---|---|
| Linformer | 29.2 | 58.1 | --- | 64.3 | 76.3 | 57.0 |
| Performer* | 34.2 | 65.6 | 35.4 | 62.0 | 67.1 | 52.9 |
| Reformer | 44.8 | 63.9 | 47.6 | 73.6 | 76.2 | 61.2 |
| Nyströmformer | 49.4 | 65.6 | 44.5 | 75.0 | 78.3 | 62.6 |
| Transformer | 44.7 | 65.8 | 46.0 | 75.6 | 79.1 | 62.2 |
| TaylorShift (Ours) | 47.6 | 66.0 | 45.6 | 75.0 | 79.3 | 62.7 |
  • Vision Super-Resolution (SR): TaylorShift enables pixel-level (1×1 patch) attention with substantially larger windows in SwinIR, reducing VRAM by up to 60%. In the reported window configuration, VRAM drops from 78.50 GB (SwinIR) to 49.10 GB (TaylorSwinIR) (Nagaraju et al., 2024). Quality, as measured by PSNR and SSIM, remains at or above SOTA baselines:
| Method | Set5 PSNR/SSIM | Urban100 PSNR/SSIM | VRAM (GB) | Reduction |
|---|---|---|---|---|
| SwinIR | 38.35 / 0.9620 | 33.49 / 0.9393 | 78.50 | 0% |
| TaylorSwinIR (ours) | 38.46 / 0.9627 | 33.71 / 0.9417 | 49.10 | 37% |

Across five SR datasets, TaylorSwinIR consistently matches or surpasses SwinIR on both PSNR and SSIM, demonstrating the feasibility of full-range attention at linear cost.

5. Integration into Transformer Architectures

TaylorShift is a drop-in replacement for Softmax-based attention in both encoder-only and encoder-decoder Transformers, sharing the same normalization and interface as standard attention. In SwinIR-like architectures, TaylorShift allows the use of 1×1 patch embedding (previously impractical due to quadratic scaling) for pixel-level attention, substantially increasing the contextual receptive field. It integrates naturally with existing windowing, local-global, and positional encoding schemes without modification (Nauen et al., 2024, Nagaraju et al., 2024).

The attention block with TaylorShift executes as:

  • For $N < N_0$ (below the crossover): use the direct quadratic TaylorShift.
  • For $N \geq N_0$: use the efficient tensorized variant, as sketched below.
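A minimal sketch of this switching logic, reusing the routines sketched in Sections 1 and 2 (the threshold `n0` is an assumed hyperparameter, not a value prescribed by the papers):

```python
def taylorshift_attention(Q, K, V, n0=1024):
    # Both branches compute the same mathematical function; only the cost profile differs.
    N = Q.shape[0]
    if N < n0:
        return taylorshift_attention_direct(Q, K, V)      # quadratic: cheaper for short inputs
    return taylorshift_attention_efficient(Q, K, V)       # linear in N: cheaper for long inputs
```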

6. Practical Considerations and Stability

  • Taylor order: $k=2$ is empirically optimal, balancing computational overhead with attention expressivity and regularization.
  • Input normalization: per-token $\ell_2$ normalization with a learnable temperature is essential for numerical stability; improper scaling leads to overflow or non-convergence.
  • Multi-head scaling: with $h$ heads of per-head dimension $d/h$, the total cost of efficient TaylorShift scales as $O(N d^3 / h^2)$, so increasing the number of heads reduces rather than increases its overhead, and typical multi-head configurations are well supported.
  • Switching: Selection between direct and efficient mode is automatic based on sequence length; both methods implement identical mathematical forms and output interfaces, ensuring compatibility.
  • Error bounds: The polynomial expansion yields provable bounds on the approximation error of Softmax, and the attention weights remain strictly positive and normalized.
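A sketch of the input normalization described above, assuming row-wise $\ell_2$ normalization of the tokens and a scalar learnable temperature (the parameter name `temperature` is illustrative):

```python
import numpy as np

def normalize_tokens(X, temperature=1.0, eps=1e-6):
    # Per-token l2 normalization scaled by a (learnable) temperature; keeps the entries of
    # QK^T bounded so the polynomial expansion remains numerically stable.
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    return temperature * X / (norms + eps)
```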

7. Significance and Limitations

TaylorShift advances the landscape of efficient Transformer attention by enabling full-range, dense interactions at linear sequence scaling. Unlike sparse or kernel-based methods, it does not compromise token-to-token connectivity or introduce stateful recurrence, preserving model expressiveness in long-sequence regimes. Its use in pixel-level SR demonstrates capability in high-resolution, memory-constrained tasks, setting new practical baselines for efficiency and accuracy (Nauen et al., 2024, Nagaraju et al., 2024).

A plausible implication is that, for sufficiently large $N$, TaylorShift offers the best-available trade-off between resource consumption and modeling power without loss of accuracy. Remaining limitations include the bottleneck shifting to the embedding dimension $d$, especially in regimes where $d$ is large, and the necessity of careful normalization for stable training dynamics. For very short sequences, TaylorShift reverts to quadratic scaling, preserving compatibility with traditional attention settings.

References:

Nauen, T., et al. (2024). "TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax".

Nagaraju et al. (2024). "A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift".
