Linear Attention Neural Operator
- Linear Attention Neural Operator is a neural operator architecture that refactors classical self-attention into scalable, linear-complexity operations for PDE applications.
- It employs low-rank factorizations, agent-based aggregations, and slice-wise attention to significantly cut compute and memory costs while preserving accuracy.
- Empirical results show parameter reductions of roughly 37–70% across benchmarks and state-of-the-art performance on tasks such as Airfoil and Navier-Stokes, highlighting its practical efficiency.
A Linear Attention Neural Operator (LinearNO) is a class of neural operator architectures that reformulate the classical self-attention mechanism—ubiquitous in Transformer models—to achieve linear complexity in both computational time and memory. This design enables scalable and accurate learning of mappings between function spaces, typically for data-driven solvers of partial differential equations (PDEs), including tasks over high-resolution unstructured meshes and multidimensional grids. By leveraging low-rank factorizations, agent-based global interactions, or physics-derived slicing-and-deslicing procedures, LinearNO operators maintain the expressive fidelity of full attention while reducing parameter counts and computational resources. Recent works have demonstrated that both physics-inspired attention (Physics-Attention) and agent-based mechanisms can be recast as special cases of canonical linear attention, yielding substantial gains in efficiency and generalization across a spectrum of scientific machine learning domains.
1. Motivation: Quadratic Complexity in Neural Operator Attention
Standard self-attention forms an $N \times N$ affinity (score) matrix by querying $N$ points—each with $d$-dimensional features—against one another, as in $A = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d}\big)$. The resulting $O(N^2)$ scaling of both compute and memory becomes prohibitive for real-world PDE applications, such as the simulation of 3D turbulence or inference on unstructured meshes with very large $N$.
Physics-Attention, as introduced in Transolver, projects the $N$ grid points onto $M$ physics-modes ("slices"), operating attention within lower-dimensional Monte Carlo integrals and mapping outputs back to the full domain via deslicing. This yields $O(NM + M^2)$ complexity, a substantial reduction that still includes a quadratic term in $M$ (the number of slices).
LinearNO architectures are motivated by the empirical observation that the softmax attention map is typically low-rank with respect to $N$. Thus, a suitable factorization—through fixed or learned projections, agent pooling, or slice-based kernels—recovers most of the expressivity at cost linear in $N$ (on the order of $O(NMd)$) for moderate $M$, as the sketch below illustrates.
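To make the cost contrast concrete, the following minimal PyTorch sketch (with illustrative sizes and the common elu+1 feature map; not taken from any of the cited papers) shows how a kernelized linear attention regroups the matrix products so that the $N \times N$ score matrix is never materialized:

```python
import torch
import torch.nn.functional as F

N, d = 4096, 64                                      # mesh points, feature dimension
x = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Quadratic attention: materializes the full N x N affinity matrix.
scores = torch.softmax(q @ k.T / d ** 0.5, dim=-1)   # (N, N), ~16.8M entries here
out_quadratic = scores @ v                           # (N, d)

# Kernelized linear attention: a positive feature map lets the products be
# regrouped, so only d x d summaries are formed instead of the N x N matrix.
phi = lambda t: F.elu(t) + 1                         # common positive feature map
context = phi(k).T @ v                               # (d, d) key-value summary
normalizer = phi(q) @ phi(k).sum(dim=0)              # (N,) per-query normalization
out_linear = (phi(q) @ context) / normalizer[:, None]
print(out_quadratic.shape, out_linear.shape)         # both torch.Size([4096, 64])
```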
2. Linear Attention Formulations: From Slicing to Agent-Based Mechanisms
Physics-Attention can be rigorously reformulated as a special case of linear attention, as shown in (Hu et al., 9 Nov 2025). The core update at each query $q_i$ becomes

$$
z_i \;=\; \frac{\phi(q_i)^{\top} \sum_{j=1}^{N} \psi(k_j)\, v_j^{\top}}{\phi(q_i)^{\top} \sum_{j=1}^{N} \psi(k_j)},
$$

where $\phi$ and $\psi$ are $M$-dimensional feature maps derived by linear or nonlinear projections and the $v_j$ are value projections. This avoids explicit $N \times N$ interactions by aggregating context along the lower-dimensional slice axis, with the normalizations implemented via softmax.
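The following PyTorch sketch implements this slice-wise update as described (projection shapes and softmax axes follow the text; the exact parameterization in (Hu et al., 9 Nov 2025) may differ):

```python
import torch

def slice_linear_attention(x, Wq, Wk, Wv):
    """x: (N, d) point features; Wq, Wk: (d, M) slice projections; Wv: (d, d)."""
    q = torch.softmax(x @ Wq, dim=-1)    # (N, M): each point's slice weights
    k = torch.softmax(x @ Wk, dim=0)     # (N, M): normalized over the N points
    v = x @ Wv                           # (N, d): value projection
    context = k.T @ v                    # (M, d): values aggregated per slice
    return q @ context                   # (N, d): "deslice" back onto the points

N, d, M = 2048, 128, 32
out = slice_linear_attention(torch.randn(N, d), torch.randn(d, M),
                             torch.randn(d, M), torch.randn(d, d))   # (N, d)
```

Because the $M \times d$ context is formed before any query is touched, the cost is $O(NMd)$ rather than $O(N^2 d)$.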
Agent-based variants, as in LANO (Zhong et al., 19 Oct 2025), further generalize this structure by introducing agent tokens. These mediate all global interactions in two softmax stages:
- Agent aggregation: $V_A = \mathrm{softmax}\!\big(A K^{\top}/\sqrt{d}\big)\, V$ (agents collect global information)
- Agent-mediated attention: $Z = \mathrm{softmax}\!\big(Q A^{\top}/\sqrt{d}\big)\, V_A$ (queries attend to the agent summary)
This composite operation can be interpreted as an implicit low-rank factorization of the full attention operator $\mathrm{softmax}\!\big(QK^{\top}/\sqrt{d}\big)V$, enforcing normalization, positivity, and discretization-invariance; a minimal sketch is given below.
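A compact sketch of the two softmax stages under these assumptions (agent tokens of shape $M \times d$ and $\sqrt{d}$ scaling; the pooling and parameter details in LANO may differ):

```python
import torch

def agent_attention(q, k, v, agents):
    """q, k, v: (N, d) query/key/value projections; agents: (M, d) agent tokens."""
    scale = q.shape[-1] ** -0.5
    # Stage 1 -- agent aggregation: each agent pools global information from k, v.
    agent_ctx = torch.softmax(agents @ k.T * scale, dim=-1) @ v      # (M, d)
    # Stage 2 -- agent-mediated attention: queries read the agent summaries.
    return torch.softmax(q @ agents.T * scale, dim=-1) @ agent_ctx   # (N, d)
```

Both stages cost $O(NMd)$, and the product of the two row-stochastic matrices is a rank-$M$ surrogate for the full attention map.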
3. Canonical Linear Attention Architectures
LinearNO instantiates canonical linear attention through asymmetric query/key projections, slice-wise softmaxes, and pre-aggregation of context:
- Compute $\tilde{Q} = \mathrm{softmax}_{M}(XW_Q) \in \mathbb{R}^{N \times M}$, $\tilde{K} = \mathrm{softmax}_{N}(XW_K) \in \mathbb{R}^{N \times M}$, and $V = XW_V \in \mathbb{R}^{N \times d}$
- Aggregate context: $C = \tilde{K}^{\top} V \in \mathbb{R}^{M \times d}$
- Output: $Z = \tilde{Q}\, C \in \mathbb{R}^{N \times d}$

The standard layer applies a position-wise MLP and residual connection to yield the next-layer features. Agent-based attention is implemented efficiently with learnable pooling matrices and multi-head decomposition ($M$ slices or agents per head); a layer-level sketch follows.
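The illustrative block below assembles these pieces (multi-head slice-wise linear attention, pre-LayerNorm, residuals, and a GeLU MLP); module names and the FFN multiplier are assumptions of this sketch rather than the reference implementation:

```python
import torch
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    """One LinearNO-style layer: pre-LN linear attention + position-wise MLP."""
    def __init__(self, dim=128, heads=8, slices=32, ffn_mult=2):
        super().__init__()
        self.heads, self.slices = heads, slices
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.to_q = nn.Linear(dim, heads * slices)   # asymmetric query projection
        self.to_k = nn.Linear(dim, heads * slices)   # asymmetric key projection
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim)
        )

    def forward(self, x):                                      # x: (B, N, dim)
        B, N, _ = x.shape
        h, m = self.heads, self.slices
        y = self.norm1(x)
        q = self.to_q(y).view(B, N, h, m).softmax(dim=-1)      # slice weights per head
        k = self.to_k(y).view(B, N, h, m).softmax(dim=1)       # normalized over points
        v = self.to_v(y).view(B, N, h, -1)                     # (B, N, h, dim // h)
        ctx = torch.einsum("bnhm,bnhc->bhmc", k, v)            # per-slice context
        out = torch.einsum("bnhm,bhmc->bnhc", q, ctx).reshape(B, N, -1)
        x = x + self.proj(out)                                 # attention residual
        return x + self.mlp(self.norm2(x))                     # FFN residual
```

For example, `LinearAttentionBlock()(torch.randn(4, 2048, 128))` returns a `(4, 2048, 128)` tensor; stacking $L$ such blocks between an embedding and a decoding MLP yields the configurations tabulated below.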
Typical configurations include:
| Hyperparameter | Value(s) |
|---|---|
| Layers ($L$) | 8 (Airfoil, Pipe, etc.) |
| Slices/agents ($M$) | 32–64 |
| Hidden dim ($d$) | 128–256 |
| Heads | 8 |
| FFN size | |
| Activation | GeLU |
Initialization and optimization employ AdamW with a OneCycle learning-rate schedule (see the sketch below).
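A minimal setup sketch consistent with this recipe; the learning rate, weight decay, and step counts are placeholders, since the exact values are not recorded above:

```python
import torch
import torch.nn as nn

# Stand-in for a LinearNO model (in practice, a stack of the blocks sketched above).
model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=500 * 100    # epochs * steps per epoch (assumed)
)

for step in range(10):                               # skeleton of the training loop
    loss = model(torch.randn(8, 128)).pow(2).mean()  # dummy objective for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```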
4. Computational and Memory Efficiency
By design, LinearNO attains linear complexity:
- Time/memory per layer: $O(NMd)$ (canonical linear attention, agent-based mechanisms)
- Compared to Transolver's Physics-Attention ($O(NM + M^2)$), LinearNO eliminates the slice-deslice cost
- Empirical reductions: on six PDE benchmarks (Hu et al., 9 Nov 2025), LinearNO yields substantial savings in parameters and compute relative to Physics-Attention (see the tables below)
Example benchmark statistics for parameter and FLOP savings:
| Benchmark | Transolver Params (M) | LinearNO Params (M) | Reduction |
|---|---|---|---|
| Airfoil | 2.81 | 1.77 | 37.0 % |
| Pipe | 3.07 | 1.77 | 42.3 % |
| NS | 11.23 | 3.38 | 69.9 % |

| Benchmark | Transolver Cost (GFLOPs) | LinearNO Cost (GFLOPs) | Saving |
|---|---|---|---|
| Airfoil | 32.38 | 21.34 | 34.1 % |
| NS | 46.16 | 15.53 | 66.4 % |
Measured memory usage for 3D turbulence (LAFNO (Peng et al., 2022)) drops from a theoretically terabyte-scale quadratic score matrix to a megabyte-scale linear-attention score matrix, with total GPU peak usage remaining at the gigabyte level.
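A back-of-the-envelope estimate of the score-matrix footprint under assumed sizes (a hypothetical $128^3$ grid, $M = 64$ slices, fp32 precision); the numbers illustrate the orders of magnitude involved rather than reproduce LAFNO's reported figures:

```python
# Orders of magnitude only; the grid size, slice count, and precision are assumptions.
N = 128 ** 3              # hypothetical 3D turbulence grid (~2.1M points)
M = 64                    # slices / agents
BYTES_FP32 = 4

quadratic_scores = N * N * BYTES_FP32   # full N x N attention map
linear_scores = N * M * BYTES_FP32      # N x M slice/agent weight matrix

print(f"quadratic score matrix: {quadratic_scores / 1e12:.1f} TB")   # ~17.6 TB
print(f"linear score matrix:    {linear_scores / 1e6:.0f} MB")       # ~537 MB
```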
5. Theoretical Properties and Approximation Guarantees
LANO (Zhong et al., 19 Oct 2025) establishes universal approximation capability for continuous operators between Sobolev spaces, under mild conditions on layer depth ($L$), agent count ($M$), embedding dimension ($d$), and FFN size. This means that for any such operator $\mathcal{G}$, any compact subset $K$ of its domain, and any $\epsilon > 0$, there exists a LinearNO model $\mathcal{N}_\theta$ such that

$$
\sup_{a \in K} \big\| \mathcal{G}(a) - \mathcal{N}_\theta(a) \big\| < \epsilon.
$$
The structural proof relates the agent-based attention to averaging kernels and exploits the expressivity of multi-layer perceptrons in the embedding and decoding stages. Conditioning and stability are favored by the use of pre-LayerNorm and residual scaling; the Jacobian norm per block is bounded, ensuring controlled Lipschitz constants and stable training.
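As an illustration of how the bounded per-block Jacobian translates into a controlled end-to-end Lipschitz constant (a standard composition argument, stated without the paper's specific constants):

```latex
% If each pre-LayerNorm residual block is f_\ell = \mathrm{id} + \alpha_\ell g_\ell
% with \|J_{g_\ell}\| \le c_\ell, then composing L blocks gives
\operatorname{Lip}\bigl(f_L \circ \cdots \circ f_1\bigr)
  \;\le\; \prod_{\ell=1}^{L} \bigl(1 + \alpha_\ell c_\ell\bigr),
% so bounded per-block Jacobians and modest residual scalings \alpha_\ell keep the
% network-level Lipschitz constant, and hence training dynamics, under control.
```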
6. Empirical Results on PDE Benchmarks
LinearNO was evaluated across six standard PDE benchmarks (relative error; a metric sketch follows the table):
| Task | Transolver ↓ | LinearNO ↓ |
|---|---|---|
| Airfoil | 0.0051 | 0.0049 |
| Pipe | 0.0027 | 0.0024 |
| Plasticity | 0.0012 | 0.0011 |
| NS | 0.0683 | 0.0650 |
| Darcy | 0.0051 | 0.0050 |
| Elasticity | 0.0052 | 0.0050 |
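For reference, the relative error in such tables is commonly the relative $L^2$ norm below; whether the cited papers use exactly this per-sample averaging is an assumption of this sketch:

```python
import torch

def relative_l2(pred, target, eps=1e-12):
    """pred, target: (batch, ...) tensors; mean of the per-sample relative L2 errors."""
    diff = (pred - target).flatten(start_dim=1).norm(dim=1)
    ref = target.flatten(start_dim=1).norm(dim=1)
    return (diff / (ref + eps)).mean()
```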
Across industrial datasets (AirfRANS, Shape-Net Car), LinearNO attains state-of-the-art accuracy on field predictions and aerodynamic coefficients:
| Model | Volume ↓ | Surface ↓ | $C_L$ ↓ | $\rho_L$ ↑ |
|---|---|---|---|---|
| Transolver | 0.0122 | 0.0550 | 0.1622 | 0.9904 |
| LinearNO | 0.0112 | 0.0372 | 0.2400 | 0.9951 |
A plausible implication is that the agent-based and slice-based aggregations in LinearNO contribute to superior cross-resolution generalization, as evidenced by zero-shot mesh transfer experiments. LANO also demonstrates mesh-refinement consistency and avoids discretization artifacts inherent in quadratic attention.
7. Extensions, Limitations, and Open Directions
LinearNO frameworks support several potential extensions:
- Dynamic adjustment of agent or slice count ($M$) per task or layer
- Integration of spatially/mesh-aware biases into agent features or physics-modes
- Coupling with physics-informed loss functions (PINNs, conservation laws) for robust long-time rollouts
- Adaptation to complex or irregular domains via geometry-aware operators (Geo-FNO)
- Hierarchical agent strategies to accommodate extremely high-dimensional or 3D problems
Limitations primarily arise in error accumulation over recurrent prediction horizons, especially for physical simulation tasks. While linear attention stabilizes high-dimensional context modeling, further research is warranted into incorporating hard physical constraints, adaptive multi-scale architectures, and automated mesh resolution management.
In sum, the Linear Attention Neural Operator family advances neural PDE solvers by recasting expensive quadratic attention in tractable, expressive linear forms. The derived models achieve stringent accuracy benchmarks, provable approximation properties, substantial resource efficiency, and robust generalization across diverse scientific domains.