Sliced ReLU Attention Mechanism
- Sliced ReLU attention is a non-symmetric, quasi-linear kernel mechanism that uses one-dimensional ReLU projections to compute query–key differences efficiently.
- It employs efficient sorting and prefix-sum algorithms to reduce computational complexity, enabling scalable attention for long sequence contexts.
- Empirical evaluations show that this mechanism preserves softmax's universality while often outperforming it on benchmarks like LRA and Tiny ViT.
Sliced ReLU attention is an attention mechanism designed to achieve quasi-linear computational complexity while preserving strong theoretical and practical expressivity, particularly for long sequence contexts. Unlike softmax and prior ReLU-based alternatives, it introduces a fundamental structural departure by applying the one-dimensional ReLU function to learnable projections of query–key differences, then leveraging efficient sorting and prefix-sum computations. This yields a fully differentiable, non-symmetric kernel suitable for scalable attention in transformer architectures, and rigorously retains universality properties previously established only for softmax attention (Boufadène et al., 12 Dec 2025).
1. Formal Definition and Kernel Structure
Let $X = (x_1, \ldots, x_n)$ denote a sequence of input tokens. Sliced ReLU attention is defined by the following elements:
- $Q, K$: learned query and key projections.
- $V$: value projection.
- $\Pi$: a learned one-dimensional projection, typically parameterized by a small MLP.
For each query $x_i$ and key $x_j$, define the scalar scores $s_i = \Pi Q x_i$ and $t_j = \Pi K x_j$. The raw sliced ReLU kernel is
$K(s_i, t_j) = \ReLU(s_i - t_j) = \max\{s_i - t_j,\, 0\}$
This kernel is inherently non-symmetric: $\ReLU(s_i-t_j) \neq \ReLU(t_j-s_i)$. To avoid denominators vanishing in the attention mechanism, normalization is by the sum of absolute differences: $\mathcal{A}_{\theta, \Pi}^{\ReLU}(x_i; X) = \sum_{j=1}^n \frac{\ReLU(\Pi Q x_i - \Pi K x_j)} {\sum_{l=1}^n |\Pi Q x_i - \Pi K x_l|} \left(V x_j - \frac{1}{n} \sum_{m=1}^n V x_m\right)$ Centering $V x_j$ by its empirical mean places the kernel in the zero-sum subspace, where $\ReLU$ is conditionally positive definite (CPD). The kernel can also be written as an asymmetric form of the 1D energy distance kernel using $\ReLU(x-y) = \frac{1}{2}|x-y| + \frac{1}{2}(x-y)$ (Boufadène et al., 12 Dec 2025).
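The normalized attention formula above can be evaluated directly as a sanity check. The following is a minimal NumPy sketch of the $O(n^2)$ definition; all names are illustrative, and a simple linear map stands in for the small MLP parameterizing $\Pi$.

```python
import numpy as np

def sliced_relu_attention_naive(X, Q, K, V, pi):
    """Direct O(n^2) evaluation of the sliced ReLU attention head.

    X: (n, d) token matrix; Q, K, V: (d, d) projections;
    pi: callable mapping an (n, d) array to (n,) scalar scores (the 1D
    projection Pi). Illustrative sketch, not a reference implementation.
    """
    s = pi(X @ Q.T)                    # query scores s_i = Pi(Q x_i)
    t = pi(X @ K.T)                    # key scores   t_j = Pi(K x_j)
    vals = X @ V.T
    gamma = vals - vals.mean(axis=0)   # centered value vectors
    diff = s[:, None] - t[None, :]     # (n, n) pairwise differences s_i - t_j
    num = np.maximum(diff, 0.0)        # non-symmetric ReLU kernel
    den = np.abs(diff).sum(axis=1, keepdims=True)  # per-query normalizer
    return (num / den) @ gamma         # (n, d) attention output

# toy run with random tokens and projections
rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
w_pi = rng.normal(size=d)              # fixed linear stand-in for the Pi MLP
pi = lambda Z: Z @ w_pi
out = sliced_relu_attention_naive(X, Q, K, V, pi)
```

Centering `vals` before the weighted sum mirrors the zero-sum subspace argument: the kernel weights need not sum to one per query, so the mean-subtraction is what keeps the output well-behaved.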
2. Quasi-linear Algorithm via Sorting and Prefix Sums
A direct computation of all pairwise ReLU scores is quadratic in $n$, but the architecture exploits the 1D structure to achieve $O(n \log n)$ complexity:
- Sort the projected scalars into increasing order: $z_{(1)} \le z_{(2)} \le \cdots \le z_{(n)}$.
- Let $\gamma_j = V x_j - \frac{1}{n} \sum_{m=1}^n V x_m$ denote the centered value vectors.
- For each position $i$, the sum
$\sum_{j=1}^n \ReLU(z_{(i)} - z_{(j)})\, \gamma_{(j)}$
reduces to prefix sums: since $\ReLU(z_{(i)} - z_{(j)}) = z_{(i)} - z_{(j)}$ for $j \le i$ and vanishes otherwise, $\sum_{j=1}^n \ReLU(z_{(i)} - z_{(j)})\, \gamma_{(j)} = z_{(i)} \sum_{j \le i} \gamma_{(j)} - \sum_{j \le i} z_{(j)}\, \gamma_{(j)}$. These operations are performed via a single linear scan after sorting, achieving $O(n \log n)$ complexity for the full attention head. With appropriate data structures, unsorting is $O(n)$ or $O(n \log n)$ (Boufadène et al., 12 Dec 2025).
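The sort-and-prefix-sum reduction can be sketched in NumPy. Below, queries are located among the sorted key scores via binary search rather than a merged scan, which still gives $O(n \log n)$ overall; function and variable names are illustrative, and the same prefix-sum trick handles the absolute-value normalizer.

```python
import numpy as np

def sliced_relu_attention_fast(s, t, gamma):
    """O(n log n) sliced ReLU attention from scalar scores.

    s: (n,) query scores; t: (n,) key scores; gamma: (n, d) centered values.
    Illustrative sketch of the sort + prefix-sum algorithm.
    """
    order = np.argsort(t)
    t_s, g_s = t[order], gamma[order]
    d = g_s.shape[1]
    # prefix sums over sorted keys, with a leading zero entry for k = 0
    T = np.concatenate([[0.0], np.cumsum(t_s)])                 # sum of t_(j), j <= k
    G = np.vstack([np.zeros(d), np.cumsum(g_s, axis=0)])        # sum of gamma_(j)
    Z = np.vstack([np.zeros(d),
                   np.cumsum(t_s[:, None] * g_s, axis=0)])      # sum of t_(j) gamma_(j)
    n = len(t)
    k = np.searchsorted(t_s, s, side="right")                   # #{j : t_j <= s_i}
    # numerator: sum_j ReLU(s_i - t_j) gamma_j = s_i * G_k - Z_k
    num = s[:, None] * G[k] - Z[k]
    # normalizer: sum_j |s_i - t_j|, split at the query's rank
    den = s * k - T[k] + (T[n] - T[k]) - s * (n - k)
    return num / den[:, None]
```

Because every pairwise term is recovered exactly from the prefix sums, this matches the quadratic formula to floating-point precision; no kernel approximation is involved, unlike feature-map linearizations.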
3. Theoretical Expressivity and Universality
Sequence-to-sequence Disentangling
Sliced ReLU attention retains the sequence-to-sequence contextual expressivity of softmax attention. Given $p$ pairs of labeled source and target sequences of length $n$ with pairwise-distinct tokens, there exists a composition of at most $2p(n+1)-1$ ReLU-attention layers mapping each source to its corresponding target with no inter-sequence interference. The proof relies on repeated applications of a 1D "splitting lemma" to separate and adjust sequences, interleaved with local updates via finite ReLU combinations.
Contextual Universal Approximation in the Mean-field Limit
Consider a setting where input sequences are probability measures $\mu$. A single sliced ReLU head at test point $x$ is: $\Gamma_{\theta}(x,\mu) = x + \sum_{h=1}^H W^h\, \frac{\int \ReLU(\Pi Q^h x - \Pi K^h y) V^h y\, d\mu(y)} {\int |\Pi Q^h x - \Pi K^h z|\, d\mu(z)}$ By composing such layers (potentially with interleaved pointwise MLPs) to produce maps
$\Gamma_{\theta_L} \circ \cdots \circ \Gamma_{\theta_1}$, it follows that for any continuous contextual target map and any accuracy $\varepsilon > 0$, there exists such a composite network approximating it to accuracy $\varepsilon$ (Boufadène et al., 12 Dec 2025).
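The mean-field head has a direct Monte Carlo discretization when $\mu$ is approximated by samples: both integrals become empirical averages, and the $1/m$ factors cancel in the ratio. A single-head sketch under these assumptions (all names illustrative, with a linear stand-in for $\Pi$):

```python
import numpy as np

def mean_field_sliced_relu_head(x, Y, Q, K, V, W, w):
    """Monte Carlo estimate of one sliced ReLU head Gamma(x, mu).

    x: (d,) test point; Y: (m, d) samples approximating mu;
    Q, K, V, W: (d, d) projections; w: (d,) linear stand-in for Pi.
    Hypothetical sketch of the integral formula, not the authors' code.
    """
    s = w @ (Q @ x)                             # scalar Pi Q x
    t = (Y @ K.T) @ w                           # scalars Pi K y, one per sample
    # empirical versions of the two integrals; the 1/m normalizations cancel
    num = np.maximum(s - t, 0.0) @ (Y @ V.T)    # ~ int ReLU(...) V y dmu(y)
    den = np.abs(s - t).sum()                   # ~ int |...| dmu(z)
    return x + W @ (num / den)

rng = np.random.default_rng(2)
d, m = 4, 10
x = rng.normal(size=d)
Y = rng.normal(size=(m, d))
Q, K, V, W = (rng.normal(size=(d, d)) for _ in range(4))
w = rng.normal(size=d)
out = mean_field_sliced_relu_head(x, Y, Q, K, V, W, w)
```

The residual connection (`x + ...`) matches the displayed formula and is what lets compositions of such layers realize general contextual maps.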
4. Comparison with Softmax and Prior Attention Mechanisms
| Kernel | Symmetry | Complexity | Universality | Expressivity on Long Contexts |
|---|---|---|---|---|
| Softmax | Symmetric | $O(n^2)$ | Yes | Well-studied, strong |
| Linearized/Low-rank (Performers, Linformer) | Mixed | $O(n)$ | Often lost | Typically reduced on long range |
| Prior ReLU (dot-product) | Mixed | $O(n^2)$ | No guarantee | Less stable |
| Sliced ReLU | Non-symmetric | $O(n \log n)$ | Yes | Strong, preserves softmax property |
Sliced ReLU attention replaces high-dimensional inner products by learned 1D projections and uses asymmetric ReLU kernels, achieving exact global interactions in $O(n \log n)$ complexity. Unlike feature-map-based linearizations, which often degrade recall or expressivity on long input sequences, sliced ReLU matches the theoretical universality of softmax attention (Boufadène et al., 12 Dec 2025).
5. Empirical Evaluation
Sliced ReLU attention was evaluated across several small-scale benchmarks:
- Long Range Arena (LRA): Tasks included ListOps (2K tokens), byte-text classification (4K), retrieval (2×4K), CIFAR-10 (1K), and Pathfinder (1K). Average accuracy: Softmax 59.8%, Sliced ReLU 62.9%. Sliced ReLU outperformed softmax on retrieval and Pathfinder, and underperformed on ListOps and text. Throughput for long sequences showed 1.4×–4× gains for Sliced ReLU (e.g., ListOps, inference samples/sec: Softmax 140, Sliced ReLU 202).
- Tiny ViT on CIFAR-10 / Tiny ImageNet: 8/16-head architectures with matched patch embeddings, depth, and MLPs. Sliced ReLU-bump closely tracked Softmax accuracy; plain Sliced ReLU trailed slightly. Despite a 10% parameter increase from the MLP, ReLU kernels matched or exceeded softmax in low-capacity regimes.
- ModelNet40 point-cloud classification (Point Cloud Transformer without neighborhoods): Accuracy—Softmax 86.3%, ReLU-bump 85.4%, plain ReLU 76.2%. The ReLU-bump kernel preserved fine geometric details nearly as well as softmax, while plain ReLU was too diffuse (Boufadène et al., 12 Dec 2025).
6. Context, Limitations, and Implications
Sliced ReLU attention establishes a new point in the trade-off landscape between complexity, accuracy, and expressivity. It is fully differentiable, scales efficiently to long contexts, and theoretically preserves the universality and sequence-disentangling properties of softmax attention. The reliance on sorting and prefix scans means runtime is hardware-friendly and scalable. Empirically, accuracy can slightly lag softmax in some configurations or task settings, but in several benchmarks it proves competitive or superior, particularly on tasks stressing global sequence reasoning. A plausible implication is that sliced ReLU attention enables practical scaling of transformer architectures for tasks with very long input contexts while retaining high expressivity (Boufadène et al., 12 Dec 2025).