Sliced ReLU Attention Mechanism
- Sliced ReLU attention is a non-symmetric, quasi-linear kernel mechanism that uses one-dimensional ReLU projections to compute query–key differences efficiently.
- It employs efficient sorting and prefix-sum algorithms to reduce computational complexity, enabling scalable attention for long sequence contexts.
- Empirical evaluations show that this mechanism preserves softmax's universality while often outperforming it on benchmarks like LRA and Tiny ViT.
Sliced ReLU attention is an attention mechanism designed to achieve quasi-linear computational complexity while preserving strong theoretical and practical expressivity, particularly for long sequence contexts. Unlike softmax and prior ReLU-based alternatives, it introduces a fundamental structural departure by applying the one-dimensional ReLU function to learnable projections of query–key differences, then leveraging efficient sorting and prefix-sum computations. This yields a fully differentiable, non-symmetric kernel suitable for scalable attention in transformer architectures, and rigorously retains universality properties previously established only for softmax attention (Boufadène et al., 12 Dec 2025).
1. Formal Definition and Kernel Structure
Let $X = (x_1, \ldots, x_n)$ denote a sequence of input tokens. Sliced ReLU attention is defined by the following elements:
- $Q, K$: learned query and key projections.
- $V$: value projection.
- $\Pi$: a learned one-dimensional projection, typically parameterized by a small MLP.
For each query $x_i$ and key $x_j$, define the scalar scores $s_i = \Pi Q x_i$ and $t_j = \Pi K x_j$. The raw sliced ReLU kernel is
$K(s_i, t_j) = \ReLU(s_i - t_j) = \max\{s_i - t_j,\, 0\}$
This kernel is inherently non-symmetric: $\ReLU(s_i-t_j) \neq \ReLU(t_j-s_i)$. To avoid denominators vanishing in the attention mechanism, normalization is by the sum of absolute differences: $\mathcal{A}_{\theta, \Pi}^{\ReLU}(x_i; X) = \sum_{j=1}^n \frac{\ReLU(\Pi Q x_i - \Pi K x_j)} {\sum_{l=1}^n |\Pi Q x_i - \Pi K x_l|} \left(V x_j - \frac{1}{n} \sum_{m=1}^n V x_m\right)$ Centering $V x_j$ by its empirical mean places the kernel in the zero-sum subspace, where $\ReLU$ is conditionally positive definite (CPD). The kernel can also be written as an asymmetric form of the 1D energy distance kernel using $\ReLU(x-y) = \frac{1}{2}|x-y| + \frac{1}{2}(x-y)$ (Boufadène et al., 12 Dec 2025).
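The normalized attention formula above can be evaluated directly as a sanity check. The following is a minimal NumPy sketch of the $O(n^2)$ definition; all names are illustrative, and a simple linear map stands in for the small MLP parameterizing $\Pi$.

```python
import numpy as np

def sliced_relu_attention_naive(X, Q, K, V, pi):
    """Direct O(n^2) evaluation of the sliced ReLU attention head.

    X: (n, d) token matrix; Q, K, V: (d, d) projections;
    pi: callable mapping an (n, d) array to (n,) scalar scores (the 1D
    projection Pi). Illustrative sketch, not a reference implementation.
    """
    s = pi(X @ Q.T)                    # query scores s_i = Pi(Q x_i)
    t = pi(X @ K.T)                    # key scores   t_j = Pi(K x_j)
    vals = X @ V.T
    gamma = vals - vals.mean(axis=0)   # centered value vectors
    diff = s[:, None] - t[None, :]     # (n, n) pairwise differences s_i - t_j
    num = np.maximum(diff, 0.0)        # non-symmetric ReLU kernel
    den = np.abs(diff).sum(axis=1, keepdims=True)  # per-query normalizer
    return (num / den) @ gamma         # (n, d) attention output

# toy run with random tokens and projections
rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
w_pi = rng.normal(size=d)              # fixed linear stand-in for the Pi MLP
pi = lambda Z: Z @ w_pi
out = sliced_relu_attention_naive(X, Q, K, V, pi)
```

Centering `vals` before the weighted sum mirrors the zero-sum subspace argument: the kernel weights need not sum to one per query, so the mean-subtraction is what keeps the output well-behaved.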
2. Quasi-linear Algorithm via Sorting and Prefix Sums
A direct computation of all pairwise ReLU scores is quadratic in $n$, but the architecture exploits the 1D structure to achieve $O(n \log n)$ complexity:
- Sort the projected scalars into increasing order: $z_{(1)} \le z_{(2)} \le \cdots \le z_{(n)}$.
- Let $\gamma_j = V x_j - \frac{1}{n} \sum_{m=1}^n V x_m$ denote the centered value vectors.
- For each position $i$, the sum
$\sum_{j=1}^n \ReLU(z_{(i)} - z_{(j)})\, \gamma_{(j)}$
reduces to prefix sums: since $\ReLU(z_{(i)} - z_{(j)}) = z_{(i)} - z_{(j)}$ for $j \le i$ and vanishes otherwise, $\sum_{j=1}^n \ReLU(z_{(i)} - z_{(j)})\, \gamma_{(j)} = z_{(i)} \sum_{j \le i} \gamma_{(j)} - \sum_{j \le i} z_{(j)}\, \gamma_{(j)}$. These operations are performed via a single linear scan after sorting, achieving $O(n \log n)$ complexity for the full attention head. With appropriate data structures, unsorting is $O(n)$ or $O(n \log n)$ (Boufadène et al., 12 Dec 2025).
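The sort-and-prefix-sum reduction can be sketched in NumPy. Below, queries are located among the sorted key scores via binary search rather than a merged scan, which still gives $O(n \log n)$ overall; function and variable names are illustrative, and the same prefix-sum trick handles the absolute-value normalizer.

```python
import numpy as np

def sliced_relu_attention_fast(s, t, gamma):
    """O(n log n) sliced ReLU attention from scalar scores.

    s: (n,) query scores; t: (n,) key scores; gamma: (n, d) centered values.
    Illustrative sketch of the sort + prefix-sum algorithm.
    """
    order = np.argsort(t)
    t_s, g_s = t[order], gamma[order]
    d = g_s.shape[1]
    # prefix sums over sorted keys, with a leading zero entry for k = 0
    T = np.concatenate([[0.0], np.cumsum(t_s)])                 # sum of t_(j), j <= k
    G = np.vstack([np.zeros(d), np.cumsum(g_s, axis=0)])        # sum of gamma_(j)
    Z = np.vstack([np.zeros(d),
                   np.cumsum(t_s[:, None] * g_s, axis=0)])      # sum of t_(j) gamma_(j)
    n = len(t)
    k = np.searchsorted(t_s, s, side="right")                   # #{j : t_j <= s_i}
    # numerator: sum_j ReLU(s_i - t_j) gamma_j = s_i * G_k - Z_k
    num = s[:, None] * G[k] - Z[k]
    # normalizer: sum_j |s_i - t_j|, split at the query's rank
    den = s * k - T[k] + (T[n] - T[k]) - s * (n - k)
    return num / den[:, None]
```

Because every pairwise term is recovered exactly from the prefix sums, this matches the quadratic formula to floating-point precision; no kernel approximation is involved, unlike feature-map linearizations.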
3. Theoretical Expressivity and Universality
Sequence-to-sequence Disentangling
Sliced ReLU attention retains the sequence-to-sequence contextual expressivity of softmax attention. Given $p$ pairs of labeled source and target sequences of length $n$ with pairwise-distinct tokens, there exists a composition of at most $2p(n+1)-1$ ReLU-attention layers mapping each source to its corresponding target with no inter-sequence interference. The proof relies on repeated applications of a 1D "splitting lemma" to separate and adjust sequences, interleaved with local updates via finite ReLU combinations.
Contextual Universal Approximation in the Mean-field Limit
Consider a setting where input sequences are probability measures $\mu$. A single sliced ReLU head at test point $x$ is: $\Gamma_{\theta}(x,\mu) = x + \sum_{h=1}^H W^h\, \frac{\int \ReLU(\Pi Q^h x - \Pi K^h y) V^h y\, d\mu(y)} {\int |\Pi Q^h x - \Pi K^h z|\, d\mu(z)}$ By composing such layers (potentially with interleaved pointwise MLPs) to produce maps
$\Gamma_{\theta_L} \circ \cdots \circ \Gamma_{\theta_1}$, it follows that for any continuous contextual target map and any accuracy $\varepsilon > 0$, there exists such a composite network approximating it to accuracy $\varepsilon$ (Boufadène et al., 12 Dec 2025).
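The mean-field head has a direct Monte Carlo discretization when $\mu$ is approximated by samples: both integrals become empirical averages, and the $1/m$ factors cancel in the ratio. A single-head sketch under these assumptions (all names illustrative, with a linear stand-in for $\Pi$):

```python
import numpy as np

def mean_field_sliced_relu_head(x, Y, Q, K, V, W, w):
    """Monte Carlo estimate of one sliced ReLU head Gamma(x, mu).

    x: (d,) test point; Y: (m, d) samples approximating mu;
    Q, K, V, W: (d, d) projections; w: (d,) linear stand-in for Pi.
    Hypothetical sketch of the integral formula, not the authors' code.
    """
    s = w @ (Q @ x)                             # scalar Pi Q x
    t = (Y @ K.T) @ w                           # scalars Pi K y, one per sample
    # empirical versions of the two integrals; the 1/m normalizations cancel
    num = np.maximum(s - t, 0.0) @ (Y @ V.T)    # ~ int ReLU(...) V y dmu(y)
    den = np.abs(s - t).sum()                   # ~ int |...| dmu(z)
    return x + W @ (num / den)

rng = np.random.default_rng(2)
d, m = 4, 10
x = rng.normal(size=d)
Y = rng.normal(size=(m, d))
Q, K, V, W = (rng.normal(size=(d, d)) for _ in range(4))
w = rng.normal(size=d)
out = mean_field_sliced_relu_head(x, Y, Q, K, V, W, w)
```

The residual connection (`x + ...`) matches the displayed formula and is what lets compositions of such layers realize general contextual maps.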
4. Comparison with Softmax and Prior Attention Mechanisms
| Kernel | Symmetry | Complexity | Universality | Expressivity on Long Contexts |
|---|---|---|---|---|
| Softmax | Symmetric | $O(n^2)$ | Yes | Well-studied, strong |
| Linearized/Low-rank (Performers, Linformer) | Mixed | $O(n)$ | Often lost | Typically reduced on long range |
| Prior ReLU (dot-product) | Mixed | $O(n^2)$ | No guarantee | Less stable |
| Sliced ReLU | Non-symmetric | $O(n \log n)$ | Yes | Strong, preserves softmax property |
Sliced ReLU attention replaces high-dimensional inner products by learned 1D projections and uses asymmetric ReLU kernels, achieving exact global interactions in $O(n \log n)$ complexity. Unlike feature-map-based linearizations, which often degrade recall or expressivity on long input sequences, sliced ReLU matches the theoretical universality of softmax attention (Boufadène et al., 12 Dec 2025).
5. Empirical Evaluation
Sliced ReLU attention was evaluated across several small-scale benchmarks:
- Long Range Arena (LRA): Tasks included ListOps (2K tokens), byte-text classification (4K), retrieval (2×4K), CIFAR-10 (1K), and Pathfinder (1K). Average accuracy: Softmax 59.8%, Sliced ReLU 62.9%. Sliced ReLU outperformed softmax on retrieval and Pathfinder, and underperformed on ListOps and text. Throughput for long sequences showed 1.4×–4× gains for Sliced ReLU (e.g., ListOps, inference samples/sec: Softmax 140, Sliced ReLU 202).
- Tiny ViT on CIFAR-10 / Tiny ImageNet: 8/16-head architectures with matched patch embeddings, depth, and MLPs. Sliced ReLU-bump closely tracked Softmax accuracy; plain Sliced ReLU trailed slightly. Despite a 10% parameter increase from the MLP, ReLU kernels matched or exceeded softmax in low-capacity regimes.
- ModelNet40 point-cloud classification (Point Cloud Transformer without neighborhoods): Accuracy—Softmax 86.3%, ReLU-bump 85.4%, plain ReLU 76.2%. The ReLU-bump kernel preserved fine geometric details nearly as well as softmax, while plain ReLU was too diffuse (Boufadène et al., 12 Dec 2025).
6. Context, Limitations, and Implications
Sliced ReLU attention establishes a new point in the trade-off landscape between complexity, accuracy, and expressivity. It is fully differentiable, scales efficiently to long contexts, and theoretically preserves the universality and sequence-disentangling properties of softmax attention. The reliance on sorting and prefix scans means runtime is hardware-friendly and scalable. Empirically, accuracy can slightly lag softmax in some configurations or task settings, but in several benchmarks it proves competitive or superior, particularly on tasks stressing global sequence reasoning. A plausible implication is that sliced ReLU attention enables practical scaling of transformer architectures for tasks with very long input contexts while retaining high expressivity (Boufadène et al., 12 Dec 2025).