Relative Positional Attention

Updated 11 May 2026

Relative positional attention is an architectural mechanism that encodes the relative distances between tokens to capture intrinsic sequence relationships.
It employs diverse methods such as learned local tables, bias-centric forms, and kernelized approaches to overcome limitations of absolute positional encodings.
Empirical studies show improvements in translation, speech, and vision tasks, yielding better generalization and efficient long-sequence processing.

Relative positional attention is an architectural mechanism in neural sequence models, especially Transformers, that augments or replaces absolute position information with representations of the relative positions or distances between tokens. It addresses key limitations of absolute encodings when generalizing to different input lengths, adapting to data with repeated or shifted sequences, and modeling relationships invariant to global position. Relative schemes now span a rich ecosystem—from learned local tables and bias terms to kernelized parametric forms, polynomial and hyperbolic bases, and domain-specific geometric equivalents.

1. Core Mechanisms and Mathematical Formulation

The classical approach to relative positional attention is to parameterize a learnable or functional relationship between pairs of tokens based on their relative position (e.g., $\delta_{ij} = j-i$), and inject this information into the attention scoring function. The most foundational mechanism is that of Shaw et al. (2018), who introduced learned tables of relative position embeddings. For an input sequence $z = (z_1, z_2, \ldots, z_T)$, a table $W = \{w_{-k},\ldots, w_0, \ldots, w_{k}\}$ parameterizes an embedding for each relative offset within a fixed window of radius $k$. For query position $i$ and key position $j$, the modified attention score is

$e_{ij} = \frac{Q_i \cdot (K_j + a_{ij})^T}{\sqrt{d_k}}, \quad a_{ij} = w_{\delta_{ij}},\quad \delta_{ij} = \text{clip}(j-i, -k, k)$

This additive or bias-based injection of relative information is now foundational to a wide family of variants. Biases can be scalar or vector, and may depend on distance buckets, continuous kernels, or group-theoretic operators.
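
The score formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation of Shaw et al.: the function names `relative_scores` and `softmax`, the list-of-lists layout, and the convention that `W[delta + k]` stores the embedding for offset `delta` are all assumptions made for the example.

```python
import math

def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def relative_scores(Q, K, W, k):
    """Compute e_ij = Q_i . (K_j + a_ij) / sqrt(d_k), with a_ij = w_{clip(j - i, -k, k)}.

    Q, K: lists of T vectors, each of length d_k.
    W:    list of 2k + 1 vectors; W[delta + k] is the embedding for offset delta
          (this indexing convention is an assumption of the sketch).
    """
    d_k = len(Q[0])
    scale = math.sqrt(d_k)
    scores = []
    for i, q in enumerate(Q):
        row = []
        for j, key in enumerate(K):
            a_ij = W[clip(j - i, -k, k) + k]
            dot = sum(qc * (kc + ac) for qc, kc, ac in zip(q, key, a_ij))
            row.append(dot / scale)
        scores.append(row)
    return scores

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: with zero keys, the score depends only on the clipped offset j - i.
Q = [[1.0, 0.0]] * 4
K = [[0.0, 0.0]] * 4
W = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # offsets -1, 0, +1
weights = [softmax(row) for row in relative_scores(Q, K, W, k=1)]
```

Because offsets are clipped to the window, every key more than `k` positions away from the query shares the same embedding, which keeps the table at `2k + 1` entries regardless of sequence length.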

Many extensions incorporate additional forms—such as value-side biases (Shaw et al., 2018), explicit polynomial or kernel bases (Zhang, 5 May 2026, Chi et al., 2022, Gao, 2024), hyperbolic or log-polynomial maps (Angelotti, 2023, Gao, 2024, Chi et al., 2022), or geometric operators for structured data (Li et al., 14 Jul 2025). The form of the bias determines which aspects of the relative arrangement the model is most sensitive to: local neighborhoods, oscillatory or modulated patterns, logarithmic decay, and so forth.

2. Design Variants and Extensions

Relative positional attention has undergone substantial generalization:

Learned local tables: Models such as Transformer-XL and its ASR application (Zhou et al., 2019, Pham et al., 2020) use small windows of learnable embeddings $w_{\delta}$, with offsets often clipped to a fixed radius. These are simple, robust, and sufficient for many NLP, speech, and vision tasks where position correlations are mostly local.
Bias-centric forms (ALiBi, T5, etc.): A learned or hand-designed scalar is added to the attention logit for each distance, often as $b(i, j) = -\lambda |j-i|$ or a piecewise/bucketed version (Gao, 2024, Chi et al., 2022). These forms are fast to implement and extrapolate well to longer sequences when appropriately parameterized (e.g., with logarithmic or polynomial decay).
Parametric/Kernelized forms: CPD (conditionally positive definite) kernels enable smooth, slowly decaying biases such as

$b(i,j) = - r_1 \log(1 + r_2 |j-i|^p)$

(Chi et al., 2022, Gao, 2024). Multi-kernel mixtures (MEP) (Gao, 2024) combine several such kernels for improved extrapolation.

Rotary and Jordan Blocks: RoPE (Zhang, 5 May 2026, 2502.01951) and its non-semisimple generalization Jordan–RoPE couple rotary phase (sin/cos) with polynomial modulation (e.g., $d e^{i \omega d}$) for richer expressivity on tasks involving periodic or distance-modulated signals.
Hyperbolic, Logarithmic, and Polynomial Bases: HyPE (Angelotti, 2023) uses hyperbolic sine functions ($-\tau \sinh(\mu(j-i))$), offering a monotonic yet flexible bias. Logarithmic and polynomial decay forms in KERPLE and MEP cater to very long-sequence generalization (Chi et al., 2022, Gao, 2024).
Linear-time/FFT Approaches: Leveraging the Toeplitz/circulant structure of relative position bias, FastRPB and related schemes perform bias–value multiplication in $O(N \log N)$ using FFTs (Zubkov et al., 2022, Luo et al., 2021), making them suitable for long sequences.
Spiking, MLP, and Geometric Extensions: Adaptations exist for binary spiking networks (Gray-PE/Log-PE (Lv et al., 28 Jan 2025)), MLP token-mixing layers (PoSGU (Wang et al., 2022)), and multi-view 3D geometry (PRoPE (Li et al., 14 Jul 2025)).
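
To make the bias-centric and kernelized forms in the list above concrete, the following sketch evaluates the linear bias $b(i,j) = -\lambda |j-i|$ and the logarithmic bias $b(i,j) = -r_1 \log(1 + r_2 |j-i|^p)$ as plain functions. The function names, the `slope` argument standing in for $\lambda$, and the helper `bias_matrix` are hypothetical; in a trained model $r_1$, $r_2$, and the slope would be learned or fixed per head rather than passed by hand.

```python
import math

def alibi_bias(i, j, slope):
    """Linear distance penalty b(i, j) = -slope * |j - i|."""
    return -slope * abs(j - i)

def log_kernel_bias(i, j, r1, r2, p=1.0):
    """Logarithmic penalty b(i, j) = -r1 * log(1 + r2 * |j - i|**p), r1, r2 > 0."""
    return -r1 * math.log1p(r2 * abs(j - i) ** p)

def bias_matrix(bias_fn, T, **params):
    """Dense T x T matrix of biases to be added to the attention logits."""
    return [[bias_fn(i, j, **params) for j in range(T)] for i in range(T)]

# Both matrices are Toeplitz: entry (i, j) depends only on j - i.
linear = bias_matrix(alibi_bias, 6, slope=0.5)
logarithmic = bias_matrix(log_kernel_bias, 6, r1=1.0, r2=1.0)
```

The logarithmic form penalizes distant tokens far less steeply than the linear one, which is one way to read the "slowly decaying" property attributed to these kernels.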

3. Implementation and Efficiency Considerations

Relative positional encodings introduce representational power with modest computational and parameter overhead when appropriately designed:

Parameter count: Pure table-based forms (e.g., (Shaw et al., 2018, Zhou et al., 2019)) have parameter cost linear in the maximum bucket/window size and the hidden dimension. Kernelized/parametric forms (ALiBi, KERPLE, MEP, HyPE) may use only a few extra parameters per head or layer. Some approaches, especially for domain adaptation (spiking, vision MLPs), further minimize overhead via lightweight or even parameter-free encodings.
Computational complexity: Naive implementations can require $O(N^2)$ space/time per attention layer (where $N$ is the sequence length); some strategies use shifts, batched matmuls, or convolution/FFT techniques to bring this down to $O(N \log N)$ or linear (Zubkov et al., 2022, Luo et al., 2021, Liutkus et al., 2021).
Integration points: Relative position information may be incorporated into the attention key, value, or directly as a logit bias; architectural details (e.g., sharing embeddings across heads or layers, fusing with Q/K/V projections) differ among variants.
Compatibility: Newer forms, such as HyPE (Angelotti, 2023), are explicitly constructed for compatibility with efficient attention frameworks (e.g., FlashAttention-2), sidestepping the $O(N^2)$ memory bottleneck.
Non-sequential data: Generalizations exist for non-contiguous (GAM (Pandya, 2022)), graph-structured, or multi-dimensional data.
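
The FFT route mentioned under computational complexity rests on the fact that a relative bias matrix is Toeplitz, so multiplying it by a vector is a convolution. The sketch below embeds the Toeplitz matrix in a circulant one and multiplies through a small radix-2 FFT. It is a dependency-free illustration of the general idea, not the FastRPB algorithm itself; `fft`, `toeplitz_bias_matvec`, and `dense_bias_matvec` are hypothetical names, and a practical implementation would call a library FFT instead.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. Unnormalized."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def toeplitz_bias_matvec(bias, x):
    """Return y with y_i = sum_j bias(j - i) * x_j in O(N log N) time."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # circulant size: next power of two >= 2N - 1
        m *= 2
    c = [0.0] * m
    for d in range(-(n - 1), n):
        c[d % m] = bias(-d)  # first circulant column: c[(i - j) mod m] = bias(j - i)
    x_pad = list(x) + [0.0] * (m - n)
    prod = [u * v for u, v in zip(fft(c), fft(x_pad))]
    y = fft(prod, invert=True)
    return [v.real / m for v in y[:n]]  # divide by m to normalize the inverse

def dense_bias_matvec(bias, x):
    """Quadratic-time reference used only to check the FFT version."""
    n = len(x)
    return [sum(bias(j - i) * x[j] for j in range(n)) for i in range(n)]
```

Applying `toeplitz_bias_matvec` to each value channel gives the same result as the dense product while never materializing the full bias matrix.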

4. Empirical Results and Practical Impact

Relative positional schemes have demonstrated strong empirical gains across tasks:

Machine translation and speech: Shaw et al. (Shaw et al., 2018) report +1.3 BLEU in WMT14 En→De; speech ASR and ST transformers show 0.5–1.2% absolute WER/BLEU improvements (Zhou et al., 2019, Pham et al., 2020).
Long-sequence extrapolation: Kernelized/logarithmic schemes (KERPLE (Chi et al., 2022), MEP (Gao, 2024)) decisively outperform standard absolute and exponential RPEs in perplexity at evaluation lengths 4–30× greater than training, e.g., OpenWebText2: perplexity 22.5 (ALiBi) → 21.4 (KERPLE-log) at 16K tokens (Chi et al., 2022).
Vision and MLPs: PoSGU and group-wise RPE can improve ImageNet1K accuracy by up to +2.4% while reducing parameter counts from O(N²) to O(N) or O(1) per window (Wang et al., 2022).
Spiking networks: Binary-native Gray and Log-PE deliver measurable R² and accuracy gains on time-series, text, and image patch tasks relative to absolute PE (Lv et al., 28 Jan 2025).
Geometry/3D: Projective positional encodings (PRoPE) deliver substantial robustness and accuracy increases for multi-view, variable-camera, and OOD generalization benchmarks, outperforming prior raymap and relative SE(3) forms (Li et al., 14 Jul 2025).
Efficiency and stability: FFT-based methods and kernel normalization can both accelerate inference and stabilize training for kernelized attention, overcoming variance and sharpness tradeoffs (Luo et al., 2021).

5. Theoretical Analyses and Behavior under Causal Masking

Fundamental theoretical work has clarified the effect of relative positional attention, especially in the context of masked (causal) decoders:

Bias-geometry interplay: Causal masking alone already induces a strong, monotonically increasing attention bias to nearby predecessors, even without explicit position terms. This causes downstream representations to build up a form of localized, relative distance preference (Kim et al., 25 Sep 2025, 2502.01951).
Interference with explicit encodings: The interaction of causal mask and relative PEs (e.g., RoPE, ALiBi) leads to non-trivial, often non-relative, global patterns, distorting intended distance bands (Kim et al., 25 Sep 2025). Fine-tuning the strength and decay of the explicit bias must therefore account for its combinatorial accumulation across layers and interplay with mask-induced bias (2502.01951).
Kernel families and expressivity: Group-theoretic and kernel-theoretic analyses formalize which classes of relative transformations are captured by a given schema. For example, ALiBi's linear logit penalty corresponds to exponential decay of the unnormalized attention weight with distance, RoPE encodes translation via phase, and Jordan–RoPE couples phase and polynomial decay for mixed periodic and distance-modulated behaviors (Zhang, 5 May 2026, Chi et al., 2022). Multiple-kernel, polynomial, or log-kernel variants further extend the spectrum of attainable bias functions (Gao, 2024, Chi et al., 2022).
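
Two of the properties named above can be checked numerically. The sketch below uses a single two-dimensional rotary pair with a hypothetical frequency `theta` (real RoPE layers apply many such pairs with different frequencies) and shows that the rotated query-key score depends only on the offset between positions. It also shows that a linear logit penalty yields a constant ratio between the unnormalized weights at consecutive distances, which is exponential decay. The function names are assumptions of the example.

```python
import math

def rotate(vec, pos, theta):
    """Rotate a 2-D vector by the angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, m, n, theta):
    """Dot product of q rotated to position m with k rotated to position n."""
    qm = rotate(q, m, theta)
    kn = rotate(k, n, theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

def alibi_weight(distance, slope):
    """Unnormalized attention weight exp(-slope * distance) from a linear bias."""
    return math.exp(-slope * distance)

q, k = (1.0, 2.0), (0.5, -1.5)
near = rope_score(q, k, 3, 7, theta=0.1)      # offset 4, small positions
far = rope_score(q, k, 103, 107, theta=0.1)   # offset 4, shifted by 100
ratio = alibi_weight(6, 0.5) / alibi_weight(5, 0.5)  # equals exp(-0.5)
```

Here `near` and `far` agree up to floating-point error because rotations compose additively, so only the difference of the two positions survives in the dot product.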

6. Practical Design, Limitations, and Recommendations

Relative positional attention, though powerful, must be tuned to the specific architecture and task:

Choice of bias function: Use exponential or log-based kernels for long extrapolation; mix multiple kernels (MEP) for robust smoothness at scale (Gao, 2024). Prefer logarithmic decay when maximal extrapolation is required (KERPLE-log) (Chi et al., 2022), and RoPE/Jordan blocks when periodic or modulated signals are present (Zhang, 5 May 2026).
Head and layer specialization: Assigning distinct bias slopes or kernel weights to attention heads and/or layers can further balance local and global context (Gao, 2024, Chi et al., 2022).
Limits of naive absolute+relative combinations: Empirical studies in machine translation show that mixing absolute and relative positional encodings yields no further improvement, and may introduce redundancy or conflict (Shaw et al., 2018).
Compatibility with linear-time attention: When using linear or kernelized Transformers, adopt Toeplitz/circulant FFT approaches (FastRPB) (Zubkov et al., 2022), or stochastic kernel approximations (SPE) (Liutkus et al., 2021), to maintain tractability at large sequence lengths.
Domain adaptation: In geometric and multi-view tasks, use application-specific relative encodings such as projective or SE(3) block-diagonal transforms (PRoPE) (Li et al., 14 Jul 2025); for SNNs, utilize binary-coded or quantized log-based RPE (Lv et al., 28 Jan 2025).
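
The head-specialization recommendation above can be illustrated with per-head linear slopes. The geometric schedule used here, $2^{-8h/H}$ for head $h = 1, \ldots, H$, follows the one commonly described for ALiBi; it is an assumption of this sketch rather than something the recommendations above prescribe, and the function names are hypothetical.

```python
def geometric_slopes(num_heads):
    """Slopes 2**(-8 * h / num_heads) for h = 1..num_heads.

    Assumes num_heads is a power of two, as in the usual description
    of this schedule.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

def per_head_bias(num_heads, T):
    """One T x T linear bias matrix per head, each with its own slope."""
    return [
        [[-slope * abs(j - i) for j in range(T)] for i in range(T)]
        for slope in geometric_slopes(num_heads)
    ]

# Heads with large slopes focus on nearby tokens; heads with small slopes
# keep distant tokens almost unpenalized.
biases = per_head_bias(num_heads=8, T=16)
```

Spreading slopes over several orders of magnitude gives some heads a sharply local view and others a nearly global one, which is the balance between local and global context that the recommendation describes.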

7. Limitations and Open Directions

While relative positional attention outperforms absolute encodings in most of the generalization, robustness, and adaptability tests reported above, certain limitations remain:

Memory and efficiency: Quadratic time and memory requirements remain for many explicit table-based schemes; recent work addresses this with FFT-based, kernelized, or highly parameterized approaches.
Non-commutativity and multi-modal data: Coupling multiple types of relative geometry, e.g., projective and phase, may raise non-commutativity or implementation concerns (Li et al., 14 Jul 2025, Zhang, 5 May 2026).
Theoretical-experimental gap: Some theoretically flexible forms (kernel mixtures, Jordan blocks) may not yield significant gains on natural text when the relevant target functions are not present (Zhang, 5 May 2026); empirical tuning and task-specific ablation remain essential.
Interaction with mask and architecture: The nontrivial interaction of mask-induced and explicit relative biases, and their depth-wise compounding, complicates extrapolation and performance prediction (Kim et al., 25 Sep 2025, 2502.01951).

In sum, relative positional attention is a paradigm-defining enhancement for sequence models, enabling robust length generalization, local-invariant reasoning, efficient context aggregation, and exploitation of structured relationships in data. Its design space is now rich and multi-faceted, supporting application- and domain-specific innovations suitable for both classical and modern Transformer architectures.