Relative Positional Encodings in Transformers
- Relative positional encodings are a method to encode token positions by using differences rather than absolute indices, enhancing model invariance and generalization.
- They employ various schemes such as additive biases, group actions, and kernel approximations, each tailored for different data modalities and computational constraints.
- Empirical results show that RPEs improve performance in tasks across language, vision, audio, and graph processing, with benefits in extrapolation and efficiency.
Relative positional encodings are patterns, mechanisms, or parameterizations used within Transformer models to supply positional information in a form that is invariant (or equivariant) under translation of the input, so that the model’s core operations depend only on the relative arrangement of tokens rather than their absolute sequence indices. This class of encodings underpins much of the recent progress in scaling, expressivity, and generalization of Transformers across natural language, vision, audio, and multimodal architectures. Relative encodings subsume a variety of concrete schemes, including formalisms based on Lie groups, explicit Toeplitz kernels, low-rank decompositions, learned trainable tables, and geometric invariants, each with their own mathematical and statistical properties.
1. Mathematical Foundations and General Principles
The fundamental distinction of a relative positional encoding (RPE) is its dependence on differences of position indices (or, more generally, coordinates) rather than on absolute position. For sequential data, if $i$ and $j$ denote token positions (or coordinates $x_i$, $x_j$), the RPE intrinsically references $i - j$ or $x_i - x_j$. In the canonical form, the attention logit in a single Transformer head is augmented as

$$e_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b(i - j),$$

where $b$ is a (possibly learned or fixed) bias function of the relative offset. For graph and multiview data, more general structures (such as shortest-path distance, resistance distance, or inter-camera projective transforms) can play the role of $i - j$.
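The additive form above can be made concrete in a few lines. The sketch below (using arbitrary random data and a hypothetical learned bias table) computes logits with an offset-indexed bias and checks that the positional term is constant along diagonals, i.e., depends only on $i - j$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # sequence length, head dimension
q = rng.standard_normal((n, d))  # queries
k = rng.standard_normal((n, d))  # keys

# Hypothetical learned bias table, indexed by relative offset i - j
# ranging over [-(n-1), n-1].
bias_table = rng.standard_normal(2 * n - 1)

offsets = np.arange(n)[:, None] - np.arange(n)[None, :]   # matrix of i - j
bias = bias_table[offsets + n - 1]                        # Toeplitz bias matrix
logits = q @ k.T / np.sqrt(d) + bias

# The positional term depends only on the offset: constant along diagonals.
assert np.allclose(np.diag(bias, k=1), bias[0, 1])
```

The Toeplitz structure of `bias` is exactly what makes the encoding translation-invariant: shifting the whole sequence re-indexes queries and keys but leaves every diagonal of the bias unchanged.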
Several architectures further multiply or rotate the query/key vectors with position-dependent operators $R(i)$, so that the attention logit is

$$e_{ij} = (R(i)\,q_i)^\top (R(j)\,k_j) = q_i^\top R(i)^\top R(j)\,k_j,$$

with $R(i)^\top R(j) = R(j - i)$ depending only on $j - i$ in the appropriate group structure. This form captures all group-action–based relative encodings (LieRE, RoPE, GRAPE, etc.) (Ostmeier et al., 2024, Zhang et al., 8 Dec 2025). The group property ensures that translations or equivariant transformations do not alter the relative attention pattern, which is crucial for invariance and compositionality.
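The group property is easy to verify numerically for the simplest case, $SO(2)$ rotations with a single (hypothetical) frequency, which is the building block of rotary-style schemes:

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2x2 rotation matrix: the SO(2) block underlying rotary-style encodings."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

base = 0.1                      # per-position rotation angle (illustrative value)
R = lambda pos: rot(base * pos)

i, j = 3, 7
# Group property: R(i)^T R(j) = R(j - i), so the logit sees only the offset.
lhs = R(i).T @ R(j)
assert np.allclose(lhs, R(j - i))

# Consequently, shifting both positions by any t leaves the product unchanged.
t = 11
assert np.allclose(R(i + t).T @ R(j + t), lhs)
```

The same identity holds blockwise for higher-dimensional rotations, and (with appropriate generators) for the richer Lie-group actions used by LieRE and GRAPE.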
For general attention kernels, RPEs in linearized settings often require the existence of a decomposition $g(i - j) = \phi(i)^\top \psi(j)$ (or $g(i - j) = \mathbb{E}[\phi(i)^\top \psi(j)]$ in stochastic variants) for some position-dependent feature maps $\phi$, $\psi$.
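A minimal concrete instance of such a decomposition, for an assumed kernel $g(i-j) = \cos(\theta(i-j))$, follows from the angle-difference identity $\cos(\theta i - \theta j) = \cos(\theta i)\cos(\theta j) + \sin(\theta i)\sin(\theta j)$:

```python
import numpy as np

theta = 0.3   # illustrative frequency
# Position feature map: phi(i)^T phi(j) = cos(theta * (i - j)) exactly,
# which is the factorized structure linearized RPEs exploit.
phi = lambda p: np.array([np.cos(theta * p), np.sin(theta * p)])

for i in range(5):
    for j in range(5):
        assert np.isclose(phi(i) @ phi(j), np.cos(theta * (i - j)))
```

Richer offset functions are handled by widening the feature map (more frequencies, or random features in the stochastic case), at the cost of a larger product space.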
2. Taxonomy of Relative Positional Encoding Schemes
Relative encodings encompass a broad spectrum of parameterizations:
- Additive bias tables: Explicitly learn or fix a vector or scalar for each relative offset (Shaw et al. (Huang et al., 2020), T5, ALiBi (Li, 5 Jun 2025, He et al., 2024), KERPLE (Zhang et al., 2024)).
- Multiplicative/group action schemes: Rotate or otherwise transform query and key vectors by group elements parameterized by position (RoPE (Gu et al., 19 May 2025), LieRE (Ostmeier et al., 2024), GRAPE-M (Zhang et al., 8 Dec 2025)).
- Low-rank or kernel approximations for linear Transformers: Realize RPEs via product of position-dependent feature maps, e.g., stochastic positional encoding (SPE) (Liutkus et al., 2021), linearized RPE (LRPE) (Qin et al., 2023).
- Toeplitz or spectral bias kernels: Model as a function inducing a Toeplitz matrix and study its spectral properties for expressivity/stability (Gu et al., 19 May 2025).
- Geometric/graph encodings: Use graph-theoretic distances, resistance, magnetic Laplacian embeddings, or camera frustum relations for relative encoding over more general data domains (Black et al., 2024, Huang et al., 2024, Li et al., 14 Jul 2025).
- Binary codes and quantized schemes for spiking/binary SNNs: Employ Gray code–based or logarithmic code RPEs compatible with binary network operations (Lv et al., 28 Jan 2025).
- Hybrid approaches: Blend segment-level absolute and inter-segment relative encoding (BiPE (He et al., 2024)), or mix identity and rotary streams as in MLA (Gu et al., 19 May 2025).
A technical table below summarizes core classes:
| Scheme | Parameterization Domain | Notable Property |
|---|---|---|
| Additive bias (Shaw/ALiBi) | Scalar or vector $b(i-j)$ per offset | Linear/learned bias, streaming-friendly |
| Multiplicative group (RoPE, LieRE, GRAPE) | Lie group ($SO(d)$) matrices | High-capacity, group-theoretic compositionality |
| Kernel/spectral (Toeplitz, wavelet) | Operators, Toeplitz/spectral | Spectral contraction, condition number control |
| Linearized (SPE, LRPE) | Low-rank product space | $O(n)$ compute and memory |
| Geometric/graph-wise | Graph metrics, $SE(3)$, projective groups | Data-dependent invariants, heterogeneous domains |
3. Theoretical Properties: Expressivity, Generalization, Extrapolation
Relative positional encodings fundamentally alter the expressivity and generalization properties of transformer models. The expressivity of an RPE-equipped transformer is characterized by its ability to represent invariant functions of input that depend only on pairwise or groupwise relative positional relations, not absolute sequence position. Formal results articulate that:
- The class of functions implementable by an $L$-layer, $H$-head transformer with relative PE is dense in the set of continuous functions invariant under position translation (Li, 5 Jun 2025).
- Pure RPE models cannot represent tasks that require absolute position awareness (e.g., “is token $i$ in the first 10% of the sequence?”), while any function of relative offset is fully expressible (Li, 5 Jun 2025).
- For generalization, Rademacher complexity bounds show that regularization/decay on the parameter norms of the offset bias or generator controls overfitting and yields uniform convergence rates comparable to classic absolute encodings (Li, 5 Jun 2025).
- Extrapolation to unseen sequence lengths is strictly impossible for learned absolute encodings and for compactly parameterized RPEs with a hard cutoff/clipping at a maximum offset, but is well-supported by ALiBi (linear slope), HyPE (hyperbolic) (Angelotti, 2023), and kernel/spectral bases (Li, 5 Jun 2025, Gu et al., 19 May 2025), as their functional forms remain well-defined for arbitrary offsets (He et al., 2024).
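The extrapolation contrast can be sketched directly: an ALiBi-style bias is a closed-form function of the offset (here with one illustrative slope, not a trained model), so it evaluates at lengths never seen during training, where a finite learned table would simply have no entry:

```python
import numpy as np

def alibi_bias(n: int, slope: float = 0.5) -> np.ndarray:
    """Causal ALiBi-style bias: -slope * (i - j) for j <= i, -inf above the diagonal."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    bias = -slope * (i - j).astype(float)
    bias[j > i] = -np.inf          # mask future positions
    return bias

# The functional form needs no lookup table: a model "trained" at length 8
# produces finite, consistent biases at length 32 without any new parameters.
short, long = alibi_bias(8), alibi_bias(32)
assert np.isfinite(long[31, 0])               # offset 31 never occurs at length 8
assert np.allclose(long[:8, :8], short)       # agrees exactly on the short range
```

The same argument applies to hyperbolic and spectral parameterizations: extrapolation quality is a property of the functional form, not of training length.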
4. Implementation Methodologies in Transformer Architectures
RPEs are integrated into Transformers at various algorithmic locations:
- Key-query logit modification: Augment the dot-product $q_i^\top k_j$ with an additive term $b(i - j)$, or more generally with offset-indexed vectors entering the dot product. This is directly implemented in the additive and kernel/spectral schemes (Huang et al., 2020, Huang et al., 2024).
- Query/key transformation: Pre-multiply $q_i$ and $k_j$ by position-dependent operators $R(i)$ and $R(j)$, as in RoPE, LieRE, and GRAPE, so that $R(i)^\top R(j)$ depends only on $j - i$ (Ostmeier et al., 2024, Zhang et al., 8 Dec 2025, Gu et al., 19 May 2025).
- Attention kernel design for linear transformers: Constrain the positional kernel $g(i - j)$ to admit a decomposition $g(i - j) = \phi(i)^\top \psi(j)$, which permits low-memory, fast-transform methods (Liutkus et al., 2021, Qin et al., 2023).
- Graph Transformer bias injection: For general data, a precomputed matrix $B_{uv} = f(d(u, v))$, where $d$ is the shortest-path or another graph metric, is added to the attention logits (Black et al., 2024).
- Specialized domain encodings: In computer vision or 3D perception, geometric relation matrices (projective, $SE(3)$, etc.) transform token features or attention kernels to render them invariant to camera pose, scene rearrangement, or projection (Li et al., 14 Jul 2025).
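The graph-bias route above reduces to two steps: compute a pairwise graph metric, then map each distance through a (learned) table into an additive logit bias. A minimal sketch, with BFS shortest paths on a toy graph and a hypothetical bias table:

```python
import numpy as np
from collections import deque

def shortest_path_matrix(adj: dict, n: int) -> np.ndarray:
    """All-pairs shortest-path lengths via BFS on an unweighted graph."""
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, []):
                if D[s, v] == np.inf:       # first visit = shortest distance
                    D[s, v] = D[s, u] + 1
                    queue.append(v)
    return D

# 4-node path graph 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
D = shortest_path_matrix(adj, 4)

# Hypothetical learned table: one scalar bias per distance value, added to the
# attention logits of a graph Transformer (distant nodes attenuated more).
bias_table = np.array([0.0, -0.5, -1.0, -1.5])
B = bias_table[D.astype(int)]
assert B[0, 3] == -1.5 and B[1, 2] == -0.5
```

Resistance distance or spectral embeddings slot into the same template by replacing the BFS metric with the corresponding pairwise quantity.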
Efficient implementations exploit blockwise decomposition (e.g., $2 \times 2$ rotation blocks in rotary/GRAPE), precompute transformations for repeated positions, use streaming caches for autoregression, and exploit structure (Toeplitz, group action) for batched computation.
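The blockwise decomposition can be sketched as follows: a standard rotary-style transform rotating consecutive channel pairs by position-dependent angles (frequencies chosen in the usual geometric schedule; the specific base is illustrative). The final check confirms the relative property, namely that shifting every position by a constant leaves the logit matrix unchanged:

```python
import numpy as np

def apply_rotary(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive channel pairs of x by position-dependent angles
    (the 2x2-block realization of rotary encodings)."""
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per 2x2 block
    ang = pos[:, None] * freqs[None, :]           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q = rng.standard_normal((5, 8))
k = rng.standard_normal((5, 8))
pos = np.arange(5, dtype=float)

# Rotating both sides makes q_i . k_j depend only on i - j: a global shift
# of all positions does not change the logits.
logits = apply_rotary(q, pos) @ apply_rotary(k, pos).T
shifted = apply_rotary(q, pos + 100) @ apply_rotary(k, pos + 100).T
assert np.allclose(logits, shifted)
```

In a streaming decoder the rotated keys are what gets cached, so each new token requires only one fresh rotation rather than recomputing the whole prefix.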
5. Empirical Evaluation and Domain-Specific Impact
Relative positional encoding schemes achieve state-of-the-art results across a spectrum of tasks and modalities:
- Vision: LieRE outperforms RoPE-Mixed and absolute positional encoding on CIFAR100 (+2.7% over RoPE-Mixed, +5.5% over absolute) and UCF101/RSNA/3D tasks (+2.5% to +6.7% gains), with enhanced data/compute efficiency and robustness to patch shuffling (Ostmeier et al., 2024).
- Speech and audio: T5-style and kernel RPEs (KERPLE) yield persistent improvements in PESQ/ESTOI across SNRs and outperform both fixed and learned absolute embeddings in noncausal Transformers (Zhang et al., 2024). Relative encoding in speech Transformers boosts WER/BLEU robustness under segmentation quality shifts, exceeding absolute PE in both ASR and speech-to-text translation (Pham et al., 2020).
- Language modeling and QA: RPEs generalizing both query- and key-relative interactions (including full three-way dot product forms) deliver up to 2 F1 improvements on SQuAD and maintain accuracy at extended sequence lengths (Huang et al., 2020). Wavelet-based RPEs have leading extrapolation on synthetic long-context tasks (Li, 5 Jun 2025).
- Graph learning: Multi-$q$ Magnetic Laplacian PEs in directed graphs permit full recovery of walk profiles; empirically, they reduce RMSE in distance/walk-prediction by up to 70% over Laplacian or SVD-based encodings, and substantially outperform single-$q$ and random-walk baselines on circuit, sorting, and synthetic tasks (Huang et al., 2024).
- Multi-view vision: Projective Positional Encoding (PRoPE), invariant to both SE(3) and camera intrinsics, yields highest PSNR/LPIPS/SSIM on novel view synthesis, maintains robustness under out-of-distribution focal lengths, and improves discriminative spatial cognition in geometric tasks (Li et al., 14 Jul 2025).
6. Design Insights, Limitations, and Future Directions
The structure and choice of relative positional encoding are shaped by practical, architectural, and theoretical considerations:
- Spectral contraction: Multiplicative Toeplitz-based schemes (e.g., RoPE, LieRE, GRAPE) produce better-conditioned logit matrices than additive-only or bias-based designs, which accelerates and stabilizes optimization (Gu et al., 19 May 2025).
- Group-theoretic structure: Encodings founded on Lie group actions guarantee exact or asymptotic dependence of attention solely on relative position, with higher representational capacity unlocked by enriching the dimension and noncommutativity of acting groups (as in LieRE and general GRAPE).
- Regularization and expressivity: For learned RPE tables, norm regularization is vital to prevent overfitting, especially in regimes of limited data or long-sequence extrapolation (Li, 5 Jun 2025).
- Domain-specific demands: Tasks requiring precise retrieval or copying from arbitrary context make strict long-range decay in the encoding (as in ALiBi) suboptimal; high-frequency or non-decaying encodings (HoPE) can improve both extrapolation and context-awareness (Chen et al., 2024).
- Streaming and compatibility: RPEs that admit efficient incremental computation and cacheability (e.g., additive GRAPE, ALiBi) are suited to large-scale, streaming, or autoregressive deployment.
- Hybridization and composition: BiPE and related schemes that blend absolute and relative encodings (intra vs. inter-segment) empirically and theoretically offer improved performance, especially in extrapolation (He et al., 2024).
Observable limitations include the inability of pure RPEs to model absolute-position-dependent phenomena, the failure of finite-capacity learned tables under extreme extrapolation, and, for some non-group-theoretic methods, increased memory or computational overhead without a corresponding gain in performance.
7. Domain Extensions and Specialized Architectures
Relative encoding principles admit direct extension to a variety of advanced settings:
- Binary and spiking networks: Gray-PE and Log-PE inject positional information compatible with binary SNN hardware, supporting time-series, text, and vision tasks with negligible computational overhead (Lv et al., 28 Jan 2025).
- Graph and geometric data: Multivariate, permutation-equivariant relative encodings via shortest path, resistance, spectral, or projective transformations permit Transformers to operate on irregular geometries, directed networks, and multi-view data, achieving close to best-known accuracies on circuit-, vision-, and program-analysis benchmarks (Huang et al., 2024, Li et al., 14 Jul 2025).
- Linear-Complexity Transformers: Structured decompositions (LRPE, SPE) allow full relative encoding even when strict $O(n)$ complexity is imposed, as needed for single-pass multi-thousand-token contexts (Liutkus et al., 2021, Qin et al., 2023).
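The linear-complexity point can be demonstrated end to end for unnormalized linear attention with the cosine offset kernel from Section 1 (all shapes and the frequency are illustrative): folding a small positional feature map into queries and keys lets a single running summary replace the $n \times n$ attention matrix, while reproducing the quadratic result exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
theta = 0.2

# Positional feature map with f(i) . f(j) = cos(theta * (i - j)).
f = lambda p: np.array([np.cos(theta * p), np.sin(theta * p)])

# Quadratic reference: A[i, j] = (q_i . k_j) * cos(theta * (i - j)).
pos = np.arange(n)
A = (q @ k.T) * np.cos(theta * (pos[:, None] - pos[None, :]))
ref = A @ v

# Linear-time form: fold the 2-dim positional factor into queries/keys and
# accumulate one (2*d, d) summary state instead of the n x n matrix.
F = np.stack([f(p) for p in pos])                     # (n, 2)
qf = np.einsum('nd,nr->nrd', q, F).reshape(n, -1)     # (n, 2*d)
kf = np.einsum('nd,nr->nrd', k, F).reshape(n, -1)
S = kf.T @ v                                          # O(n) to build
out = qf @ S
assert np.allclose(out, ref)
```

In a practical linear Transformer the summary `S` is accumulated causally token by token, which is what yields single-pass processing of very long contexts; softmax normalization (omitted here) is handled by an analogous running denominator.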
The design space for RPEs is thus substantial and expanding, with active research mapping the trade-offs between expressivity, computation, extrapolation, and robustness required by evolving applications.
References
- (Ostmeier et al., 2024) LieRE: Lie Rotational Positional Encodings
- (Zhang et al., 8 Dec 2025) Group Representational Position Encoding
- (Gu et al., 19 May 2025) Unpacking Positional Encoding in Transformers: A Spectral Analysis of Content-Position Coupling
- (Li, 5 Jun 2025) Theoretical Analysis of Positional Encodings in Transformer Models
- (Huang et al., 2020) Improve Transformer Models with Better Relative Position Embeddings
- (He et al., 2024) Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
- (Pham et al., 2020) Relative Positional Encoding for Speech Recognition and Direct Translation
- (Zhang et al., 2024) An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement
- (Lv et al., 28 Jan 2025) Toward Relative Positional Encoding in Spiking Transformers
- (Chen et al., 2024) HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
- (Qin et al., 2023) Linearized Relative Positional Encoding
- (Liutkus et al., 2021) Relative Positional Encoding for Transformers with Linear Complexity
- (Li et al., 14 Jul 2025) Cameras as Relative Positional Encoding
- (Huang et al., 2024) What Are Good Positional Encodings for Directed Graphs?
- (Black et al., 2024) Comparing Graph Transformers via Positional Encodings