Relative Position Representations
- Relative position representations are encoding methods that capture spatial, sequential, or graph-based relationships between tokens or nodes.
- They employ mechanisms like additive biases, rotary embeddings, and multiplicative gates within self-attention and message-passing frameworks to improve model generalization.
- Empirical results demonstrate improved performance in NLP, vision, and 3D tasks by enhancing translation invariance and structural awareness.
Relative position representations are methods for encoding the positional relationships between tokens, objects, or nodes in models based on self-attention or message-passing, such that pairwise or structural spatial information directly modulates the computation of model outputs. These representations allow models to be equivariant or sensitive to translational, sequential, or graph-theoretic relationships, conferring improved generalization, translation invariance, and structural awareness over purely absolute positional encodings. Relative positional encoding is foundational in advanced Transformers, graph neural networks, and spatial reasoning architectures across language, vision, and multimodal domains.
1. Foundations: Absolute vs. Relative Position Encodings
Classic Transformer architectures as introduced in Vaswani et al. (2017) encode positional information via absolute encodings—injecting fixed or learned vectors denoting sequence position into token embeddings. This mechanism breaks the permutation equivariance of attention but does not directly model pairwise distances. Relative position representations, in contrast, provide a mechanism by which model computations depend on (and are typically parametrized by) the distance or relational structure between elements, not just their absolute index (Shaw et al., 2018).
Relative position encodings can be additive (e.g., biasing attention logits with learned vectors indexed by the signed offset between positions) or more structurally embedded, as in rotary-encoded dot products, multiplicative gates, or deep polynomial interactions (Huang et al., 2020, Ostmeier et al., 2024). These mechanisms have been shown empirically to improve generalization, especially to data outside the training domain, e.g., longer sequences, larger image patches, or greater spatial deformations (Chen, 2021, Kim et al., 2024). In graph transformers, relative position encodings are required to capture the underlying relational topology (Black et al., 2024).
2. Canonical Relative Position Mechanisms
The primary mechanisms for relative position representations include:
- Shaw-style additive RPE: The seminal method (Shaw et al., 2018) incorporates learned relative embeddings into the self-attention computation. For sequence positions $i$ and $j$, a clipped signed offset $\mathrm{clip}(j-i,-k,k)$ indexes a learned embedding $a^K_{ij}$. For each attention head, the score is $e_{ij} = \frac{x_i W^Q (x_j W^K + a^K_{ij})^\top}{\sqrt{d_z}}$, and the aggregated output incorporates an analogous value-side bias $a^V_{ij}$ (a minimal sketch follows this list).
- Rotary Positional Embedding (RoPE): Introduced to encode position via phase rotations in 2D planes within the embedding space. For token position $m$, a block-diagonal rotation matrix $R_{\Theta,m}$ rotates key and query vectors such that the inner product $(R_{\Theta,m} q)^\top (R_{\Theta,n} k)$ depends only on $m - n$. RoPE's extension to spherical and high-dimensional rotations enables encoding positions on spheres or volumetric grids (Unlu, 2023, Ostmeier et al., 2024).
- Advanced multiplicative/gated RPE: Recent methods generalize beyond purely additive forms, introducing multiplicative gates, triple products, or full three-way dot products among queries, keys, and relative-position embeddings, enabling more expressive interactions (Huang et al., 2020).
- Permutation-based and anchor-based methods: In long-sequence or graph settings, relative positions may be encoded by compositional permutations or via learned/sampled anchor points whose shortest-path distances to nodes form relative position descriptors (Chen, 2021, Qin et al., 2021).
- Pairwise continuous or off-grid representations: Some visual and geometric models replace grid-based encodings with representations derived from continuous relative translations, scales, or landmark offsets, as in PART for images and KP-RPE for vision transformers (Ayoughi et al., 2025, Kim et al., 2024).
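As a concrete illustration of the additive mechanism in the first bullet, the following NumPy sketch implements single-head self-attention with Shaw-style relative key and value embeddings. The function name, tensor shapes, and random toy weights are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def shaw_relative_attention(x, Wq, Wk, Wv, rel_k, rel_v, max_dist):
    """Single-head self-attention with Shaw-style relative position embeddings.

    x:        (n, d) token representations
    Wq/Wk/Wv: (d, d) projection matrices
    rel_k:    (2*max_dist + 1, d) relative-key embeddings, indexed by clipped offset
    rel_v:    (2*max_dist + 1, d) relative-value embeddings
    """
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Clipped signed offsets j - i, shifted to [0, 2*max_dist] for table lookup.
    offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                      -max_dist, max_dist) + max_dist
    a_k = rel_k[offsets]                      # (n, n, d) relative key bias a_ij^K
    a_v = rel_v[offsets]                      # (n, n, d) relative value bias a_ij^V

    # e_ij = q_i (k_j + a_ij^K)^T / sqrt(d)
    logits = (q @ k.T + np.einsum('id,ijd->ij', q, a_k)) / np.sqrt(d)
    alpha = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)

    # z_i = sum_j alpha_ij (v_j + a_ij^V)
    return alpha @ v + np.einsum('ij,ijd->id', alpha, a_v)

# Toy usage with random weights.
rng = np.random.default_rng(0)
n, d, max_dist = 6, 8, 4
out = shaw_relative_attention(rng.normal(size=(n, d)),
                              *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)),
                              rng.normal(size=(2 * max_dist + 1, d)) * 0.1,
                              rng.normal(size=(2 * max_dist + 1, d)) * 0.1,
                              max_dist)
print(out.shape)  # (6, 8)
```

Because the bias tables are indexed only by the clipped offset, the same parameters apply to any sequence length, which is what underlies the extrapolation behavior discussed in Section 5.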
3. Mathematical Formalism and Varieties
Relative position representations span several domains, with formal instantiations including:
- Sequences: Positions are indexed by integer offsets $j - i$, with learned or parameterized embedding tables per offset up to a clipping distance $k$ (Shaw et al., 2018, Huang et al., 2020).
- Grids and Spheres: 2D/3D inputs use either discretized displacement vectors or parametric functions mapping continuous displacements or spherical angles to rotation or embedding matrices (Unlu, 2023, Ostmeier et al., 2024).
- Graphs: Pairwise node relationships are represented by shortest-path distance, resistance distance, or spectral kernels; RPEs here are functions preserved under graph isomorphism (Black et al., 2024). PSGNN uses learned anchors whose relative distances are transformed by a non-linear embedding (Qin et al., 2021). A minimal shortest-path example follows this list.
- Structural (syntactic/code) positions: In code summarization and NLP, relative positions derive from structural relations in the underlying tree (e.g., AST or dependency tree), with edge embeddings parameterized by structural path length (Gong et al., 2022, Wang et al., 2019).
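To make the graph case concrete, the sketch below computes clipped shortest-path distances with BFS and uses them to index a per-distance bias table, in the spirit of the graph-transformer RPEs cited above; the clipping scheme, helper names, and scalar bias table are assumptions for illustration only.

```python
import numpy as np
from collections import deque

def shortest_path_rpe(adj, max_dist):
    """Pairwise shortest-path distances via BFS, clipped to max_dist,
    usable as a relative-position index matrix for a graph transformer."""
    n = len(adj)
    dist = np.full((n, n), max_dist)          # unreachable pairs stay clipped at max_dist
    for src in range(n):
        dist[src, src] = 0
        queue, seen = deque([src]), {src}
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and v not in seen:
                    seen.add(v)
                    dist[src, v] = min(dist[src, u] + 1, max_dist)
                    queue.append(v)
    return dist

# Toy 4-node path graph 0-1-2-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
rpe_index = shortest_path_rpe(adj, max_dist=3)               # entries in {0, 1, 2, 3}
bias_table = np.random.default_rng(0).normal(size=4) * 0.1   # one learned bias per distance bucket
attn_bias = bias_table[rpe_index]                            # (n, n) additive bias on attention logits
print(rpe_index)
```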
These mechanisms are formalized in multiple mathematical forms (a schematic sketch follows the list):
- Additive: $e_{ij} \propto x_i W^Q \,(x_j W^K + a_{ij})^\top$;
- Multiplicative: $e_{ij} \propto \sum_{d}\,(x_i W^Q)_d\,(x_j W^K)_d\,(a_{ij})_d$, an element-wise gated triple product;
- Rotational: $e_{ij} \propto (R_{\Theta,i}\, x_i W^Q)^\top (R_{\Theta,j}\, x_j W^K)$, a function of $i - j$ only;
- Full interaction: $e_{ij} \propto x_i W^Q (x_j W^K)^\top + x_i W^Q\, a_{ij}^\top + x_j W^K\, a_{ij}^\top$ (Huang et al., 2020, Unlu, 2023, Ostmeier et al., 2024).
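The following schematic sketch evaluates each of the four forms on toy query/key/relative vectors; the omitted scaling factors and the rotary base frequency are illustrative assumptions rather than any particular paper's parameterization.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix, used for the rotary form below."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def additive(q, k, a):        # e_ij ~ q_i (k_j + a_ij)^T
    return q @ (k + a)

def multiplicative(q, k, a):  # e_ij ~ sum_d q_id * k_jd * a_ijd  (element-wise gate)
    return np.sum(q * k * a)

def rotational(q, k, i, j, base=10.0):  # e_ij ~ (R_i q)^T (R_j k), depends only on i - j
    return (rot(i / base) @ q) @ (rot(j / base) @ k)

def full_interaction(q, k, a):  # e_ij ~ q.k + q.a + k.a  (all pairwise dot products)
    return q @ k + q @ a + k @ a

rng = np.random.default_rng(0)
q, k, a = rng.normal(size=(3, 2))   # toy 2-dimensional query, key, and relative embedding
print(additive(q, k, a), multiplicative(q, k, a),
      rotational(q, k, i=5, j=2), full_interaction(q, k, a))
```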
4. Applications in Language, Vision, and Graphs
Relative position representations are adopted in a range of domains:
- Language Modeling and Translation: RPEs yield consistent BLEU gains in translation benchmarks. Predicting pairwise relative positions also enables dense self-supervised objectives, providing label-rich pretraining (Shaw et al., 2018, Brüel-Gabrielsson et al., 2022).
- Vision Transformers: In ViT architectures, relative and keypoint-anchored RPEs impart robustness to misalignments and geometric transformations, improving unaligned face identification and gait recognition (Kim et al., 2024). Off-grid or continuous relative encodings improve spatial precision in detection and temporal modeling (Ayoughi et al., 2025).
- Graph Transformers and GNNs: Relative structural embeddings derived from graph distances, resistance, or spectral transforms are essential for breaking node-permutation symmetry and delivering expressive, topology-aware representations (Black et al., 2024, Qin et al., 2021).
- 3D Geometric Reasoning: In 3D vision tasks, relative position-aware attention across object pairs enables accurate localization and relational reasoning, as in 3DRP-Net for 3D visual grounding (Wang et al., 2023).
5. Theoretical Properties and Inductive Generalization
Relative position encodings confer advantageous inductive properties:
- Translation invariance and extrapolation: RPEs, by construction, produce attention patterns invariant to global shifts, unlike absolute encodings, and support longer or shifted contexts without new parameters (Chen, 2021, Huang et al., 2020); a numerical illustration follows this list.
- Expressiveness in graphs: On graphs, the theoretical power of RPE-augmented transformers equals that of APE-augmented transformers under mild conditions, though combinatorially aware RPEs (such as shortest-path distance) can strictly refine the Weisfeiler–Leman test, surpassing ordinary message passing in distinguishing non-isomorphic graphs (Black et al., 2024).
- Capacity and generality: Generalizations to high-dimensional and manifold-aware rotations (LieRE, spherical RoPE) enable direct preservation of geodesic distances and support large-scale, modality-agnostic applications (Unlu, 2023, Ostmeier et al., 2024). Full three-way or polynomial interactions expand the model’s representational capacity (Huang et al., 2020, Pandya, 2022).
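A quick numerical check of the shift invariance noted above, assuming a toy 2D rotary score: shifting both positions by the same amount leaves the score unchanged, because it depends only on the offset between them.

```python
import numpy as np

def rope_score(q, k, m, n, base=10.0):
    """Toy 2D rotary score: rotate q by angle m/base and k by n/base, then take the dot product."""
    def rot(t):
        return np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t),  np.cos(t)]])
    return (rot(m / base) @ q) @ (rot(n / base) @ k)

rng = np.random.default_rng(1)
q, k = rng.normal(size=(2, 2))
# Shifting both positions by the same offset leaves the score unchanged (up to float error).
for shift in (0, 7, 100):
    print(round(rope_score(q, k, 3 + shift, 9 + shift), 8))
```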
6. Empirical Performance and Comparative Results
Relative position representations demonstrate consistent improvements:
- NLP tasks: On SQuAD1.1, method 4 (pairwise dot-product relative encoding) yields F1=90.53 over an absolute baseline of 88.59, also matching or exceeding performance on GLUE and machine translation (Huang et al., 2020, Shaw et al., 2018).
- Vision and multimodal: LieRE achieves top-1 accuracy 69.4%/68.8% on CIFAR-100/ImageNet, outperforming RoPE variants and absolute position baselines by 1.5+% (Ostmeier et al., 2024). KP-RPE improves face verification accuracy to 93.56% on CFP-FP (vs. 72.81% for vanilla ViT) (Kim et al., 2024). PART attains an improvement of 0.3–1.0 AP in COCO detection and >2% Cohen's κ in time-series classification over grid-based alternatives (Ayoughi et al., 2025).
- Graph and 3D grounding: PSGNNs boost AUC by 10–20% in position-aware node/link tasks (Qin et al., 2021). 3DRP-Net lifts 3D localization accuracy by 2.45–2.47 points relative to prior methods (Wang et al., 2023).
Performance generally correlates positively with the expressiveness of the relative encoding and the extent to which the downstream task rewards structural or pairwise awareness.
7. Extensions, Limitations, and Future Directions
Relative position encoding research continues to advance:
- Geometric generalization: Rotational encodings on Lie groups (LieRE), spherical and hyperbolic parametrizations, and keypoint-anchored variants extend RPEs beyond simple translational offsets to arbitrary manifolds and structured datasets (Unlu, 2023, Ostmeier et al., 2024, Kim et al., 2024).
- Scalability and complexity: Methods such as LieRE and PermuteFormer achieve O(N) scaling with respect to sequence or node count, enabling large-scale deployment without quadratic cost (Ostmeier et al., 2024, Chen, 2021).
- Universal applicability: Off-grid, pairwise, or conditional relative encodings (e.g., PART, KP-RPE) are being adapted to video, medical imaging, and non-visual time series (Ayoughi et al., 2025, Kim et al., 2024).
- Expressivity vs. efficiency trade-offs: Full pairwise parameterization increases model capacity at some computational cost; methods seek to balance expressiveness with head/parameter sharing and fast matrix algebra (Huang et al., 2020, Pandya, 2022).
- Theoretical unification: APEs and RPEs can be formally interconverted without loss of distinguishing power for finite graphs, so practical implementation often determines the optimal choice (Black et al., 2024).
Limitations include increased memory/compute cost for fully pairwise encodings and challenges in encoding nonrigid, non-translational relationships outside traditional settings. Ongoing work focuses on expanding efficiency, applicability, and inductive robustness.
References:
(Shaw et al., 2018, Huang et al., 2020, Chen, 2021, Unlu, 2023, Ostmeier et al., 2024, Kim et al., 2024, Ayoughi et al., 2025, Qin et al., 2021, Black et al., 2024, Brüel-Gabrielsson et al., 2022, Pandya, 2022, Wang et al., 2019, Gong et al., 2022, Wang et al., 2023)