Reference Positional Encoding (RefPE)
- Reference Positional Encoding (RefPE) is a framework that generalizes classical positional encoding by leveraging learnable transformations and kernel approximations for robust position representation.
- It extends traditional methods by capturing continuous, multi-dimensional spatial, sequential, or structural relationships through techniques like Fourier features, shifted basis functions, and polynomial bases.
- Empirical results show that RefPE enhances convergence, generalization, and interpretability across vision, language, molecular, and graph neural network applications.
Reference Positional Encoding (RefPE) is a class of architectures and theoretical approaches that generalizes and extends classical positional encoding mechanisms, providing more flexible, robust, and often more interpretable positional representations in neural network models. RefPE approaches explicitly model spatial, sequential, or structural relationships in a way that supplies inductive bias while generalizing across data modalities, dimensionalities, and input scales. Core to RefPE is the capacity to encode or reference explicit relationships, often leveraging kernel methods, manifold or topological anchors, or algebraic structures, yielding positional encodings that are adaptive, learnable, or theoretically grounded for the application domain.
1. Theoretical Foundations and Motivation
Attention-based deep models such as Transformers fundamentally lack a notion of position due to permutation invariance. Standard approaches to positional encoding include hard-coded sinusoidal encodings and learned positional embeddings, which primarily target linear or discrete sequences and offer limited generalization to multi-dimensional or continuous input spaces. RefPE frameworks address these limitations by providing:
- Continuous, multi-dimensional positional referencing: By mapping positions into high-dimensional feature spaces via parameterized or learnable transformations, RefPE captures spatial and structural relationships beyond ordinal indices.
- Shift-invariant and kernel-induced similarity: Techniques such as learnable Fourier features create encodings whose dot products approximate, for example, Gaussian kernels or other distance-based similarities, providing a natural inductive bias (e.g., encoding Euclidean proximity) (Li et al., 2021); a numerical sketch of this kernel approximation appears after this list.
- Higher expressivity and stability: RefPE methods often employ modulations through MLPs, polynomial bases, splines, or topological references, allowing the encoding function to be continuous, smooth, and robust to input distribution shifts (Gao et al., 2022, Aggarwal, 29 Apr 2024, Verma et al., 6 Jun 2025).
- Unification of absolute and relative encoding: Theoretical insights indicate that dense, unbiased bases (e.g., Legendre polynomials, full DFT bases) allow both absolute and relative positional information to be encoded inherently, sometimes obviating the need for explicit relative encoding (Aggarwal, 29 Apr 2024, Idé et al., 15 May 2024).
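Picking up the kernel-induced similarity bullet above, the following minimal NumPy sketch, assuming the learnable-Fourier construction detailed in Section 2 with illustrative dimensions, checks numerically that the dot product of two encodings tracks a scaled Gaussian kernel of their distance:

```python
# Minimal numerical check (illustrative, not from the cited papers): the dot
# product of random Fourier position features approximates a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
M, D, gamma = 2, 4096, 1.0                 # position dim, feature dim, kernel width
W = rng.normal(0.0, 1.0 / gamma, size=(D // 2, M))   # rows w_k ~ N(0, gamma^-2 I)

def fourier_features(x):
    """r_x = (1/sqrt(D)) [cos(x W^T) || sin(x W^T)]."""
    proj = x @ W.T
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x = np.array([0.3, -0.1])
for dist in [0.0, 0.5, 1.0, 2.0]:
    y = x + np.array([dist, 0.0])          # shift x by `dist` along one axis
    dot = fourier_features(x) @ fourier_features(y)
    kernel = 0.5 * np.exp(-dist**2 / (2 * gamma**2))
    print(f"|x-y|={dist:.1f}  r_x.r_y={dot:+.4f}  Gaussian/2={kernel:.4f}")
```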
2. Methodological Principles
A variety of concrete RefPE methodologies have been proposed:
Learnable Fourier Features with MLP Modulation
Given a position $x \in \mathbb{R}^M$, the core RefPE construction is

$$r_x = \frac{1}{\sqrt{D}}\left[\cos\left(x W_r^\top\right) \,\Vert\, \sin\left(x W_r^\top\right)\right],$$

where $W_r \in \mathbb{R}^{D/2 \times M}$ is a learnable parameter matrix (randomly initialized, typically as $W_r \sim \mathcal{N}(0, \gamma^{-2})$). The representation is normalized by $\sqrt{D}$.
The shift-invariant property follows from

$$r_x \cdot r_y = \frac{1}{D} \sum_{k=1}^{D/2} \cos\!\left(w_k^\top (x - y)\right) \approx \tfrac{1}{2}\, e^{-\lVert x - y \rVert^2 / (2\gamma^2)},$$

inducing a Gaussian kernel approximation (the approximation is exact in expectation over $w_k \sim \mathcal{N}(0, \gamma^{-2} I)$).
An MLP with parameters $\theta$ further modulates this embedding:

$$\mathrm{PE}_x = \mathrm{MLP}_\theta(r_x)\, W_p,$$

where $W_p$ is a learnable projection. The MLP allows task-specific restructuring of spatial similarity, enabling learning of relationships beyond distance (Li et al., 2021).
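A minimal PyTorch sketch of this construction follows (after Li et al., 2021); the two-layer MLP, its width, and the GELU nonlinearity are illustrative assumptions rather than the paper's exact configuration:

```python
# Sketch of a learnable-Fourier positional encoder with MLP modulation.
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    def __init__(self, pos_dim: int, feat_dim: int, out_dim: int, gamma: float = 1.0):
        super().__init__()
        assert feat_dim % 2 == 0
        # W_r ~ N(0, gamma^-2), learnable after initialization
        self.W_r = nn.Parameter(torch.randn(feat_dim // 2, pos_dim) / gamma)
        # MLP_theta followed by learnable projection W_p (assumed sizes)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, out_dim),
        )
        self.feat_dim = feat_dim

    def forward(self, pos: torch.Tensor) -> torch.Tensor:
        # pos: (..., pos_dim) continuous positions
        proj = pos @ self.W_r.T                          # (..., feat_dim // 2)
        r = torch.cat([proj.cos(), proj.sin()], dim=-1)  # (..., feat_dim)
        r = r / self.feat_dim ** 0.5                     # normalize by sqrt(D)
        return self.mlp(r)                               # PE_x = MLP_theta(r_x) W_p

pe = LearnableFourierPE(pos_dim=2, feat_dim=128, out_dim=256)
print(pe(torch.rand(16, 2)).shape)   # 16 2-D coordinates -> torch.Size([16, 256])
```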
Shifted Basis Function Generalization
RefPE subsumes a wide family of shifted continuous basis embedders (Zheng et al., 2021). The general map is

$$\psi(x) = \big[\phi(x),\ \phi(x - \delta),\ \phi(x - 2\delta),\ \ldots\big],$$

where $\phi$ is a continuous basis function (e.g., impulse, sine, square wave, Gaussian) and $\delta$ is a step parameter. This allows explicit tuning of the trade-off between memorization (rank) and generalization (distance preservation), supporting deterministic or stochastic basis choices with known embedding properties.
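A short sketch of such a shifted-basis embedder; the specific bases and the Gaussian width below are illustrative choices:

```python
# Shifted continuous basis embedder: psi(x) = [phi(x - j*delta)]_{j=0..K-1}.
import numpy as np

def shifted_basis_embed(x, phi, delta=0.05, num_shifts=64):
    shifts = delta * np.arange(num_shifts)
    return phi(x - shifts)

# Interchangeable basis functions phi (illustrative widths/frequencies)
gaussian = lambda t: np.exp(-t**2 / (2 * 0.05**2))    # narrow: high rank, memorizes
sine     = lambda t: np.sin(2 * np.pi * t)            # periodic, distance-preserving
square   = lambda t: np.sign(np.sin(2 * np.pi * t))   # non-smooth periodic basis

for name, phi in [("gaussian", gaussian), ("sine", sine), ("square", square)]:
    print(name, shifted_basis_embed(0.37, phi)[:4].round(3))
```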
Orthogonal and Polynomial Bases
PoPE leverages Legendre polynomials as basis functions, providing orthogonality and non-periodicity (Aggarwal, 29 Apr 2024). Encodings are

$$\mathrm{PE}(x_i) = \big[P_0(x_i),\ P_1(x_i),\ \ldots,\ P_{d-1}(x_i)\big]$$

on equidistant samples $x_i \in [-1, 1]$, producing high-dimensional representations with reduced mutual-information bias versus sinusoids. Inner products of Legendre encodings offer high discrimination across both absolute and relative positions.
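The following NumPy sketch builds such encodings; mapping sequence positions onto equidistant samples of $[-1, 1]$ is an assumed discretization:

```python
# Legendre-polynomial positional encodings (in the spirit of PoPE).
import numpy as np
from numpy.polynomial import legendre

def legendre_pe(seq_len: int, d_model: int) -> np.ndarray:
    x = np.linspace(-1.0, 1.0, seq_len)        # equidistant samples on [-1, 1]
    # Row i holds P_0(x_i), ..., P_{d_model-1}(x_i)
    return legendre.legvander(x, d_model - 1)

pe = legendre_pe(seq_len=128, d_model=64)
print(pe.shape)                                # (128, 64)
# Distinct basis dimensions are (numerically) orthogonal on the symmetric grid:
print(abs(pe[:, 1] @ pe[:, 2]) < 1e-8)         # True
```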
Regularized, Spline-based Embedding of Physical Quantities
In molecular and spatial tasks, RefPE represents continuous distances/angles as bin-based embeddings, interpolated using cubic Hermite splines to ensure continuity and differentiability:

$$e(d) = h_{00}(t)\, e_i + h_{10}(t)\, m_i + h_{01}(t)\, e_{i+1} + h_{11}(t)\, m_{i+1}, \qquad t = \frac{d - d_i}{d_{i+1} - d_i},$$

where $e_i$ are learnable bin embeddings, $m_i$ their tangents, and $h_{00}, h_{10}, h_{01}, h_{11}$ the cubic Hermite basis polynomials, with smoothness regularization

$$\mathcal{L}_{\mathrm{smooth}} = \lambda \sum_i \lVert e_{i+1} - e_i \rVert^2.$$

This ensures embeddings vary smoothly along physical dimensions, supporting interpretability and stable force derivation (Gao et al., 2022).
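A sketch of spline-interpolated bin embeddings in this spirit; the bin layout, the finite-difference tangents, and the exact form of the smoothness penalty are illustrative assumptions:

```python
# Continuous distance embedding via cubic Hermite interpolation of bin embeddings.
import torch
import torch.nn as nn

class SplineDistanceEmbedding(nn.Module):
    def __init__(self, num_bins: int = 32, dim: int = 64, d_max: float = 5.0):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(num_bins, dim) * 0.02)  # e_i per bin
        self.register_buffer("centers", torch.linspace(0.0, d_max, num_bins))

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        # Locate interval [d_i, d_{i+1}] per distance; normalize to t in [0, 1]
        i = torch.clamp(torch.searchsorted(self.centers, d) - 1,
                        0, len(self.centers) - 2)
        t = (d - self.centers[i]) / (self.centers[i + 1] - self.centers[i])
        # Finite-difference tangents m_i (zero at the endpoints)
        m = torch.zeros_like(self.emb)
        m[1:-1] = (self.emb[2:] - self.emb[:-2]) / 2
        # Cubic Hermite basis polynomials h00, h10, h01, h11
        h00 = 2*t**3 - 3*t**2 + 1; h10 = t**3 - 2*t**2 + t
        h01 = -2*t**3 + 3*t**2;    h11 = t**3 - t**2
        return (h00[:, None] * self.emb[i]     + h10[:, None] * m[i]
              + h01[:, None] * self.emb[i + 1] + h11[:, None] * m[i + 1])

    def smoothness_loss(self) -> torch.Tensor:
        # One plausible regularizer: penalize jumps between adjacent bins
        return (self.emb[1:] - self.emb[:-1]).pow(2).sum()

emb = SplineDistanceEmbedding()
print(emb(torch.tensor([0.7, 1.23, 4.9])).shape)   # torch.Size([3, 64])
```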
3. Architectural Integration and Comparative Analysis
RefPE in Vision and Multimodal Models
In image generation, object detection, and UI captioning, learnable Fourier-based RefPE outperforms sine-based or embedding-based approaches in terms of bits-per-dim (likelihood), average precision, and convergence speed. For instance, in Reformer on ImageNet 64×64, Learnable-Fourier+MLP achieves superior bits/dim and faster training convergence (Li et al., 2021).
RefPE in Graph Neural Networks
Graph positional encodings, when naively injected as features (e.g., via Laplacian eigenmaps), suffer from instability (sign and basis ambiguity of the eigenvectors) and poor generalization. RefPE techniques such as PEG (Wang et al., 2022) decouple positional channels from feature channels, maintain permutation equivariance as well as equivariance to orthogonal transformations of the positional coordinates, and update original features through adjacency and positional distances, yielding higher link prediction accuracy and transferability across domains.
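A minimal sketch of a PEG-style layer built on these principles (after Wang et al., 2022); the sigmoid edge MLP and the single linear feature transform are illustrative simplifications:

```python
# PEG-style layer: positional channels gate message passing but are not mixed
# into node features, preserving equivariance properties of the encoding.
import torch
import torch.nn as nn

class PEGLayer(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.lin = nn.Linear(feat_dim, feat_dim)
        self.edge_mlp = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())

    def forward(self, X, P, A):
        # X: (n, feat_dim) features; P: (n, k) positional encodings; A: (n, n).
        # Edge weights depend only on ||p_u - p_v||, which is invariant to the
        # sign/rotation ambiguity of Laplacian eigenvectors.
        dist = torch.cdist(P, P).unsqueeze(-1)       # (n, n, 1) pairwise PE distances
        xi = self.edge_mlp(dist).squeeze(-1)         # (n, n) learned edge weights
        X_new = torch.relu((A * xi) @ self.lin(X))   # update features only
        return X_new, P                              # positional channels pass through

n = 5
A = (torch.rand(n, n) > 0.5).float()
A = ((A + A.T) > 0).float()                          # symmetrize a random adjacency
X2, P2 = PEGLayer(16)(torch.randn(n, 16), torch.randn(n, 4), A)
print(X2.shape)                                      # torch.Size([5, 16])
```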
Theoretical Comparison Across Encoding Families
RefPE methods have been situated within a general spectral or kernel framework, showing that they can approximate or generalize classical encodings (sinusoidal, random Fourier, embedding-based) by virtue of basis function choice, rank, and distance preservation properties (Zheng et al., 2021, Aggarwal, 29 Apr 2024, Gu et al., 19 May 2025). Key distinctions:
| Method | Generalization | Embedding Size | Extrapolation | Inductive Bias |
|---|---|---|---|---|
| Sinusoidal (fixed) | Limited | O(d) | Poor | Axis-independent, periodic |
| Learned embedding | None | O(n) | None | Pure lookup, no inductive bias |
| Learnable Fourier (RefPE) | Strong | O(d) | Excellent | Holistic, distance-aware (L2) |
| Polynomial (PoPE) | Strong | O(d) | Excellent | Orthogonality, non-periodicity |
| Spline-regularized (RefPE) | Strong | O(d) | Strong | Smooth, physically interpretable |
The advantage of RefPE is clearest in cases with shifting or unseen positions (e.g., variable image resolutions, unseen distances in molecules).
4. Empirical Performance and Application Domains
RefPE systems have demonstrated statistically significant improvements across a variety of difficult tasks:
- Vision: Faster convergence and lower bits/dim in generative models; higher object detection average precision (DETR) with better adaptation to unseen image sizes (Li et al., 2021).
- Language and Multimodal: Greater hit rate and lower variance in sequential recommendation with rotary-based RefPE; more stable recommendation system training (Lopez-Avila et al., 16 May 2024).
- Molecular Modeling: Lower force errors on MD17 (aspirin force error reduced from 49.7 to 12.3 meV/Å with RefPE+smooth) and lower mean absolute error in QM9 property prediction (Gao et al., 2022).
- Graph Learning: Enhanced link prediction, graph classification, and property estimation, especially in tasks requiring generalization to unseen subgraphs or domains (Wang et al., 2022, Verma et al., 6 Jun 2025).
- Generalization across domains: Similar RefPE embeddings are observed across distinct, but physically related, molecular tasks, suggesting strong transferability (Gao et al., 2022).
5. Interpretability, Physics-Inspired Inductive Bias, and Robustness
Several RefPE designs yield embeddings that are physically or semantically interpretable:
- Physics-based interpretation: Derivatives of the embedding with respect to physical positions (e.g., interatomic distances) reveal meaningful short-range vs long-range behavior, consistent across related molecular property prediction tasks (Gao et al., 2022); a short autograd sketch of computing such derivatives follows this list.
- Topology and global references: In graph learning, the incorporation of persistent homology features into RefPE (e.g., PiPE) resolves ambiguities that neither spectral nor topological features alone can distinguish, improving recognition of cycles and connectedness (Verma et al., 6 Jun 2025).
- Locality and symmetry trade-offs: Analyses show current PEs are strong in locality and symmetry, but overly symmetric encoding can make models insensitive to global order changes (e.g., semantic role swaps in language); RefPE is motivated to balance locality (for composition) and controlled asymmetry (for global order sensitivity) (Chen et al., 2023).
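As a minimal sketch of the physics-based interpretation in the first bullet above, the derivative of a continuous distance embedding with respect to the distance itself can be read off with autograd; the random Fourier embedding here is an illustrative stand-in for a trained molecular embedding, in which these derivatives would expose short- vs long-range behavior:

```python
# Differentiating a continuous embedding w.r.t. a physical distance.
import torch

D = 64
W = torch.randn(D // 2, 1)                    # illustrative fixed frequencies

def embed(d):
    proj = d @ W.T
    return torch.cat([proj.cos(), proj.sin()], dim=-1) / D ** 0.5

for dist in [1.0, 3.0, 8.0]:                  # short- to long-range distances (Å)
    d = torch.tensor([[dist]], requires_grad=True)
    grad, = torch.autograd.grad(embed(d).sum(), d)
    print(f"d = {dist:.1f} Å  d(emb)/dd = {grad.item():+.4f}")
```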
6. Limitations, Open Problems, and Research Directions
- Choice and tuning of basis functions: RefPE performance is sensitive to the choice (and learnability) of basis (Fourier, Gaussian, polynomial, etc.) and hyperparameters such as kernel width. Non-Fourier embedders may outperform Fourier features in memorization-generalization trade-offs if tuned for domain properties (Zheng et al., 2021, Aggarwal, 29 Apr 2024).
- Computational efficiency at scale: Parametric methods scale well, but learned embedding tables (discrete methods) can be prohibitive for high-resolution or variable input shapes; continuous RefPE mitigates this.
- Transferability and extrapolation: Empirical results suggest that RefPE generalizes better to unseen spatial organization or sequence lengths (e.g., in molecular and vision tasks) than discrete or axis-wise concatenated encodings.
- Integration with novel architectural paradigms: RefPE is compatible with both MLP-based models (for signal reconstruction) and deep, attention-based architectures (e.g., Reformer, DETR).
- Physics and domain-specific encoding: The regularized, interpretable embeddings obtained with splines or topological references fortify model design with domain insights, suggesting future work at the intersection of physics, geometry, and representation learning.
7. Summary
Reference Positional Encoding (RefPE) unifies and extends positional encoding strategies for neural architectures by leveraging learnable, continuous functions, kernel approximations, and task-adaptive modulation to provide robust, expressive, and interpretable position-dependent representations. RefPE outperforms traditional positional encoding approaches by holistically capturing multi-dimensional relationships, generalizing beyond fixed input scales, and aligning with domain-specific constraints. Experimental results across vision, language, molecular, and graph-based applications substantiate its accuracy, convergence benefits, and generalization properties, conferring a solid theoretical and empirical foundation for its further adoption and development in position-sensitive machine learning architectures (Li et al., 2021, Zheng et al., 2021, Gao et al., 2022, Wang et al., 2022, Chen et al., 2023, Aggarwal, 29 Apr 2024, Verma et al., 6 Jun 2025).