
Positional Encoding Schemes

Updated 31 December 2025
  • Positional encoding schemes are methods that inject position-awareness into transformer, sequence, and graph neural network architectures to overcome permutation invariance.
  • They range from fixed sinusoidal and learned absolute encodings to advanced relative, orthogonal, kernel, and group-theoretic approaches that boost model expressiveness and scalability.
  • These schemes underpin model optimization, generalization, and extrapolation, supporting diverse applications such as machine translation, image recognition, and graph analysis.

A positional encoding scheme is a function or mechanism enabling transformer-based architectures, neural sequence models, and graph neural networks to inject position-awareness into context- or structure-agnostic computations such as self-attention or message passing. The primary goal is to resolve the inherent permutation invariance of these models, providing a means to encode sequence order, geometric spatial relations, or structural identity. Over the last decade, positional encoding design has expanded from fixed sinusoidal or learned absolute-index tables to a diverse taxonomy of relative, orthogonal, polynomial, kernel, spectral, and topological embeddings, each tailored to model expressiveness, generalization, scalability, adaptation to new input domains, and extrapolation capacity.

1. Foundational Schemes: Absolute, Relative, and Rotary Approaches

Early transformer models utilized absolute positional encodings (APE), most notably the sinusoidal construction of Vaswani et al., in which each position $m$ in a sequence of length $L$ is mapped to a $d$-dimensional vector:

$$PE_{\sin}(m, 2i) = \sin\!\left(m / 10000^{2i/d}\right), \qquad PE_{\sin}(m, 2i+1) = \cos\!\left(m / 10000^{2i/d}\right)$$

Learned absolute embeddings simply store a table $P \in \mathbb{R}^{L \times d}$ and are optimized end-to-end.
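A minimal NumPy sketch of this construction (assuming an even embedding dimension; the function and variable names are illustrative):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Fixed sinusoidal APE: even channels get sin, odd channels get cos."""
    assert d_model % 2 == 0, "illustrative sketch assumes even d_model"
    positions = np.arange(seq_len)[:, None]                     # (L, 1)
    freqs = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)  # (d/2,)
    angles = positions * freqs[None, :]                         # (L, d/2): m / base^(2i/d)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```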

Relative positional encodings (RPEs) introduce distance-based bias or modulation in self-attention, commonly augmenting the attention logits between positions $i$ and $j$ with a function of $|i - j|$. RPE variants include T5's bucketed bias (a learned scalar bias table indexed by distance) (Kazemnejad et al., 2023), sinusoidal relative bases (Pham et al., 2020), and content-modulated terms as in Shaw et al.
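For concreteness, a toy sketch of a distance-indexed bias table added to the attention logits (simple clipping stands in for T5's log-spaced bucketing, and the random table stands in for learned parameters):

```python
import numpy as np

def bucketed_relative_bias(seq_len: int, num_buckets: int = 32, seed: int = 0) -> np.ndarray:
    """Return an (L, L) bias matrix B[i, j] indexed by the clipped distance |i - j|."""
    rng = np.random.default_rng(seed)
    table = rng.normal(size=num_buckets)           # stand-in for a learned bias table
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]  # j - i
    buckets = np.minimum(np.abs(rel), num_buckets - 1)               # clip large offsets
    return table[buckets]                          # added elementwise to the q·kᵀ logits
```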

Rotary Positional Encoding (RoPE) applies position-dependent rotations in embedding space. For a query $q_i$ and key $k_j$ at positions $i$ and $j$:

$$q_i \gets R(i\theta)\,q_i, \qquad k_j \gets R(j\theta)\,k_j, \qquad \langle q_i, k_j \rangle \sim q_i^T R((j-i)\theta)\,k_j$$

RoPE enables exact compositional relative-position invariance and efficient streaming. However, it reproduces the high-correlation pathology of sinusoids at high dimensions, as recent theoretical analysis demonstrates (Aggarwal, 29 Apr 2024; Gu et al., 19 May 2025).
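A minimal NumPy sketch of the rotation, assuming an even head dimension and the conventional 10000-base frequency schedule:

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive channel pairs (2i, 2i+1) of x (shape (L, d), d even)
    by angle positions * theta_i, with theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    assert d % 2 == 0, "sketch assumes even dimension"
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    ang = positions[:, None] * theta[None, :]      # (L, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# After rotating queries and keys this way, q_i · k_j depends on positions only through j - i.
```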

2. Orthogonal Polynomial and Function-Based Positional Encodings

Several recent schemes address high-dimension degeneracy and periodic collapse by leveraging orthogonal polynomial or function bases.

Legendre Polynomial-Based Encoding (PoPE) (Aggarwal, 29 Apr 2024; Li, 5 Jun 2025) constructs a $d$-dimensional position embedding from the Legendre polynomials $P_n(x)$: for absolute position $p$ mapped to $x_p \in [-1, 1]$,

$$PoPE(p) = [P_0(x_p), P_1(x_p), \ldots, P_{d-1}(x_p)]$$

Properties:

  • Non-periodic and strictly orthogonal on $(-1, 1)$.
  • Correlation between PoPE vectors decays with positional distance.
  • The three-term recurrence enables direct embedding of relative offsets.

Empirical results establish new BLEU baselines and 2–3× accelerated convergence on Multi30K EN–DE MT.
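A minimal basis-evaluation sketch, assuming positions are mapped linearly onto $[-1, 1]$ (the cited papers' exact scaling and normalization may differ):

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def legendre_pe(seq_len: int, d_model: int) -> np.ndarray:
    """PoPE-style encoding: column n of the result is P_n(x_p) for each position p."""
    x = np.linspace(-1.0, 1.0, seq_len)     # assumed linear position-to-[-1, 1] map
    return legvander(x, d_model - 1)        # (L, d), Legendre degrees 0 .. d-1
```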

Wavelet-based PEs stack multi-resolution orthonormal basis evaluations, so the embeddings decay exponentially for positions beyond the training window, yielding strong extrapolation (Li, 5 Jun 2025). DFT-based 'faithful' encoding leverages uniform spectral coverage and invertibility, systematically improving time-series classification under sequence shifts (Idé et al., 15 May 2024).
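As a generic illustration of a DFT-basis encoding (a sketch of the idea, not the cited paper's exact construction), each position can be assigned one row of a truncated real discrete Fourier matrix, giving uniform coverage of the spectrum:

```python
import numpy as np

def dft_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Assign position m the real/imaginary parts of the first d/2 DFT harmonics."""
    assert d_model % 2 == 0, "sketch assumes even d_model"
    m = np.arange(seq_len)[:, None]                  # positions
    k = np.arange(d_model // 2)[None, :]             # harmonic indices
    ang = 2.0 * np.pi * m * k / seq_len
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=-1)   # (L, d)
```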

3. New Periodic Families and Functional Generalizations

Expanding on periodic basis encoding, alternative functional forms have been empirically evaluated:

  • Triangular, Square, and Sawtooth Waves (Lopez-Rubio et al., 22 Dec 2025): defined as $2\pi$-periodic basis functions with controlled smoothness or quantization. Triangular and sawtooth waves offer uniform output distributions and faster, more stable convergence than classical sinusoids, while the square wave encodes coarse chunking, which can be beneficial in segmental or hierarchical models (see the sketch after the table below).
| Encoding | Smoothness | Distribution | BLEU-4 (Multi30K) |
|---|---|---|---|
| Sinusoidal | $C^\infty$ | Extremal peaks | 29.48 |
| Triangular | Continuous ($C^0$) | Uniform | 40.68 |
| Sawtooth | Continuous ($C^0$) | Uniform | 40.77 |
| Square | Discontinuous | Bucketing | 34.54 |
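A sketch of the basis swap described above, using SciPy's waveform generators in place of sin/cos (the phase conventions here are illustrative and may differ from the cited work):

```python
import numpy as np
from scipy.signal import sawtooth, square

def wave_pe(seq_len: int, d_model: int, wave: str = "triangular", base: float = 10000.0) -> np.ndarray:
    """Sinusoidal-style PE with the sin/cos basis replaced by a 2π-periodic waveform."""
    assert d_model % 2 == 0, "sketch assumes even d_model"
    waveforms = {
        "triangular": lambda t: sawtooth(t, width=0.5),   # symmetric triangle wave
        "sawtooth": lambda t: sawtooth(t),
        "square": lambda t: square(t),
    }
    f = waveforms[wave]
    pos = np.arange(seq_len)[:, None]
    freqs = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    ang = pos * freqs[None, :]
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = f(ang)                  # "sin-like" channel
    pe[:, 1::2] = f(ang + np.pi / 2)      # quarter-period-shifted "cos-like" channel
    return pe
```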

4. Advances in Multiscale, Kernel, and Spline Spatial Encoding

Grid-cell inspired encodings (GridPE) model multi-dimensional spatial location as a sum of random Fourier features, with the population code constructed as

$$g(x) = \sum_{i=1}^{n} c_i \, e^{j k_i^T x}$$

GridPE representations are provably translationally invariant and kernel shift-invariant, supporting generalization to arbitrary $p$-dimensional spaces. The optimal module scale ratio is $r = e^{1/p}$, minimizing neuron count under coverage constraints. In PVT models for image recognition, GridPE delivers superior top-5 accuracy and reliable spatial-distance decay in embedding inner products (Li et al., 11 Jun 2024).
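A random-Fourier-feature sketch of this idea for $p$-dimensional coordinates (the module-scale structure and coefficients $c_i$ of GridPE itself are not modeled; wave vectors are simply drawn at random):

```python
import numpy as np

def rff_spatial_pe(coords: np.ndarray, d_model: int, scale: float = 1.0, seed: int = 0) -> np.ndarray:
    """Encode coordinates of shape (N, p) as cos/sin of random projections k_i^T x.
    Inner products between encodings depend only on x - y (a shift-invariant kernel)."""
    assert d_model % 2 == 0, "sketch assumes even d_model"
    rng = np.random.default_rng(seed)
    k = rng.normal(scale=scale, size=(d_model // 2, coords.shape[-1]))   # wave vectors k_i
    phase = coords @ k.T                                                 # (N, d/2)
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=-1)       # (N, d)
```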

Spline Positional Encoding (Spe) uses B-spline basis functions with learnable knot weights, locally supporting geometric detail and enabling multi-scale refinement. When projected along randomly selected axes and summed, Spe achieves top performance in 3D shape recovery and image regression tasks, outperforming Fourier features and sinusoidal encodings, especially at high spatial resolution (Wang et al., 2021).
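A one-axis sketch of B-spline basis evaluation on normalized coordinates in $[0, 1]$ (in Spe the per-basis weights are learnable and several randomly oriented axes are summed; that machinery is omitted here):

```python
import numpy as np
from scipy.interpolate import BSpline

def spline_basis_pe(x: np.ndarray, num_knots: int = 16, degree: int = 3) -> np.ndarray:
    """Evaluate all B-spline basis functions on a clamped uniform knot vector at x in [0, 1]."""
    t = np.concatenate([np.zeros(degree), np.linspace(0.0, 1.0, num_knots), np.ones(degree)])
    n_basis = len(t) - degree - 1
    # Identity coefficient matrix -> evaluating the spline yields every basis function at once.
    return BSpline(t, np.eye(n_basis), degree)(x)      # (len(x), n_basis)
```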

5. Hierarchical, Fractional, and Sequential Positional Encoding Variants

Fractional Positional Encoding (FPE) is designed for insertion transformers and parallel decoding schemes. It eschews any global index, instead interpolating the fixed embeddings of neighboring tokens at insert time, maintaining cacheability and fast batched inference (Zhang et al., 2021). Each token's positional vector is a function $f(p_L, p_R)$ of its left/right neighbors, never changing after placement.
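A toy sketch of the insertion rule, using midpoint interpolation as a stand-in for the paper's $f(p_L, p_R)$:

```python
import numpy as np

def fpe_insert(pos_left: np.ndarray, pos_right: np.ndarray) -> np.ndarray:
    """A newly inserted token takes a position vector interpolated from its neighbors
    and keeps it permanently, so cached keys/values for other tokens remain valid."""
    return 0.5 * (np.asarray(pos_left) + np.asarray(pos_right))

# usage: start from fixed boundary vectors, then insert tokens in any order
d = 8
p_left_boundary, p_right_boundary = np.zeros(d), np.ones(d)
p_a = fpe_insert(p_left_boundary, p_right_boundary)   # first inserted token
p_b = fpe_insert(p_left_boundary, p_a)                # token later inserted to its left
```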

Bilevel Positional Encoding (BiPE) splits position modeling into two levels: intra-segment (absolute encoding within a segment) and inter-segment (relative encoding between segments). ALiBi-style biases or RoPE-style per-segment rotations supply the inter-segment signal. BiPE enables both strong extrapolation and parameter efficiency in automata emulation tasks and delivers state-of-the-art perplexity and downstream scores on long-context language modeling and summarization tasks (He et al., 29 Jan 2024).

SeqPE encodes each $n$-dimensional index as a symbolic digit sequence processed by a lightweight sequential encoder (typically a small Transformer), regularized by contrastive alignment to a chosen position-distance function and distillation to anchor out-of-distribution contexts (Li et al., 16 Jun 2025). SeqPE achieves near-optimal extrapolation and multi-modal extension to vision data, with logarithmic parameter growth.

6. Group-Theoretic, Spectral, and Graph-Based Positional Encoding

The GRAPE framework (Multiplicative and Additive Group Representational PE) (Zhang et al., 8 Dec 2025) unifies and extends RoPE and ALiBi via group actions:

  • Multiplicative: Rotations in $SO(d)$ via exponentials of skew-symmetric generators, supporting block-diagonal or non-commuting subspaces for cross-feature coupling.
  • Additive: Unipotent actions in $GL(d+1)$ yielding linear biases as in ALiBi, FoX, and path-integral recency-gating.

These generalized group-theoretic perspectives guarantee exact compositional/streaming laws and enable direct control over expressiveness and dynamic recency scaling.
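As a concrete instance of the additive family, a sketch of ALiBi-style per-head linear biases (geometric slopes as in the original ALiBi recipe for power-of-two head counts; applied here to absolute distance rather than the causal offset):

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Return (H, L, L) biases -m_h * |i - j| to be added to each head's attention logits."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)       # m_h
    rel = np.abs(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None])
    return -slopes[:, None, None] * rel[None, :, :]
```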

Spectral analysis of content-position coupling demonstrates that multiplicative Toeplitz modulation, as in RoPE, contracts the attention-logit spectral norm, accelerating optimization and specializing early-layer heads for positional signal (Gu et al., 19 May 2025). Controlled mixing with absolute PE (MLA) diffuses this concentration, ameliorating length-generalization deficits.

In graph neural networks, positional encodings include Laplacian eigenvector (LPE), $p$-Laplacian, and random feature propagation (RFP) encodings (Maskey et al., 2022; Eliasof et al., 2023; 2502.01122). Learnable encodings (PEARL, PiPE) (2502.01122; Verma et al., 6 Jun 2025) employ GNN-generated features from random or basis initializations, statistical pooling, and persistent-homology augmentations for expressiveness beyond classical 1-WL. Persistence-informed PiPE is a provably strictly more expressive scheme for molecular and OOD graph classification.
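For the spectral family, a minimal sketch of Laplacian-eigenvector positional encoding on a dense adjacency matrix with symmetric normalization (the eigenvector sign ambiguity, usually handled by random flips during training, is ignored here):

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """Node features from the k lowest-frequency nontrivial eigenvectors of the
    symmetric normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                      # drop the lowest-eigenvalue eigenvector
```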

7. Expressiveness, Generalization, and Extrapolation Properties

Expressiveness of positional encoding schemes (measured via universal approximation, WL graph distinguishability, and sequence mapping capacity) is deeply tied to the choice of basis and encoding modality (Li, 5 Jun 2025):

  • Sinusoidal and learned absolute PEs are injective up to period and training length, but periodic collapse and out-of-range index saturation undermine extrapolation.
  • Relative PEs provide robust generalization up to a defined offset but require careful bias or clipping management.
  • ALiBi exhibits strict linear extrapolation, tunable via its decay parameter.
  • Orthogonal (Legendre, wavelet) and kernel-based schemes yield strong multi-scale expressiveness and decay-based extrapolation, significantly outperforming sinusoids in synthetic tasks.
  • Group-theoretic and spectral encodings unify compositional laws, with mixing controlling optimization efficiency and concentration.
| Scheme | Generalization | Extrapolation | Expressiveness |
|---|---|---|---|
| Sinusoidal APE | Moderate | Poor (periodicity) | Universal (on train-length) |
| Learned APE | Poor / OOD-unsafe | None | High (on train) |
| Relative PE | Strong (bucketed) | Clipped | Limited by bucket size |
| ALiBi | Linear decay | Robust linear | Good (monotonic tasks) |
| Wavelet/Legendre | Excellent | Strong (decay) | High multi-scale |
| Group-theoretic | Tunable | Tunable | Highest (via subgroup choice) |

Conclusion

Positional encoding schemes have evolved into a rigorous area of model architecture, encompassing a wide spectrum of basis functions, algebraic group actions, topological features, and learnable mappings. Design choices in positional encoding underpin model expressiveness, optimization dynamics, adaptation, and extrapolation—determining the effectiveness of transformers and GNNs across domains. Orthogonal polynomial, group-theoretic, and multi-scale kernel encodings constitute the current frontier, with empirical evidence demonstrating performance and stability gains in both standard and out-of-distribution tasks.
