Randomised Positional Embeddings in Transformers
- Randomised Positional Embeddings (RPE) are stochastic encoding schemes for Transformers that replace traditional fixed or learned position encodings by sampling random indices from a large coordinate pool.
- They improve out-of-distribution generalization by ensuring comprehensive training over sequence positions and spatial configurations, thereby addressing undertraining in tail indices.
- RPE methods have been validated across language, vision, music, and algorithmic reasoning tasks, delivering notable gains in accuracy and resolution without requiring architectural changes.
Randomised Positional Embeddings (RPE) constitute a class of positional encoding schemes for Transformer architectures that replace or augment conventional fixed, learned, or relative position encodings with stochastic or randomized procedures. The aim is to address out-of-distribution (OOD) failures during extrapolation to longer sequences or higher resolutions, enhance length or resolution generalization, and mitigate overfitting or undertraining of position embedding parameters. RPE can be constructed as order-preserving mappings from sequences to a randomized subset of a large coordinate pool; in two dimensions, RPE generalizes to spatial settings for vision backbones. These strategies include one-dimensional randomly shifted padding schemes, global randomized sampling of position indices, stochastic feature mappings for linear transformers, and multi-dimensional sampling for structured data. RPE methods now have formal theoretical motivation and empirical support across language, algorithmic reasoning, music, and vision domains.
1. Motivation and Theoretical Foundations
Conventional Transformer models most commonly employ learned or deterministic absolute position encodings, such as the additive approach

$$h_i = e(x_i) + p_i,$$

where $p_i$ is a fixed (sinusoidal, learned, RoPE, etc.) vector assigned to absolute index $i$ (Ruoss et al., 2023). However, these encodings expose several OOD pathologies:
- In standard training regimes, the tail positions of the embedding matrix (corresponding to large $i$) are rarely updated, leading to under-calibrated or untrained representations for long sequences (Tao et al., 2023).
- At test time, sequence lengths often exceed those seen during training, placing the model in OOD positional regimes where fixed encodings are undefined, unused, or poorly calibrated. This yields severe accuracy and calibration failures, particularly in algorithmic and compositional tasks (Ruoss et al., 2023).
- Similar mismatch arises for high-resolution vision tasks, where 2D positional indices during inference are absent from the training grid, yielding poor generalization (Liu et al., 24 Mar 2025).
Randomized Positional Embeddings are theoretically motivated by the following:
- Distributional Coverage: By systematically randomizing positional indices within a superset, RPE ensures that all position encodings—regardless of test sequence length or spatial extent—are exercised during training, eliminating OOD activations at inference.
- Order Preservation: RPE discards the notion of fixed absolute distances but maintains sequence order, which is sufficient for most Transformer tasks (Ruoss et al., 2023).
- Variance-based Positional Signaling: In the absence of explicit positional embeddings, a causal Transformer layer with random weights yields a monotonic variance shrinkage in the self-attention output as a function of the position index (i.e., $\mathrm{Var}[y_i]$ decreases monotonically in $i$), which encodes implicit positional information (2305.13571). This suggests that explicit positional encodings can often be randomized or omitted, relying on inherent architectural bias.
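This variance-shrinkage effect can be reproduced in a few lines. Below is a minimal NumPy sketch (illustrative, not drawn from any of the cited papers): with random weights and no positional encoding, the per-position variance of a causal self-attention output decays as the averaged prefix grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 64, 32, 200          # sequence length, model width, Monte Carlo trials
var = np.zeros(n)

for _ in range(trials):
    X = rng.normal(size=(n, d))     # token embeddings, no positional encoding
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(d)
    scores[~np.tril(np.ones((n, n), dtype=bool))] = -np.inf   # causal mask
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                        # softmax rows
    var += (A @ V).var(axis=-1)     # per-position variance of the output

var /= trials
print(var[[0, 1, 3, 7, 15, 31, 63]])  # later positions show smaller variance: an implicit position code
```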
2. Core RPE Algorithms and Implementations
RPE instantiations appear in multiple domains and with different operational semantics. The principal one-dimensional and two-dimensional variants are:
2.1. Randomized Index Sampling
The main construction samples an ordered subset without replacement from a large integer pool $\{1, \dots, L\}$, where $L$ far exceeds the maximum training sequence length (Ruoss et al., 2023):
- For each example of length $n$, draw $I = \{i_1 < i_2 < \dots < i_n\} \sim \mathcal{U}(S_n(L))$, where $S_n(L)$ is the set of length-$n$ ordered subsets of $\{1, \dots, L\}$.
- For each token position $j$, assign the positional vector $p_{i_j}$. This is architecture-agnostic and compatible with any base encoding $p$ (sinusoidal, learned, RoPE, etc.).
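A minimal NumPy sketch of this sampling scheme follows, with sinusoidal encodings as the (interchangeable) base PE; the function names are illustrative:

```python
import numpy as np

def sample_positions(n, pool_size, rng):
    """Draw n indices from {0, ..., pool_size-1} without replacement and sort
    them ascending, so token order is preserved while absolute values vary."""
    return np.sort(rng.choice(pool_size, size=n, replace=False))

def sinusoidal_pe(positions, d_model):
    """Standard sinusoidal base encoding evaluated at arbitrary integer indices;
    any base PE (learned table, RoPE, ...) could be substituted here."""
    pos = positions[:, None].astype(np.float64)
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((len(positions), d_model))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

rng = np.random.default_rng(0)
n, L, d = 16, 2048, 64              # training length n, pool size L >> n, width d
pe = sinusoidal_pe(sample_positions(n, L, rng), d)   # fresh draw every training step
```

Because only the ordering of the sampled indices is fixed, a fresh draw per training step exposes the model to the full pool while keeping left-to-right structure intact.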
2.2. Random Padding (Random Shift)
For extractive QA and similar settings, the "Random Padding" strategy draws a random integer shift $k \in \{0, \dots, P\}$, where $P$ is the padding budget, and distributes the $P$ padding tokens between the front ($k$ tokens) and back ($P - k$ tokens). The model then receives real tokens at all position indices over the course of training, rebalancing gradient updates and preventing overfitting or undertraining of particular rows of the positional embedding table (Tao et al., 2023).
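A minimal sketch of the idea, assuming a single padded example with an accompanying attention mask; the helper name and token IDs are illustrative, and in practice the shift may be capped so that special tokens like [CLS] keep stable positions (see §5):

```python
import numpy as np

def random_padding(token_ids, max_len, pad_id, rng):
    """Split the padding budget P = max_len - len(token_ids) at a random point k,
    placing k pads in front and P - k behind, so every row of the positional
    embedding table sees real tokens over the course of training."""
    n_pad = max_len - len(token_ids)
    k = int(rng.integers(0, n_pad + 1))       # random front shift in {0, ..., P}
    ids = [pad_id] * k + list(token_ids) + [pad_id] * (n_pad - k)
    mask = [0] * k + [1] * len(token_ids) + [0] * (n_pad - k)   # attention mask
    return ids, mask

rng = np.random.default_rng(0)
print(random_padding([101, 7592, 2088, 102], max_len=8, pad_id=0, rng=rng))
```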
2.3. Multidimensional RPE (RPE-2D)
For image transformers (e.g., diffusion models), RPE-2D independently samples $h$ horizontal and $w$ vertical positions from large per-axis coordinate pools, sorts each set ascending ($x_1 < \dots < x_w$, $y_1 < \dots < y_h$), and assigns the 2D positional embedding $p_{(x_j, y_i)}$ to patch $(i, j)$. At inference, a deterministic grid of $x$ and $y$ coordinates is chosen to match the required resolution, guaranteeing seen positional coordinates (Liu et al., 24 Mar 2025).
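A minimal sketch of the per-axis sampling under assumed pool sizes (the paper's exact pool configuration may differ):

```python
import numpy as np

def sample_grid_positions(h, w, pool_h, pool_w, rng):
    """Independently sample sorted row and column indices from large 1-D pools;
    the patch at grid cell (i, j) receives the 2-D coordinate (ys[i], xs[j])."""
    ys = np.sort(rng.choice(pool_h, size=h, replace=False))
    xs = np.sort(rng.choice(pool_w, size=w, replace=False))
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (h, w, 2)

rng = np.random.default_rng(0)
coords_train = sample_grid_positions(16, 16, pool_h=64, pool_w=64, rng=rng)

# Inference at a higher resolution uses a deterministic grid that stays inside
# the trained pools, so every 2-D coordinate was exercised during training.
ys_eval, xs_eval = np.arange(32), np.arange(32)
coords_eval = np.stack(np.meshgrid(ys_eval, xs_eval, indexing="ij"), axis=-1)
```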
2.4. Stochastic Positional Encoding for Linear Transformers
In the context of relative position encoding for linear-complexity attention, Stochastic Positional Encoding (SPE) constructs random-feature-based approximations of the desired positional kernel (a function of the relative offset $i - j$) using Monte Carlo projections of Gaussian processes, bypassing quadratic complexity (Liutkus et al., 2021).
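The underlying Monte Carlo idea can be illustrated with classical random Fourier features: random projections whose inner products converge to a target stationary kernel as the feature count grows. This is a simplified stand-in, not SPE's exact query/key construction, and it assumes a Gaussian positional kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
R, ell = 256, 8.0                   # number of random features, kernel length-scale

# Random Fourier features for K(i - j) = exp(-(i - j)^2 / (2 ell^2)):
# E[phi(i) . phi(j)] -> K(i - j) as R grows (Monte Carlo approximation).
omega = rng.normal(scale=1.0 / ell, size=R)      # spectral samples of the kernel
phase = rng.uniform(0, 2 * np.pi, size=R)

def phi(pos):
    return np.sqrt(2.0 / R) * np.cos(np.outer(pos, omega) + phase)

pos = np.arange(64, dtype=float)
approx = phi(pos) @ phi(pos).T                   # R-feature kernel estimate
exact = np.exp(-np.subtract.outer(pos, pos) ** 2 / (2 * ell ** 2))
print(np.abs(approx - exact).max())              # error shrinks as R grows
```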
3. Empirical Performance and Comparative Analysis
RPE methods yield consistent or state-of-the-art performance in length and resolution generalization benchmarks across modalities.
| Domain | Task/Benchmark | RPE Gain vs. Baseline | Paper |
|---|---|---|---|
| Algorithmic Reasoning | 15 formal-language synthetic tasks | +12% accuracy on unseen length (avg); up to +43% | (Ruoss et al., 2023) |
| QA (text) | TriviaQA-Wiki, SQuAD, HotpotQA | +0.3–1.5 F1 overall; up to +3–4 F1 at sequence tail | (Tao et al., 2023) |
| Vision (Diffusion) | ImageNet resolution extrapolation | SOTA FID/sFID/IS at ×1.5–2×/4× train resolution | (Liu et al., 24 Mar 2025) |
| Music Generation | Uncond. pop-piano, groove tasks | Invariance and fidelity past train length/steps | (Liutkus et al., 2021) |
Random Padding specifically provides statistically significant gains for answers located at rear positions in extractive QA (+3–4 F1 vs. baseline) by eliminating the front/tail disparity in position embedding calibration (Tao et al., 2023). RPE-2D achieves strong performance when scaling from the training resolution to substantially higher inference resolutions (up to 4× the training grid), outperforming YaRN-2D, NTK-2D, and interpolation/extrapolation schemes at all tested resolutions (Liu et al., 24 Mar 2025).
4. Integration with Transformer Architectures
RPE is "drop-in" for any Transformer variant employing positional encodings.
- Additive Integration: Simply replace $p_j$ by $p_{i_j}$ (the embedding of the sampled index) in the token embedding summation.
- Attention Bias Integration (e.g., ALiBi): Replace the distance-based bias $b(k - l)$ with $b(i_k - i_l)$ computed from the sampled indices (Ruoss et al., 2023); see the sketch after this list.
- Relative Encoding: Replace the true distance $k - l$ by the sampled-index difference $i_k - i_l$ throughout the relative position computation in Transformer-XL style layers.
- Linear-complexity Attention: For Stochastic PE, random-feature maps are used to recover expected cross-covariance kernels, integrating RPE with Favor+ or ReLU-based attention mechanisms (Liutkus et al., 2021).
- Vision: For DiT backbones, RPE-2D replaces the fixed grid with per-iteration sampled spatial indices. All standard PE forms (sinusoidal, RoPE, learned) can serve as the base (Liu et al., 24 Mar 2025).
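As a concrete example of the attention-bias and relative-encoding variants, the sketch below computes an ALiBi-style linear bias from sampled indices rather than true positions; the helper is illustrative and not taken from the cited papers:

```python
import numpy as np

def randomized_alibi_bias(n, pool_size, slope, rng, train=True):
    """ALiBi-style linear attention bias where the usual distance k - l is
    replaced by the sampled-index difference i_k - i_l (sketch)."""
    idx = (np.sort(rng.choice(pool_size, size=n, replace=False))
           if train else np.arange(n))
    dist = idx[:, None] - idx[None, :]        # sampled relative distances
    bias = -slope * dist.astype(np.float64)   # penalty grows with distance
    bias[dist < 0] = -np.inf                  # causal mask for future positions
    return bias

rng = np.random.default_rng(0)
bias = randomized_alibi_bias(n=8, pool_size=128, slope=0.5, rng=rng)
```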
5. Practical Considerations and Hyperparameters
RPE schemes introduce minimal engineering overhead and no architectural changes, but optimal practice involves certain choices:
- Pool Size: Choose the pool size $L$ much larger than the maximum anticipated sequence length for sequence models, with correspondingly large per-axis pools for vision (Liu et al., 24 Mar 2025). Generalization is robust to the precise value of $L$, but the pool must cover target lengths/resolutions (Ruoss et al., 2023).
- Randomization Granularity: For Random Padding, the random shift $k$ can be capped to stabilize special tokens like [CLS].
- Inference: RPE randomization is not applied during test/inference; deterministic canonical padding or gridding is used to maintain input/position correspondence (Tao et al., 2023, Liu et al., 24 Mar 2025).
- Hybrid Usage: Causal LMs may omit or randomize embeddings, relying on variance shrinkage for positional cues; bidirectional models lack this inherent signal and require explicit PE (2305.13571).
- Efficiency: RPE schemes dramatically reduce the training cost of length-extrapolating models (e.g., an approximately ×35 training speedup over training directly at the longer evaluation lengths in algorithmic tasks) (Ruoss et al., 2023).
- Data Augmentation: For RPE-2D, combined resize/crop augmentation ("Cond-Aug") encoded via micro-conditioning vectors further strengthens OOD robustness (Liu et al., 24 Mar 2025).
6. Limitations and Applicability Domain
RPE is most beneficial in scenarios where:
- Tail indices are underexposed in training (short-sequence training, long-context evaluation).
- Global sequence order matters more than fixed distances.
- Target application requires extrapolation/generalization to unseen lengths or resolutions.
Limitations and constraints include:
- For sequence classification tasks dominated by [CLS], RPE has diminished or negligible impact (Tao et al., 2023).
- Models using purely relative or rotary PEs may gain less from RPE in settings where tail indices have already been well-trained (Tao et al., 2023).
- For extremely large inference resolutions, RPE-2D may require correspondingly larger position pools, increasing memory.
- Injection of crop/resize micro-conditioning vectors is necessary to prevent artifacts in aggressive vision augmentation (Liu et al., 24 Mar 2025).
7. Connections to Related Approaches
- Relative and Rotary Position Encoding: While RPE preserves order and can wrap relative PE in its sampling machinery, they address complementary aspects; RPE focuses on OOD coverage, whereas relative encodings focus on translation or gap invariance (Ruoss et al., 2023, Liutkus et al., 2021).
- Variance-based Implicit Encoding: In causal self-attention without PE, monotonic decay of self-attention output variance serves as a primitive positional signal, suggesting that explicit PE can sometimes be safely randomized or omitted (2305.13571).
- Random Features for Linear Transformers: Stochastic PE interprets relative PE through random-feature cross-covariance approximations, enabling linear complexity attention and extending RPE concepts to fast Transformer variants (Liutkus et al., 2021).
In summary, Randomised Positional Embeddings constitute a family of flexible, architecture-agnostic, and empirically robust methods for mitigating OOD failures and enhancing length and resolution generalization. By leveraging randomization in positional index assignment while preserving order, RPE addresses critical training-update imbalances and distribution shift in both text and vision Transformer models (Tao et al., 2023, Ruoss et al., 2023, 2305.13571, Liu et al., 24 Mar 2025, Liutkus et al., 2021).