Randomised Positional Embeddings in Transformers

Updated 6 May 2026
  • Randomised Positional Embeddings (RPE) are stochastic encoding schemes for Transformers that replace traditional fixed or learned position encodings by sampling random indices from a large coordinate pool.
  • They improve out-of-distribution generalization by ensuring comprehensive training over sequence positions and spatial configurations, thereby addressing undertraining in tail indices.
  • RPE methods have been validated across language, vision, music, and algorithmic reasoning tasks, delivering notable gains in accuracy and resolution without requiring architectural changes.

Randomised Positional Embeddings (RPE) constitute a class of positional encoding schemes for Transformer architectures that replace or augment conventional fixed, learned, or relative position encodings with stochastic or randomized procedures. The aim is to address out-of-distribution (OOD) failures during extrapolation to longer sequences or higher resolutions, enhance length or resolution generalization, and mitigate overfitting or undertraining of position embedding parameters. RPE can be constructed as order-preserving mappings from sequences to a randomized subset of a large coordinate pool; in two dimensions, RPE generalizes to spatial settings for vision backbones. These strategies include one-dimensional randomly shifted padding schemes, global randomized sampling of position indices, stochastic feature mappings for linear transformers, and multi-dimensional sampling for structured data. RPE methods now have formal theoretical motivation and empirical support across language, algorithmic reasoning, music, and vision domains.

1. Motivation and Theoretical Foundations

Conventional Transformer models most commonly employ learned or deterministic absolute position encodings, such as the additive approach

\mathbf{x}_i = \text{Embed}(t_i) + \text{PE}(i),

where \text{PE}(i) is a fixed (sinusoidal, learned, RoPE, etc.) vector assigned to absolute index i (Ruoss et al., 2023). However, these encodings expose several OOD pathologies:

  • In standard training regimes, the tail positions of the embedding matrix (corresponding to large i) are rarely updated, leading to under-calibrated or untrained representations for long sequences (Tao et al., 2023).
  • At test time, sequence lengths often exceed those seen during training, placing the model in OOD positional regimes where fixed encodings are undefined, unused, or poorly calibrated. This yields severe accuracy and calibration failures, particularly in algorithmic and compositional tasks (Ruoss et al., 2023).
  • A similar mismatch arises for high-resolution vision tasks, where 2D positional indices encountered at inference are absent from the training grid, yielding poor generalization (Liu et al., 24 Mar 2025).

Randomized Positional Embeddings are theoretically motivated by the following:

  • Distributional Coverage: By systematically randomizing positional indices within a superset, RPE ensures that all position encodings—regardless of test sequence length or spatial extent—are exercised during training, eliminating OOD activations at inference.
  • Order Preservation: RPE discards the notion of fixed absolute distances but maintains sequence order, which is sufficient for most Transformer tasks (Ruoss et al., 2023).
  • Variance-based Positional Signaling: In the absence of explicit positional embeddings, a causal Transformer layer with random weights yields a monotonic variance shrinkage in the self-attention output as a function of the position index m (i.e., \mathrm{Var}[o_m] \sim 1/m), which encodes implicit positional information (2305.13571). This suggests that explicit positional encodings can often be randomized or omitted, relying on inherent architectural bias.
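The variance-shrinkage effect can be checked with a minimal NumPy simulation. Under uniform causal attention (a crude stand-in for a randomly initialised layer, an assumption of this sketch), the output at position m is the mean of m i.i.d. value vectors, so its variance decays like 1/m:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, trials = 64, 16, 2000          # sequence length, model dim, Monte Carlo trials

# i.i.d. value vectors; uniform causal attention averages the first m of them,
# so Var[o_m] = Var[v] / m -- a monotone positional signal with no explicit PE.
V = rng.standard_normal((trials, T, d))
O = np.cumsum(V, axis=1) / np.arange(1, T + 1)[None, :, None]

var_by_pos = O.var(axis=(0, 2))      # empirical Var[o_m] for m = 1..T
```

Plotting `var_by_pos` against position shows the 1/m decay that a downstream layer could read off as a positional cue.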

2. Core RPE Algorithms and Implementations

RPE instantiations appear in multiple domains and with different operational semantics. The principal one-dimensional and two-dimensional variants are:

2.1. Randomized Index Sampling

The main construction samples an ordered subset I = {i_1, …, i_n} without replacement from a large integer pool {1, …, L}, where the sequence length satisfies n ≤ N < L (Ruoss et al., 2023):

  1. For each example, sample the sequence length n ∼ U({1, …, N}).
  2. Sample an ordered index set I = {i_1 < i_2 < … < i_n} uniformly from S_n, the set of length-n ordered subsets of {1, …, L}.
  3. For each token position j ∈ {1, …, n}, assign PE(i_j). This is architecture-agnostic and compatible with any base PE (sinusoidal, learned, RoPE, etc.).
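The three steps above can be sketched in NumPy; the pool size, lengths, and the sinusoidal base PE are illustrative choices here, not the paper's exact settings:

```python
import numpy as np

def sample_positions(n, L, rng):
    """Step 2: sample an ordered subset i_1 < ... < i_n from {1, ..., L}."""
    assert n <= L
    return np.sort(rng.choice(np.arange(1, L + 1), size=n, replace=False))

def sinusoidal_pe(positions, d_model):
    """Step 3: evaluate a base PE at the sampled indices (any base PE works)."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    ang = positions[:, None] * inv_freq[None, :]
    pe = np.zeros((len(positions), d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(ang), np.cos(ang)
    return pe

rng = np.random.default_rng(0)
idx = sample_positions(n=10, L=2048, rng=rng)   # illustrative n and pool size L
pe = sinusoidal_pe(idx, d_model=64)             # added to the token embeddings
```

Because only the sampled indices change between examples, the token embeddings and model weights are untouched, which is what makes the scheme architecture-agnostic.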

2.2. Random Padding (Random Shift)

For extractive QA and similar settings, the "Random Padding" strategy draws a random front shift k ∼ U({0, …, P}), where P is the number of padding tokens, placing k of them before the real tokens and the remaining P − k after them. Over training, the model thus receives real tokens at all position indices, rebalancing gradient updates and preventing overfitting or undertraining of particular rows of the positional embedding table (Tao et al., 2023).
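A minimal sketch of Random Padding (function name, pad id, and token values are illustrative):

```python
import random

def random_padding(tokens, max_len, pad_id=0, rng=random):
    """Split the padding budget randomly between front and back so that real
    tokens visit all rows of the position-embedding table over training."""
    p = max_len - len(tokens)            # total padding tokens P
    k = rng.randint(0, p)                # front shift k ~ U({0, ..., P})
    return [pad_id] * k + tokens + [pad_id] * (p - k)

padded = random_padding([101, 7, 8, 9, 102], max_len=12)
```

The token order is preserved; only the absolute positions the real tokens land on vary across epochs.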

2.3. Multidimensional RPE (RPE-2D)

For image transformers (e.g., diffusion models), RPE-2D independently samples n_h horizontal and n_w vertical positions, I_h ⊂ {1, …, L_h} and I_w ⊂ {1, …, L_w}, and assigns the 2D positional embedding PE(i_h, i_w) to the patch at grid location (h, w). At inference, a deterministic grid matching the required resolution is chosen, guaranteeing that all positional coordinates were seen during training (Liu et al., 24 Mar 2025).
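A hedged NumPy sketch of the 2D sampling (function and argument names are illustrative, and the deterministic leading grid at inference is one plausible reading of the scheme):

```python
import numpy as np

def sample_grid_positions(n_h, n_w, L_h, L_w, rng, train=True):
    """Sample ordered row/column indices from large pools during training;
    fall back to a deterministic grid at inference."""
    if train:
        rows = np.sort(rng.choice(np.arange(1, L_h + 1), n_h, replace=False))
        cols = np.sort(rng.choice(np.arange(1, L_w + 1), n_w, replace=False))
    else:
        rows, cols = np.arange(1, n_h + 1), np.arange(1, n_w + 1)
    # Each patch (r, c) gets the 2D coordinate (rows[r], cols[c]) for its base PE.
    return np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1)

rng = np.random.default_rng(0)
coords = sample_grid_positions(16, 16, 64, 64, rng)   # (16, 16, 2) coordinates
```

Sampling rows and columns independently keeps the cost linear in the grid side length rather than quadratic in the number of patches.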

2.4. Stochastic Positional Encoding for Linear Transformers

In the context of relative position encoding for linear-complexity attention, Stochastic Positional Encoding (SPE) constructs random-feature-based approximations of the desired relative-position kernel using Monte Carlo projections of Gaussian processes, bypassing quadratic complexity (Liutkus et al., 2021).
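The random-feature idea behind SPE can be illustrated for a stationary kernel: draw frequencies from the kernel's spectral density and random phases, and the cross-covariance of the resulting cosine features approximates the kernel without ever forming the T × T matrix. This is a simplified sketch of the principle, not the paper's full construction; the Gaussian kernel and parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, R = 128, 4096                      # sequence length, number of random features
sigma = 0.05                          # spectral width of the target kernel

# Target stationary kernel P(m - n) = exp(-sigma^2 (m - n)^2 / 2) has a Gaussian
# spectral density, so frequencies are drawn from N(0, sigma^2), phases uniformly.
omega = rng.normal(0.0, sigma, size=R)
phi = rng.uniform(0.0, 2 * np.pi, size=R)

pos = np.arange(T)
feats = np.sqrt(2.0 / R) * np.cos(pos[:, None] * omega[None, :] + phi[None, :])

# Monte Carlo estimate: (feats @ feats.T)[m, n] ~ P(m - n). In linear attention
# the features multiply queries/keys directly, so K_hat is never materialized.
K_hat = feats @ feats.T
```

The approximation error shrinks as 1/√R, which is the usual random-features trade-off between feature count and kernel fidelity.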

3. Empirical Performance and Comparative Analysis

RPE methods yield consistent or state-of-the-art performance in length and resolution generalization benchmarks across modalities.

Domain | Task/Benchmark | RPE Gain vs. Baseline | Paper
Algorithmic Reasoning | 15 formal-language synthetic tasks | +12% accuracy on unseen lengths (avg); up to +43% | (Ruoss et al., 2023)
QA (text) | TriviaQA-Wiki, SQuAD, HotpotQA | +0.3–1.5 F1 overall; up to +3–4 F1 at sequence tail | (Tao et al., 2023)
Vision (Diffusion) | ImageNet resolution extrapolation | SOTA FID/sFID/IS at 1.5–2×/4× train resolution | (Liu et al., 24 Mar 2025)
Music Generation | Uncond. pop-piano, groove tasks | Invariance and fidelity past train length/steps | (Liutkus et al., 2021)

Random Padding specifically provides statistically significant gains for answers located at rear positions in extractive QA (+3–4 F1 vs. baseline) by eliminating the front/tail disparity in position embedding calibration (Tao et al., 2023). RPE-2D achieves strong performance scaling from the training resolution to 1.5–2× and 4× larger inference resolutions, outperforming YaRN-2D, NTK-2D, and interpolation/extrapolation schemes at all tested resolutions (Liu et al., 24 Mar 2025).

4. Integration with Transformer Architectures

RPE is "drop-in" for any Transformer variant employing positional encodings.

  • Additive Integration: Replace PE(i) by PE(i_j) in the token-embedding summation, where i_j is the sampled index for position j.
  • Attention Bias Integration (e.g., ALiBi): Compute the linear attention bias from differences of sampled indices i_q − i_k rather than raw positions (Ruoss et al., 2023).
  • Relative Encoding: Replace the true distance j − k by the sampled-index distance i_j − i_k throughout relative-position computation in Transformer-XL-style layers.
  • Linear-complexity Attention: For Stochastic PE, random-feature maps are used to recover expected cross-covariance kernels, integrating RPE with Favor+ or ReLU-based attention mechanisms (Liutkus et al., 2021).
  • Vision: For DiT backbones, RPE-2D replaces the fixed grid with per-iteration sampled spatial indices. All standard PE forms (sinusoidal, RoPE, learned) can serve as the base (Liu et al., 24 Mar 2025).
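The attention-bias integration can be sketched by computing an ALiBi-style causal bias from sampled indices rather than raw positions (the slope value and function name are illustrative):

```python
import numpy as np

def alibi_bias(sampled_idx, slope=0.5):
    """ALiBi-style causal bias from sampled indices i_j instead of raw
    positions: bias[q, k] = -slope * (i_q - i_k) for k <= q, -inf otherwise."""
    i = np.asarray(sampled_idx, dtype=float)
    dist = i[:, None] - i[None, :]          # randomized relative distances
    bias = -slope * dist
    return np.where(np.tril(np.ones_like(bias)) > 0, bias, -np.inf)

rng = np.random.default_rng(0)
idx = np.sort(rng.choice(np.arange(1, 129), size=6, replace=False))
B = alibi_bias(idx)                          # added to attention logits
```

Because the sampled indices are ordered, the sign structure of the bias (recency preferred) is preserved even though the absolute gaps are randomized.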

5. Practical Considerations and Hyperparameters

RPE schemes introduce minimal engineering overhead and no architectural changes, but optimal practice involves certain choices:

  • Pool Size: Choose the index pool L to exceed the longest sequence expected at test time (and pools L_h, L_w to cover the largest target resolution for vision) (Liu et al., 24 Mar 2025). Generalization is robust to the precise pool size, provided it covers target lengths/resolutions (Ruoss et al., 2023).
  • Randomization Granularity: For Random Padding, the random front shift can be capped to stabilize special tokens like [CLS].
  • Inference: RPE randomization is not applied during test/inference; deterministic canonical padding or gridding is used to maintain input/position correspondence (Tao et al., 2023, Liu et al., 24 Mar 2025).
  • Hybrid Usage: Causal LMs may omit or randomize embeddings, relying on variance shrinkage as an implicit positional cue; bidirectional models lack this inherent signal and require explicit PE (2305.13571).
  • Efficiency: RPE schemes dramatically reduce the training required by length-extrapolating models (e.g., a ~35× speedup to reach 90% accuracy in algorithmic tasks) (Ruoss et al., 2023).
  • Data Augmentation: For RPE-2D, combined resize/crop augmentation ("Cond-Aug") encoded via micro-conditioning vectors further strengthens OOD robustness (Liu et al., 24 Mar 2025).
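The train/inference convention above (randomize during training, deterministic canonical positions at test time) amounts to a simple mode switch; names are illustrative:

```python
import numpy as np

def position_indices(n, L, rng=None, training=False):
    """Randomized ordered indices during training; deterministic canonical
    positions 1..n at inference, keeping input/position correspondence fixed."""
    if training:
        return np.sort(rng.choice(np.arange(1, L + 1), size=n, replace=False))
    return np.arange(1, n + 1)

eval_idx = position_indices(8, 2048)                                    # 1..8
train_idx = position_indices(8, 2048, np.random.default_rng(0), training=True)
```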

6. Limitations and Applicability Domain

RPE is most beneficial in scenarios where:

  • Tail indices are underexposed in training (short-sequence training, long-context evaluation).
  • Global sequence order matters more than fixed distances.
  • Target application requires extrapolation/generalization to unseen lengths or resolutions.

Limitations and constraints include:

  • For sequence classification tasks dominated by [CLS], RPE has diminished or negligible impact (Tao et al., 2023).
  • Models using purely relative or rotary PEs may gain less from RPE in settings where tail indices have already been well-trained (Tao et al., 2023).
  • For extremely large inference resolutions, RPE-2D may require larger position pools, increasing memory.
  • Injection of crop/resize micro-conditioning vectors is necessary to prevent artifacts in aggressive vision augmentation (Liu et al., 24 Mar 2025).
Several related mechanisms are complementary rather than competing:

  • Relative and Rotary Position Encoding: While RPE preserves order and can wrap relative PE in its sampling machinery, they address complementary aspects; RPE focuses on OOD coverage, whereas relative encodings focus on translation or gap invariance (Ruoss et al., 2023, Liutkus et al., 2021).
  • Variance-based Implicit Encoding: In causal self-attention without PE, the monotonic decay of self-attention output variance serves as a primitive positional signal, suggesting that explicit PE can sometimes be safely randomized or omitted (2305.13571).
  • Random Features for Linear Transformers: Stochastic PE interprets relative PE through random-feature cross-covariance approximations, enabling linear-complexity attention and extending RPE concepts to fast Transformer variants (Liutkus et al., 2021).

In summary, Randomised Positional Embeddings constitute a family of flexible, architecture-agnostic, and empirically robust methods for mitigating OOD failures and enhancing length and resolution generalization. By leveraging randomization in positional index assignment while preserving order, RPE addresses critical training-update imbalances and distribution shift in both text and vision Transformer models (Tao et al., 2023, Ruoss et al., 2023, 2305.13571, Liu et al., 24 Mar 2025, Liutkus et al., 2021).
