Absolute Positional Embeddings
- Absolute positional embeddings are unique vectors associated with each sequence position, enabling Transformers to differentiate element order.
- Variants such as learned, sinusoidal, and polynomial embeddings offer diverse trade-offs in extrapolation, efficiency, and expressiveness.
- Advanced methods like CAPE, SHAPE, and MapFormers augment embeddings with input-dependent adjustments to improve robustness and generalization.
Absolute positional embeddings encode the position of sequence elements by associating each position with a unique vector, which is typically combined (by addition or concatenation) with the content embedding prior to any further processing. These embeddings are fundamental to architectures based on self-attention, such as Transformers, which are otherwise permutation-invariant and thus unable to distinguish between different sequence orders without such positional information.
1. Mathematical Formulations and Variants
Absolute positional embeddings (APEs) come in several forms:
Learned Absolute Positional Embeddings:
A learnable matrix $P \in \mathbb{R}^{L_{\max} \times d}$ (with $L_{\max}$ the maximum sequence length and $d$ the embedding dimension) provides a vector $p_i$ for each discrete position $i$. Standard practice is to sum this with the token embedding $x_i$ at position $i$ to produce the initial input to the Transformer stack: $h_i^{(0)} = x_i + p_i$ (Sinha et al., 2022, Ravishankar et al., 2021, Ke et al., 2020, Huang et al., 2021).
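A minimal sketch of this lookup-and-add scheme (PyTorch-style; the module and argument names are illustrative, not taken from the cited papers):

```python
import torch
import torch.nn as nn

class LearnedAbsolutePE(nn.Module):
    """Learned absolute positional embeddings: one trainable vector per position."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # the L_max x d table

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)  # 0 .. seq_len-1
        return token_emb + self.pos_emb(positions)  # broadcasts over the batch dim
```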
Sinusoidal Absolute Positional Embeddings:
The original Transformer employed deterministic, parameter-free sinusoids:
$$\mathrm{PE}(\mathrm{pos}, 2k) = \sin\!\left(\frac{\mathrm{pos}}{10000^{2k/d}}\right), \qquad \mathrm{PE}(\mathrm{pos}, 2k+1) = \cos\!\left(\frac{\mathrm{pos}}{10000^{2k/d}}\right).$$
These have desirable properties: they can be extrapolated beyond trained sequence lengths and permit representing relative distances as linear operations—a motivation discussed in depth for both monolingual and multilingual settings (Ravishankar et al., 2021, Aggarwal, 2024, Chowdhury et al., 19 Apr 2025, Likhomanenko et al., 2021).
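The sinusoidal table can be computed directly; a short NumPy sketch of the construction above (assuming an even embedding dimension):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Parameter-free sinusoidal positional embeddings (assumes even d_model)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even dimension indices 0, 2, ...
    angles = positions / (10000.0 ** (dims / d_model))   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dims: sine
    pe[:, 1::2] = np.cos(angles)                          # odd dims: cosine
    return pe
```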
Polynomial and Advanced Orthogonal Bases:
PoPE proposes using Legendre polynomials for each position, constructing
$$\mathrm{PE}(\mathrm{pos}, n) = P_n(x_{\mathrm{pos}}),$$
where $P_n$ is the $n$-th Legendre polynomial and the $x_{\mathrm{pos}} \in [-1, 1]$ are grid samples. This yields non-periodic, orthogonal positional representations with superior discrimination in high dimensions and improved neural convergence (Aggarwal, 2024).
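A sketch of the basic construction under the definitions above, mapping positions to grid samples $x_{\mathrm{pos}} \in [-1, 1]$ and evaluating the first $d$ Legendre polynomials there; PoPE's actual normalization and scaling choices may differ:

```python
import numpy as np

def legendre_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Orthogonal-polynomial positional embeddings: dimension n holds P_n(x_pos)."""
    x = np.linspace(-1.0, 1.0, seq_len)   # one grid sample in [-1, 1] per position
    # legvander gives [P_0(x), P_1(x), ..., P_{d_model-1}(x)] for each sample x
    return np.polynomial.legendre.legvander(x, d_model - 1)  # (seq_len, d_model)
```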
Complex-Function and Fourier-Based Embeddings:
Alternative formulations—such as complex-valued embeddings (Wang et al., 2019) or random Fourier features in spatial settings (Zheng et al., 2023)—define positional functions of the form $f(j, \mathrm{pos}) = r_j\, e^{\,i(\omega_j \mathrm{pos} + \theta_j)}$ or $\gamma(\mathbf{p}) = \big[\sin(2\pi \mathbf{b}_k^{\top}\mathbf{p}),\ \cos(2\pi \mathbf{b}_k^{\top}\mathbf{p})\big]_k$, such that each element is lifted to a location-dependent phase in the complex or Fourier basis.
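A sketch of the random-Fourier-feature variant for spatial coordinates; the frequency scale `sigma` and feature count are illustrative hyperparameters, not values from the cited work:

```python
import numpy as np

def random_fourier_pe(points: np.ndarray, d_model: int, sigma: float = 1.0,
                      seed: int = 0) -> np.ndarray:
    """Lift coordinates (N, 3) to (N, d_model) sin/cos features of random projections."""
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, sigma, size=(points.shape[1], d_model // 2))  # random frequencies
    proj = 2.0 * np.pi * points @ B                                   # (N, d_model/2)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)      # (N, d_model)
```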
Augmented and Input-Dependent Embeddings:
Recent work, including CAPE and SHAPE, augments absolute embeddings with random shifts, scaling, or data-dependent modifications. In CAPE, random global shift, per-token noise, and scaling are applied to break spurious correlations and support robust extrapolation (Likhomanenko et al., 2021). SHAPE shifts absolute position embeddings during training to encourage shift invariance (Kiyono et al., 2021). In cognitive-mapping MapFormers, absolute position vectors are updated via input-dependent rotations, disentangling "where" from "what" and enabling path integration (Rambaud et al., 24 Nov 2025).
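A minimal sketch of the CAPE-style augmentation described above (global shift, per-token noise, global scaling); the ranges and their composition order here are assumptions, not the paper's exact recipe:

```python
import numpy as np

def cape_style_augment(positions: np.ndarray, max_global_shift: float = 5.0,
                       max_local_noise: float = 0.5, max_log_scale: float = 0.03,
                       rng=None) -> np.ndarray:
    """Randomly shift, jitter, and scale continuous positions during training."""
    rng = rng or np.random.default_rng()
    delta = rng.uniform(-max_global_shift, max_global_shift)                 # one shift per sequence
    noise = rng.uniform(-max_local_noise, max_local_noise, positions.shape)  # per-token jitter
    scale = np.exp(rng.uniform(-max_log_scale, max_log_scale))               # global rescaling
    return (positions + delta + noise) * scale
```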
2. Theoretical Properties and Motivations
The central motivation of absolute positional embeddings is to break the inherent permutation invariance of self-attention (Huang et al., 2021). Sinusoidal APEs' Fourier structure enables sequence position to be encoded as a combination of basis functions, allowing for generalization to unseen sequence lengths and shift-invariance via their linear recurrence relations (Ravishankar et al., 2021, Aggarwal, 2024). Learned APEs provide data-adaptive encodings, but may lose this inductive bias, suffering on inputs much longer than those seen in pretraining (Sinha et al., 2022, Datseris et al., 23 Sep 2025).
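The shift property can be checked numerically: for each sinusoidal frequency pair, moving a position by a fixed offset corresponds to a fixed 2×2 rotation, so $\mathrm{PE}(\mathrm{pos}+k)$ is a linear function of $\mathrm{PE}(\mathrm{pos})$. A small self-contained check:

```python
import numpy as np

d_model, pos, shift = 8, 7, 5
freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))

def pe_pairs(p):
    angles = p * freqs
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (d_model/2, 2) pairs

# For each frequency w, shifting the position applies a fixed 2x2 rotation to (sin, cos).
rotations = np.array([[[np.cos(shift * w), np.sin(shift * w)],
                       [-np.sin(shift * w), np.cos(shift * w)]] for w in freqs])
shifted = np.einsum('kij,kj->ki', rotations, pe_pairs(pos))
assert np.allclose(shifted, pe_pairs(pos + shift))  # PE(pos + shift) = R(shift) @ PE(pos)
```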
Advanced schemes, such as PoPE's Legendre polynomials, are non-periodic and orthogonal, preventing the highly correlated "tail" behavior of sinusoids at high embedding dimensions, which can otherwise inject spurious attention biases and hamper convergence (Aggarwal, 2024). Complex-valued and Fourier embeddings encode position as phase information, providing a continuous geometric interpretation of order (Wang et al., 2019, Zheng et al., 2023).
Structural bias, as introduced in MapFormers via input-dependent absolute embeddings, disentangles content and structural relationships, which results in more robust out-of-distribution generalization and the emergence of cognitive map-like representations (Rambaud et al., 24 Nov 2025).
3. Practical Consequences and Issues
Expressiveness and Overfitting:
Learned APEs have high expressiveness, but exhibit over-reliance on absolute indices and poor generalization under distribution shifts. Experiments show that phase-shifted input—moving the same sentence to a different absolute region—degrades zero-shot, few-shot, and fine-tuning performance across multiple model families, with loss of up to 15 points in masked-LM acceptability or prompting accuracy at large position shifts (Sinha et al., 2022). Larger models intensify this sensitivity, suggesting overparametrization exacerbates the absolute-position bias.
Compositionality and Multilinguality:
In multilingual models, learned APEs often "discover" a sinusoidal-like encoding under compression pressures, as this form enables compositional attention to relative offsets across diverse word-orders. More complex positional transforms (e.g., TUPE) lack this bias, hurting cross-lingual alignment (Ravishankar et al., 2021).
Efficiency and Extrapolation:
Learned APEs use a learnable table of size $L_{\max} \times d$ (for maximum sequence length $L_{\max}$ and embedding dimension $d$), providing fast look-up but incurring memory cost and an inability to extrapolate. Sinusoidal and polynomial approaches are parameter-free, enabling extrapolation, but are susceptible to high-dimensional redundancy (in the sinusoidal case) or may require a carefully managed polynomial order (in the polynomial case). The recently introduced ExPE method sidesteps this by overriding fixed dimensions with linearly increasing ramp functions, achieving exact extrapolation even at extreme sequence lengths with minimal parameter/filesize footprint (Datseris et al., 23 Sep 2025).
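The description of ExPE above suggests reserving a few embedding dimensions for linear ramps of the position index. The following is only a loose sketch of that idea; the dimension count, scale constant, and placement are assumptions rather than the ExPE formulation:

```python
import numpy as np

def ramp_override_pe(token_emb: np.ndarray, n_ramp_dims: int = 4,
                     scale: float = 1000.0) -> np.ndarray:
    """Overwrite the last few embedding dims with linearly increasing position ramps.

    Loose sketch of a ramp-based absolute encoding; n_ramp_dims and scale are
    illustrative assumptions, not values from the ExPE paper.
    """
    out = token_emb.copy()                      # (seq_len, d_model)
    seq_len = token_emb.shape[0]
    ramp = np.arange(seq_len)[:, None] / scale  # grows linearly, never repeats
    out[:, -n_ramp_dims:] = ramp                # override fixed dimensions
    return out
```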
Plug-and-play and Generalizability:
Augmented methods like CAPE introduce small modulations to the positions during training and can be retrofitted to existing architectures without modifying the attention computation (Likhomanenko et al., 2021). Input-dependent absolute encodings can serve as the core for unifying absolute and relative position schemes, as seen in map-based architectures (Rambaud et al., 24 Nov 2025).
Noise, Cross-terms, and Architectural Choices:
APEs are typically added to token embeddings; this induces mixed word–position cross-terms in the unnormalized attention logits that can inject noise, reduce model expressiveness, and limit performance. The TUPE approach (Ke et al., 2020) separates the parameterizations for token and position embeddings, removing these cross-terms and adding expressiveness and stability, especially for special tokens such as CLS.
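A sketch of the untying idea: compute content–content and position–position logits with separate projections and sum them, so no word–position cross-terms arise. Projection names are illustrative, and the $\sqrt{2d}$ scaling is one reasonable normalization choice rather than a claim about TUPE's exact implementation:

```python
import torch

def untied_attention_logits(x, pe, Wq, Wk, Uq, Uk):
    """Sum separate content and position logits instead of projecting (x + pe) jointly.

    x:  (seq, d) token embeddings      Wq, Wk: content projections, shape (d, d)
    pe: (seq, d) absolute positions    Uq, Uk: position projections, shape (d, d)
    """
    d = x.size(-1)
    content = (x @ Wq) @ (x @ Wk).T       # word-word term only
    position = (pe @ Uq) @ (pe @ Uk).T    # position-position term only
    return (content + position) / (2 * d) ** 0.5  # no word-position cross-terms
```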
4. Empirical Comparisons and Benchmark Results
Quantitative comparisons highlight both the advantages and limitations of absolute embeddings:
- On the SQuAD1.1 task with BERT-Base, learned absolute APEs achieve EM 81.58 / F1 88.59, vs. 83.63 / 90.53 for advanced relative or hybrid schemes—demonstrating a 2-point F1 gap (Huang et al., 2020).
- PoPE-based (orthogonal polynomial) encoding increases BLEU from 35.59 to 40.7 over sinusoidal baseline on Multi30k EN→DE, and triples convergence speed in training (Aggarwal, 2024).
- In long-context causal LMs, ExPE preserves perplexity up to 4× the training length, outperforming RoPE and sinusoidal embeddings, which degrade notably outside the training window (Datseris et al., 23 Sep 2025).
- For point cloud processing under severe out-of-distribution noise, analytic random Fourier-based (sin-cos) absolute embeddings reduce error from 52% (learned PCT) to 22% (analytical PE) on corruption benchmarks, demonstrating significant robustness gains (Zheng et al., 2023).
- Vision tasks: LOOPE’s learnable patch-ordering for sinusoidal absolute PEs improves DeiT-base classification accuracy by 3.5% and achieves 93.7% accuracy on the Three-Cell diagnostic benchmark, revealing substantially greater sensitivity to absolute position than conventional evaluations (Chowdhury et al., 19 Apr 2025).
| Method | Monolingual PPL | Multilingual Score | BLEU (MT, EN→DE) |
|---|---|---|---|
| Learned APE | 63.44 | 68.59 | 35.59 |
| Sinusoidal APE | 63.96 | 68.95 | 35.59 |
| TUPE (abs) | 58.77 | 48.07 | 36.86 |
| PoPE | — | — | 40.70 |
| ExPE | — | — | — |
Sources: (Ravishankar et al., 2021, Aggarwal, 2024, Datseris et al., 23 Sep 2025)
5. Extensions, Augmentations, and Combinations
Augmentation Schemes:
CAPE and SHAPE augment absolute positions during training with shifts, noise, and scaling, biasing the model toward relative differences and improving generalization to different sequence/patch scales (Likhomanenko et al., 2021, Kiyono et al., 2021).
Learnable Patch/Order Linearization (Vision):
LOOPE proposes optimizing the 2D→1D mapping of image patches for absolute PEs, with empirical gains in spatial representation and monotonicity preservation (Chowdhury et al., 19 Apr 2025).
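A minimal sketch of the underlying idea: assign 1D sinusoidal embeddings to image patches through an explicit 2D→1D ordering, which LOOPE makes learnable; here a fixed column-major traversal stands in for the learned permutation:

```python
from typing import Optional

import numpy as np

def ordered_patch_pe(grid_h: int, grid_w: int, d_model: int,
                     order: Optional[np.ndarray] = None) -> np.ndarray:
    """Assign 1D sinusoidal PEs to patches via an explicit 2D->1D ordering (even d_model)."""
    n = grid_h * grid_w
    if order is None:
        # column-major traversal as a stand-in for a learned permutation
        order = np.arange(n).reshape(grid_h, grid_w).T.ravel()
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # rank of each raster-indexed patch in the ordering
    dims = np.arange(0, d_model, 2)
    angles = ranks[:, None] / (10000.0 ** (dims[None, :] / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe                            # (num_patches, d_model), rows in raster order
```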
Input-Dependent and Structural Bias:
MapFormers' input-dependent absolute embeddings decouple structure ("where") from content ("what") to form cognitive maps, using rotation-based path integration and parallel cumulative sum operations (Rambaud et al., 24 Nov 2025).
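A loose 2D toy sketch of rotation-based path integration as described: each input token contributes a rotation angle, a parallel cumulative sum accumulates them, and a shared base vector is rotated accordingly. The actual MapFormer parameterization differs:

```python
import numpy as np

def path_integrated_positions(step_angles: np.ndarray, base: np.ndarray) -> np.ndarray:
    """Input-dependent absolute positions via cumulative rotations (2D toy version).

    step_angles: (seq_len,) per-token rotation angles predicted from the input
    base:        (2,) shared starting position vector
    """
    total = np.cumsum(step_angles)                       # parallel prefix sum of rotations
    cos_t, sin_t = np.cos(total), np.sin(total)
    rot = np.stack([np.stack([cos_t, -sin_t], -1),
                    np.stack([sin_t, cos_t], -1)], -2)   # (seq_len, 2, 2) rotation matrices
    return rot @ base                                    # (seq_len, 2) positions along the path
```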
Hybrid and Unified Schemes:
Hybrid mechanisms combine absolute with relative position information, add linear ramp dimensions (ExPE) or orthogonal polynomials (PoPE), or use multiplicative rather than additive interactions in attention to encode richer joint dependencies (Datseris et al., 23 Sep 2025, Aggarwal, 2024, Huang et al., 2021).
6. Limitations, Open Problems, and Best Practices
Known limitations of absolute positional embeddings include:
- Poor invariance to index shifts; models overfit to the "zero-anchored" region, failing when sequence positions are shifted or extended beyond training, unless specific care is taken (Sinha et al., 2022, Datseris et al., 23 Sep 2025).
- Biases in high-dimensional representations toward spurious attention, especially with sinusoids, due to redundant or highly correlated features (Aggarwal, 2024).
- Limited relative order representation—the difference between embeddings at positions $i$ and $j$ does not necessarily reflect their true distance unless explicit constraints or orthogonality are imposed (Sinha et al., 2022).
- Increased parameter cost for learned absolute tables at large context windows.
Advanced variants (PoPE, ExPE, CAPE, TUPE, LOOPE, MapFormer absolute/structural embeddings) directly address these limitations, improving robustness, generalization, and interpretability, sometimes at the cost of additional computation or architectural changes (Datseris et al., 23 Sep 2025, Aggarwal, 2024, Ke et al., 2020, Likhomanenko et al., 2021, Chowdhury et al., 19 Apr 2025, Rambaud et al., 24 Nov 2025).
Best practices include:
- For multilingual or cross-domain transfer, sinusoidal or polynomial (orthogonal) absolute PEs provide superior inductive bias for compositionality and alignment (Ravishankar et al., 2021, Aggarwal, 2024).
- For long context or extrapolation, fixed, linear, or orthogonal absolute encodings (ExPE, PoPE) preserve generalization (Datseris et al., 23 Sep 2025, Aggarwal, 2024).
- For high-dimensional models, avoid sinusoids for all coordinate axes—high-frequency sinusoids can collapse to near-identity, biasing attention (Aggarwal, 2024).
- When fine-tuning on downstream tasks requiring robust position handling, especially with sequence shifts or variable context, prefer relative or hybrid encodings, or augment absolute PEs with invariance-promoting mechanisms (Sinha et al., 2022, Kiyono et al., 2021).
7. Applications Beyond NLP: Vision and Point Cloud Processing
Absolute positional embeddings underpin spatial reasoning in vision transformers (ViTs) and point cloud transformers. In vision, careful ordering and frequency selection (as in LOOPE) is essential for spatial monotonicity, additive geometric tasks, and resolving locality (Chowdhury et al., 19 Apr 2025). In 3D perception, replacing learned point encoders with analytic random Fourier-based embeddings confers dramatic robustness to out-of-distribution noise and outliers, supporting robust classification and registration (Zheng et al., 2023).
The general trend is toward analytical, parameter-free or input-dependent absolute encodings—sin-cos or polynomial functions, learnable order, or cumulative-geometric constructions—often hybridized with relative, rotational, or bias-based mechanisms for maximal robustness, generalization, and interpretability.