Exact Positional Embeddings (ExPE) in Transformers

Updated 6 April 2026
  • Exact Positional Embeddings (ExPE) are a position encoding method that replaces the first embedding dimensions with a linear function, ensuring precise and monotonic positional signals.
  • The approach uses a fixed linear scheme (p_n = S + θ·n) without additional trainable parameters, allowing the transformer to extrapolate reliably beyond training sequence lengths.
  • Empirical evaluations demonstrate ExPE’s effectiveness in reducing perplexity on long-context tasks compared to traditional sinusoidal and rotary methods.

Exact Positional Embeddings (ExPE) are a position encoding mechanism for transformer models that provides exact, linearly extrapolatable position signals by explicitly overriding a fixed subset of each token’s embedding dimensions with a simple linear function of the absolute token position. ExPE was introduced as an alternative to traditional absolute and relative position encodings, offering substantial improvements in extrapolation to sequences whose length far exceeds the maximum observed during training. The approach operates without introducing any additional trainable parameters or requiring post-training modifications, and empirical results demonstrate substantial benefits in long-context language modeling tasks (Datseris et al., 23 Sep 2025).

1. Motivation and Background

Traditional absolute positional encodings, whether learned or based on fixed sinusoidal functions, map each input position p to a vector e_p ∈ ℝ^d, but their ability to generalize is capped by the largest p observed in the training corpus. Relative encodings, such as rotary position encodings (RoPE), focus on pairwise distances but also exhibit degradation when tasked to extrapolate to much longer contexts. ExPE addresses these liabilities by explicitly injecting numerical, position-based signals into a reserved subspace of the input embedding, using a parameter-free, linear scheme that preserves well-behaved, monotonic position values for any input length.

2. Mathematical Formulation

ExPE takes an original token embedding

x = (x_1, x_2, …, x_d) ∈ ℝ^d

and selects two key hyperparameters:

  • l: the number of embedding dimensions allocated for position overrides, with l ≪ d
  • S and θ: constants that define the start value and linear step size for the sequence of overridden values

The overridden positional vector is defined as

p_n = S + θ·n,

and the ExPE embedding at position n replaces the first l coordinates of x with p_n:

x′ = (p_n, p_n, …, p_n, x_{l+1}, …, x_d)

Typically l ≪ d, and θ is set relative to N_max, the maximum context length seen during training. For example, in small-scale settings, d = 384, l = 24, and S = 0; larger models likewise fix S = 0, with d varying by configuration.
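The override can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation; l = 24 and S = 0 follow the paper's small-scale setting, while the θ value here is a toy placeholder:

```python
import numpy as np

def expe_override(x, n, l=24, S=0.0, theta=0.5):
    """Replace the first l dims of a token embedding x (shape (d,)) with the
    exact linear position value p_n = S + theta * n for absolute position n.
    l=24 and S=0 mirror the small-scale setting; theta is a toy value."""
    out = np.asarray(x, dtype=float).copy()
    out[:l] = S + theta * n
    return out

x = np.ones(384)            # d = 384, as in the small-scale setting
y = expe_override(x, n=3)   # first 24 coords now hold p_3 = 0 + 0.5 * 3 = 1.5
```

The remaining d − l coordinates pass through untouched, so the semantic content of the embedding is preserved alongside the exact position signal.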

After this positional override, the modified embedding x′ is used immediately before the query and key projections in every transformer attention block. The resulting queries and keys are

q_n = W_Q x′_n,   k_n = W_K x′_n.

Crucially, the difference p_n − p_m = θ·(n − m) yields a perfectly linear encoding of relative distance, while the absolute position encoding remains valid and monotonic for arbitrary n.

3. Extrapolation Properties

Because each position n is mapped to p_n = S + θ·n and directly overridden into a reserved embedding subspace, ExPE maintains consistent, unboundedly monotonic positional signals for any n. The transformer learns to interpret larger values in the overridden coordinates as corresponding to positions farther from the start. At inference, nothing prevents the extension of the position override trend indefinitely, allowing evaluation on sequences far longer than seen at training time. In contrast to rotary methods, which may need frequency scaling and/or fine-tuning for extrapolation, ExPE requires no retraining or additional adaptation for long-context evaluation.
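The extrapolation claim can be checked directly: the linear rule extends strictly monotonically past any training horizon with no modification (S and θ here are toy values, assumed for illustration):

```python
# p_n = S + theta * n stays exact and strictly monotonic for any n,
# including positions far beyond a 512-token training limit.
S, theta = 0.0, 0.5
positions = [S + theta * n for n in (0, 511, 1024, 8192)]
assert all(b > a for a, b in zip(positions, positions[1:]))  # monotone beyond training
```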

4. Integration and Hyperparameter Choices

ExPE must be applied in every layer of the transformer, specifically just before each block’s query and key projections. Empirical ablation shows that inserting the override only at a single layer, or using only a single overridden dimension (l = 1), severely degrades extrapolation performance. The first l token embedding dimensions are fully reserved for the position override, while the remaining d − l retain the semantic sub-token embedding. Experiments used S and θ as fixed constants; attempts to make them learnable produced unstable training unless re-initialized to fixed values, and generated only marginal benefits at significant computational cost.

Model Scale    d (embed size)    l (override)    S    θ
Small          384               24              0    …
Medium/Large   variable          …               0    …

ExPE may also be applied to values, although application to queries and keys is fundamental.
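A per-layer integration can be sketched as follows; the shapes and weight matrices are hypothetical toy values, not the reference implementation, and show only where the override sits relative to the query/key projections:

```python
import numpy as np

def override_batch(X, l, S=0.0, theta=0.5):
    """X: (seq_len, d) embeddings entering an attention block.
    Write the linear position value p_n = S + theta * n into the
    first l dims of each row (row index = absolute position)."""
    n = np.arange(X.shape[0], dtype=float)[:, None]
    out = X.copy()
    out[:, :l] = S + theta * n
    return out

rng = np.random.default_rng(0)
seq_len, d, l = 16, 64, 4          # toy sizes; the paper's small model uses d=384, l=24
W_Q = rng.normal(size=(d, d))       # per-block query projection (toy weights)
W_K = rng.normal(size=(d, d))       # per-block key projection (toy weights)

X = rng.normal(size=(seq_len, d))
Xp = override_batch(X, l)           # applied in every block, just before Q/K
Q, K = Xp @ W_Q, Xp @ W_K           # queries/keys now carry the exact position signal
```

Because the override is re-applied in every block, the position signal cannot be washed out by intermediate layers, which matches the ablation finding that a single-layer insertion degrades extrapolation.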

5. Empirical Evaluation

Empirical analysis used small models (35M parameters) and larger models (up to 342M parameters), each trained with a 512-token context limit, and evaluated extrapolation at 2×, 4×, 8×, and 16× the maximum training length.

Small Model (35M, trained on 512 tokens)

Model Loss@1× Loss@2× Loss@4×
Sinusoidal PE 4.00 4.75 5.64
Rotary (RoPE) 3.88 4.37 5.05
ExPE 3.93 3.87 3.88

ExPE maintains, and even slightly improves, its loss as the context length doubles or quadruples.

Medium Model (342M, 512-token context)

Model Loss@1× Loss@2× Loss@4×
LLaMA Medium (RoPE) 2.63 3.45 4.55
ExPE Medium 2.63 2.59 2.71

Long-scale evaluation (up to 8192 tokens, 16× training context) highlights ExPE’s ability to sustain lower loss under extrapolation compared to other methods.

On standard zero-shot benchmarks such as HellaSwag, MMLU, and ARC, ExPE matches the performance of RoPE.

6. Limitations and Future Work

All reported experiments are constrained to models of modest size (35M–342M parameters) and relatively short training contexts (512 tokens). The effects of ExPE in larger models and with longer pretraining contexts remain unexplored. No supervised or RLHF fine-tuning regime was applied, so the behavior of ExPE in instruction-tuning or reinforcement-learning settings is undetermined. Existing long-document benchmarks may not adequately probe long-range semantic dependencies, suggesting the need for new datasets targeting such dependencies.

Planned directions include comparison with Dual Chunk Attention, Position Interpolation, and other recent extrapolation-focused position encoding schemes. Hybrid methods that combine ExPE with relative position encodings are also considered for future study.

7. Significance and Prospective Directions

ExPE offers a parameter-free, robust approach to positional encoding for transformer models, enabling precise extrapolation to sequence lengths far exceeding those seen in training. By directly overriding a compact, fixed subspace of the embedding with a linear position signal, ExPE preserves both semantic and positional information, simplifying the architecture and avoiding the difficulties of learned or frequency-based adaptations.

A plausible implication is that ExPE can serve as a baseline for transformer extrapolation to very long contexts without architectural or training-time modifications, subject to further validation in larger models and more demanding long-range tasks. Ongoing work is expected to clarify its interaction with alternative attention mechanisms and extended benchmarks, as well as its compatibility with hybrid and hierarchical position encoding schemes (Datseris et al., 23 Sep 2025).
