Fractional Positional Encoding (FPE) in Transformers
- Fractional Positional Encoding (FPE) is a continuous positional encoding strategy that replaces integer-indexed positions with context- or time-derived embeddings, suited to dynamic sequence models.
- FPE eliminates costly re-encoding by assigning each inserted token a positional embedding that stays fixed thereafter, reducing computational overhead and boosting efficiency in Transformer models.
- FPE enhances cross-modal synchronization by encoding true temporal information, leading to measurable improvements in tasks like video-to-text translation.
Fractional Positional Encoding (FPE) is a class of positional encoding strategies for neural sequence models, notably Transformers, that relax the reliance on integer-indexed absolute positions and instead employ learned, continuous, or real-valued encodings. These encodings facilitate efficient representation of token or feature positions in architectures or data modalities where standard absolute or relative positional encodings are suboptimal. The two prototypical applications are insertion-based generative models, where token insertion may occur at arbitrary sequence positions, and synchronized multi-modal streams, such as audio-visual data, where disparate sampling rates confound naive position-indexing of features (Zhang et al., 2021; Harzig et al., 2021).
1. Motivation and Limitations of Standard Positional Encoding
Standard absolute positional encodings, as used in the original Transformer, assign each token in a sequence a position-indexed embedding—either via deterministic sinusoids or learned vectors. For left-to-right decoding, this strategy permits caching and reuse of hidden representations since the absolute position of any token never changes after being emitted. In contexts where generation or processing departs from strictly monotonic progression, this approach exhibits severe inefficiencies or misalignments.
In insertion-based decoding, where tokens may be inserted at arbitrary positions within a hypothesis, the assignment of integer indices to positions causes downstream tokens’ positional embeddings to change after every insertion. This necessitates complete re-encoding of all subsequent tokens with new position indices at every step, resulting in $O(n^2)$ computational cost for sequences of length $n$ built up over successive insertion steps, a regime that often negates the intended parallelism advantages of insertion-based generation (Zhang et al., 2021). In synchronized multi-modal scenarios, such as video-to-text, integer position indices fail to reflect actual time alignments between modalities with different frame rates—a shortcoming that leads to loss of temporal correspondence and suboptimal cross-modal fusion (Harzig et al., 2021).
Fractional Positional Encoding methods address these limitations by grounding position representations in continuous, context- or time-aware paradigms, circumventing the need for global realignment and enabling more robust positional semantics.
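To make the re-encoding argument concrete, the following toy sketch (not from either paper) tallies how many token encodings are performed when an $n$-token sequence is built one insertion at a time, assuming the worst case of always inserting at the front:

```python
def encodings_needed(n: int, absolute_pe: bool) -> int:
    """Count token (re-)encodings while building an n-token sequence by insertion."""
    total, length = 0, 2                 # start from [<BOS>, <EOS>]
    for _ in range(n):
        # Absolute PE: a front insertion shifts every index, forcing a full
        # re-encode; FPE: only the newly inserted token is encoded.
        total += length + 1 if absolute_pe else 1
        length += 1
    return total

print(encodings_needed(1000, absolute_pe=True))    # 502500 -> O(n^2)
print(encodings_needed(1000, absolute_pe=False))   # 1000   -> O(n)
```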
2. Formal Definitions: FPE in Insertion Transformers and Crossmodal Synchronization
Insertion-based Generation
Let $(w_0, w_1, \dots, w_{n+1})$ denote a partially generated sequence of tokens at step $t$, each associated with positional vectors $(p_0, p_1, \dots, p_{n+1})$, where $p_0 = p_B$ and $p_{n+1} = p_E$ are learned boundary embeddings. When inserting a new token $w_{\text{new}}$ between tokens at positions $i$ and $i+1$, FPE defines its positional vector as

$$p_{\text{new}} = W \, [\, p_i \,;\, p_{i+1} \,] + b,$$

where $W \in \mathbb{R}^{d \times 2d}$ and $b \in \mathbb{R}^{d}$ form a learned affine map and $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation. The crucial property is that $p_{\text{new}}$, once assigned, remains unchanged, irrespective of future insertions. Token input embeddings to the Transformer are then $x_i = \mathrm{emb}(w_i) + p_i$ (Zhang et al., 2021).
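A minimal PyTorch sketch of this interpolation (the dimension, random initialization, and helper name `fpe_position` are illustrative, not from the paper):

```python
import torch

d = 8                                        # embedding dimension (illustrative)
affine = torch.nn.Linear(2 * d, d)           # learned map [p_left ; p_right] -> p_new
p_B, p_E = torch.randn(d), torch.randn(d)    # learned boundary embeddings

def fpe_position(p_left: torch.Tensor, p_right: torch.Tensor) -> torch.Tensor:
    # The new positional vector depends only on the two flanking positions
    # and is never recomputed, no matter how many later insertions occur.
    return affine(torch.cat([p_left, p_right], dim=-1))

p_1 = fpe_position(p_B, p_E)   # first token, inserted between <BOS> and <EOS>
p_2 = fpe_position(p_B, p_1)   # later insertion to its left reuses p_1 unchanged
```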
Synchronized Audio-Visual Streams
For two data streams—video frames $x^v_0, \dots, x^v_{n_v-1}$ and audio features $x^a_0, \dots, x^a_{n_a-1}$—with respective frame counts $n_v$ and $n_a$ over total clip duration $T$, FPE computes a real-valued position $p^v_i = i \cdot \Delta t_v$ for video (where $\Delta t_v = T / n_v$), and $p^a_j = j \cdot \Delta t_a$ for audio (with $\Delta t_a = T / n_a$). These real-valued times are then used in the sinusoidal positional encoding:

$$\mathrm{PE}(p, 2k) = \sin\!\left(\frac{p}{B^{2k/d_m}}\right), \qquad \mathrm{PE}(p, 2k+1) = \cos\!\left(\frac{p}{B^{2k/d_m}}\right),$$

with $d_m$ the embedding dimension and $B$ the sinusoid base. This ensures that each modality’s feature embedding reflects its true time-of-occurrence, providing the self-attention mechanism direct access to precise inter-modal temporal relationships (Harzig et al., 2021).
3. Core Algorithms and Integration into Transformer Architectures
Insertion Transformer with FPE
The FPE insertion procedure modifies only the input-embedding stage, remaining compatible with conventional multi-head attention and transformer layers. The following pseudocode summarizes a decoding iteration (Zhang et al., 2021):
```
H = [<BOS>, <EOS>]          # current hypothesis tokens
P = [p_B, p_E]              # fixed positional vectors (learned boundaries)
H_states = []               # cache of per-token hidden states
for step in 1..T:
    # Score all slots between adjacent tokens for candidate insertions
    scores = model.decode_slots(H, P)
    slots_to_fill = {(i, w_new_i)}               # chosen (slot, token) pairs
    new_entries = []
    for (slot i, token w_new) in slots_to_fill:
        p_left, p_right = P[i], P[i+1]
        p_new = W @ concat(p_left, p_right) + b  # FPE: fixed once assigned
        new_entries.append((i+1, w_new, p_new))
    # Insert in descending slot order so earlier indices remain valid
    for (idx, w_new, p_new) in sort_descending(new_entries):
        H.insert(idx, w_new)
        P.insert(idx, p_new)
        # Encode only the newly inserted token; all other states stay cached
        x_new = emb(w_new) + p_new
        h_new = TransformerLayers.encode(x_new)
        H_states.insert(idx, h_new)
```
This ensures that only newly inserted tokens require encoding at each step, while all prior token states are cached and reused.
Audio-Visual Synchronization
The FPE approach for audio-visual fusion computes each feature’s position based on actual time, constructs the appropriate sinusoidal vectors, applies linear projections, and concatenates the sequence before Transformer encoding. Canonical pseudocode (Harzig et al., 2021):
```
Δt_v = T / n_v                        # time between consecutive video frames
for i in 0..n_v-1:
    p = i * Δt_v                      # real-valued time of frame i
    for k in 0..d_m/2 - 1:
        PE_v[i, 2k]   = sin(p / B^(2k/d_m))
        PE_v[i, 2k+1] = cos(p / B^(2k/d_m))
    x_v_[i] = W_v @ x_v[i] + PE_v[i]  # project feature, add time-aware PE

Δt_a = T / n_a                        # same construction for audio
for j in 0..n_a-1:
    ...                               # repeat as above with W_a, x_a, PE_a

encoder_input = concat(x_v_, x_a_)    # one time-aligned multimodal sequence
output = TransformerEncoder(encoder_input)
```
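For concreteness, a runnable NumPy version of the same construction (the base $B = 10000$ follows the standard Transformer convention and, like the example feature counts, is an assumption here):

```python
import numpy as np

def fractional_sinusoidal_pe(n: int, T: float, d_m: int, B: float = 10000.0) -> np.ndarray:
    """Sinusoidal PE evaluated at the real-valued times i * (T / n)."""
    times = np.arange(n) * (T / n)                      # true time of each feature
    inv_freq = B ** (-2.0 * np.arange(d_m // 2) / d_m)  # one frequency per sin/cos pair
    angles = times[:, None] * inv_freq[None, :]         # shape (n, d_m/2)
    pe = np.empty((n, d_m))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Both streams span the same 12 s clip, so features at equal times
# receive identical positional encodings despite different rates.
PE_v = fractional_sinusoidal_pe(n=30, T=12.0, d_m=512)    # video features
PE_a = fractional_sinusoidal_pe(n=120, T=12.0, d_m=512)   # audio features
```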
4. Computational and Empirical Impact
Adopting FPE in insertion-based decoders eliminates the per-step re-encoding overhead of absolute PE, reducing overall generation complexity to $O(n \cdot L \cdot d^2)$ for $L$ layers, hidden dimension $d$, and $n$ total tokens, matching the cost of a single left-to-right pass. For example, on WMT14 En→De (Zhang et al., 2021):
| Model | FLOPs/Instance | BLEU | Latency (ms) |
|---|---|---|---|
| ABS Insertion | 8.69 B | 27.45 | 100.3 |
| REL Insertion | 4.68 B | 27.40 | 105.0 |
| FPE Insertion | 4.65 B | 27.47 | 97.2 |
| L2R | — | 27.72 | 230.1 |
FPE achieves comparable BLEU to absolute or relative encoding baselines while yielding a 40–50% reduction in floating-point operations and 10–20% reduction in wall-clock latency in both single-instance and batched decoding scenarios. As batch size increases, FPE maintains throughput benefits: at batch size 6K, FPE achieves 2.5 ms/instance, outperforming both REL (2.8 ms) and ABS (4.6 ms) (Zhang et al., 2021).
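A quick consistency check of the FLOP claim against the table above:

$$1 - \frac{4.65\ \text{B}}{8.69\ \text{B}} \approx 46.5\,\%,$$

which falls within the stated 40–50% range relative to the ABS baseline.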
For video-to-text translation, introducing FPE for both video and audio streams leads to measurable improvements on the VATEX dataset: adding FPE boosts CIDEr from 56.92 to 61.80 (+8.6%), and BLEU-4 from 32.16 to 32.43. Further gains are observed after self-critical sequence training; for example, CIDEr increases from 68.62 to 70.85, and BLEU-4 from 33.92 to 36.30, providing evidence that explicit real-valued temporal encodings support more accurate multimodal alignment and sequence generation (Harzig et al., 2021).
5. Architectural and Implementation Considerations
FPE modifies only the input-encoding stage and does not necessitate changes to the self-attention kernel, transformer layer structure, or attention-relative indexing. The only new parameters are the boundary vectors ($p_B$, $p_E$) and a single affine transformation ($W$, $b$); this overhead is negligible compared to the computational demands of multi-head attention. In insertion Transformers, this enables the reuse of cached hidden states as in left-to-right models, even when tokens are inserted mid-sequence.
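As a rough illustration of this parameter overhead (the model dimension $d = 512$ is assumed here for concreteness, not taken from the papers):

$$\underbrace{2d}_{p_B,\; p_E} + \underbrace{2d^2 + d}_{W,\; b} = 1024 + 524{,}800 \approx 0.53\ \text{M parameters},$$

a negligible fraction of the tens of millions of parameters in the attention and feed-forward stacks.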
Batching strategies remain unchanged: all insertions across a batch can be processed by assembling the new $p_{\text{new}}$ vectors via a single matrix multiply and then encoding only the new tokens, regardless of slot position or step; this allows FPE-equipped models to exploit the throughput benefits of parallel token insertion (Zhang et al., 2021).
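A minimal PyTorch sketch of that batched step (shapes and names are illustrative, not from the paper):

```python
import torch

d, num_insertions = 512, 64
W = torch.randn(d, 2 * d)               # shared learned affine map
b = torch.randn(d)
left = torch.randn(num_insertions, d)   # flanking positions gathered across the batch
right = torch.randn(num_insertions, d)

# One matrix multiply yields every new positional vector in the batch,
# regardless of which slot or decoding step each insertion belongs to.
p_new = torch.cat([left, right], dim=-1) @ W.T + b   # (num_insertions, d)
```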
For multi-stream input, such as in video-to-text, the FPE framework provides an explicit and reliable mechanism for cross-modal synchronization at the input level, allowing transformer self-attention to operate on features with globally meaningful and modality-compatible positions (Harzig et al., 2021).
6. Empirical Advancements and Application Domains
FPE has been empirically validated across various text generation and sequence modeling domains. In insertion-based text generation, FPE restores computational efficiency, allowing insertion Transformers to match or exceed the throughput of both absolute and relative positional encoding baselines while avoiding their re-encoding cost. The empirical gains in wall-clock latency and FLOP efficiency translate into practical advantages for large-scale or real-time deployment (Zhang et al., 2021).
In video-to-text translation, integrating FPE for fine-grained synchronization of audio-visual streams leads to new state-of-the-art CIDEr and BLEU-4 scores on the VATEX benchmark, as well as on the MSR-VTT and MSVD datasets, and improved robustness to unseen data (Harzig et al., 2021). Precisely encoding feature timestamps proves critical for cross-modal understanding when feature rates and counts do not match.
A plausible implication is that FPE or related continuous positional schemes will be increasingly useful in applications requiring flexible, parallel, or multi-rate sequence modeling, including non-autoregressive or mixed-modality neural architectures.
7. Related Work and Future Directions
FPE is conceptually related to learned absolute positional vectors and relative positional encodings, but is distinguished by its ability to avoid index reassignment and provide reusable, fixed embeddings in highly dynamic or asynchronous generation settings. While absolute PEs excel in simple monotonic decoders, and relative PEs encode pairwise offsets, both are ill-suited for insertion-based or time-synchronized multi-stream tasks, incurring either expensive recomputation or limited expressivity.
Research to date has primarily focused on text and video/audio domains. Future developments may involve learning nonlinear or adaptive positional functions $f(p_i, p_{i+1})$ in place of the affine map, integrating uncertainty in temporal positions, or extending fractional encoding to hierarchical or continuous event domains. The principle of continuous, context-grounded position representation embodied by FPE offers a foundation for further innovations in flexible sequence modeling.
References:
- "Towards More Efficient Insertion Transformer with Fractional Positional Encoding" (Zhang et al., 2021)
- "Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation" (Harzig et al., 2021)