
Fractional Positional Encoding (FPE) in Transformers

Updated 6 January 2026
  • Fractional Positional Encoding (FPE) is a continuous encoding strategy that replaces integer-based positions with fixed, context-aware embeddings for dynamic sequence models.
  • FPE eliminates costly re-encoding by assigning fixed embeddings during token insertions, reducing computational overhead and boosting efficiency in Transformer models.
  • FPE enhances cross-modal synchronization by encoding true temporal information, leading to measurable improvements in tasks like video-to-text translation.

Fractional Positional Encoding (FPE) is a class of positional encoding strategies for neural sequence models, notably Transformers, that relax the reliance on integer-indexed absolute positions and instead employ learned, continuous, or real-valued encodings. These encodings facilitate efficient representation of token or feature positions in architectures or data modalities where standard absolute or relative positional encodings are suboptimal. The two prototypical applications are insertion-based generative models, where token insertion may occur at arbitrary sequence positions, and synchronized multi-modal streams, such as audio-visual data, where disparate sampling rates confound naive position-indexing of features (Zhang et al., 2021; Harzig et al., 2021).

1. Motivation and Limitations of Standard Positional Encoding

Standard absolute positional encodings, as used in the original Transformer, assign each token in a sequence a position-indexed embedding—either via deterministic sinusoids or learned vectors. For left-to-right decoding, this strategy permits caching and reuse of hidden representations since the absolute position of any token never changes after being emitted. In contexts where generation or processing departs from strictly monotonic progression, this approach exhibits severe inefficiencies or misalignments.

In insertion-based decoding, where tokens may be inserted at arbitrary positions within a hypothesis, the assignment of integer indices to positions causes downstream tokens’ positional embeddings to change after every insertion. This necessitates complete re-encoding of all subsequent tokens with new position indices at every step, resulting in $O(n \cdot S)$ computational cost for sequences of length $n$ over $S$ steps, a regime that often negates the intended parallelism advantages of insertion-based generation (Zhang et al., 2021). In synchronized multi-modal scenarios, such as video-to-text, integer position indices fail to reflect actual time alignments between modalities with different frame rates, a shortcoming that leads to loss of temporal correspondence and suboptimal cross-modal fusion (Harzig et al., 2021).
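For the insertion case, the re-encoding cost can be made concrete with a small counting sketch. The helper name `count_reindexed` and the example insertion slots are illustrative assumptions, not taken from the cited paper; the code only counts how many existing tokens receive a new integer index when tokens are inserted one at a time.

```python
def count_reindexed(initial_length, insertion_slots):
    """Count tokens whose integer position changes over a series of insertions.

    insertion_slots[k] is the index at which the k-th new token is inserted
    (0 = before the first token). Every token at or after that index shifts
    right by one and would need a fresh absolute positional embedding.
    """
    length = initial_length
    shifted = 0
    for slot in insertion_slots:
        if not 0 <= slot <= length:
            raise ValueError("slot outside the current sequence")
        shifted += length - slot  # tokens to the right of the slot move by one
        length += 1
    return shifted


# Hypothetical example: start from 4 tokens and always insert at the front.
total_shifted = count_reindexed(4, [0, 0, 0])
print(total_shifted)  # 4 + 5 + 6 = 15 re-indexed tokens
```

If position vectors were instead fixed at insertion time, the same three insertions would touch only the three new tokens.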

Fractional Positional Encoding methods address these limitations by grounding position representations in continuous, context- or time-aware paradigms, circumventing the need for global realignment and enabling more robust positional semantics.

2. Formal Definitions: FPE in Insertion Transformers and Crossmodal Synchronization

Insertion-based Generation

Let $H$ denote a partially generated sequence of tokens $(w_0=\langle\mathrm{BOS}\rangle, w_1, \ldots, w_{n_t}, w_{n_t+1}=\langle\mathrm{EOS}\rangle)$ at step $t$, each associated with positional vectors $(p_0=p_B, p_1, \ldots, p_{n_t}, p_{n_t+1}=p_E)$, where $p_B$ and $p_E$ are learned boundary embeddings. When inserting a new token $w_\text{new}$ between tokens at positions $i$ and $i+1$, FPE defines its positional vector as

$$p_\text{new} = f(p_i, p_{i+1}),$$

where $f$ is a learned affine map,

$$f(p_i, p_{i+1}) = W\,[p_i; p_{i+1}] + b,$$

with $[\cdot\,;\cdot]$ denoting concatenation and $W$, $b$ the learned parameters.

The crucial property is that $p_\text{new}$, once assigned, remains unchanged, irrespective of future insertions. Token input embeddings to the Transformer are then the sum of each token's embedding and its positional vector, $\mathrm{emb}(w_k) + p_k$ (Zhang et al., 2021).
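A minimal sketch of this rule in plain Python, assuming the affine map acts on the concatenation of the two neighbouring position vectors. The names `affine_position`, `W`, and `b`, the toy dimension of 2, and the random and constant initial values are illustrative assumptions; in a trained model `W`, `b`, and the boundary vectors would be learned.

```python
import random


def affine_position(p_left, p_right, W, b):
    """Return W @ [p_left; p_right] + b as a new list (inputs are not modified)."""
    x = p_left + p_right  # list concatenation, length 2d
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


random.seed(0)
d = 2  # toy embedding dimension (assumption)
W = [[random.uniform(-1, 1) for _ in range(2 * d)] for _ in range(d)]
b = [random.uniform(-1, 1) for _ in range(d)]

# Boundary vectors stand in for the learned embeddings of BOS and EOS.
positions = [[0.0] * d, [1.0] * d]

# Insert a first token between BOS and EOS, then a second between BOS and it.
first = affine_position(positions[0], positions[1], W, b)
positions.insert(1, first)
snapshot = [list(p) for p in positions]

second = affine_position(positions[0], positions[1], W, b)
positions.insert(1, second)

# Vectors assigned earlier are untouched by the later insertion.
assert positions[0] == snapshot[0]
assert positions[2] == snapshot[1]
assert positions[3] == snapshot[2]
```

The second insertion changes the list order but not any previously assigned vector, which is what allows cached hidden states to stay valid.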

Synchronized Audio-Visual Streams

For two data streams, video frames $v_0, \ldots, v_{N_v-1}$ and audio features $a_0, \ldots, a_{N_a-1}$, with respective frame counts $N_v$ and $N_a$ over total clip duration $T$, FPE computes a real-valued position $t^v_i = i \cdot T / N_v$ for video (where $i = 0, \ldots, N_v - 1$), and $t^a_j = j \cdot T / N_a$ for audio (with $j = 0, \ldots, N_a - 1$). These real-valued times are then used in the sinusoidal positional encoding:

$$\mathrm{PE}(t, 2k) = \sin\!\left(\frac{t}{10000^{2k/d}}\right), \qquad \mathrm{PE}(t, 2k+1) = \cos\!\left(\frac{t}{10000^{2k/d}}\right),$$

with $d$ the embedding dimension. This ensures that each modality’s feature embedding reflects its true time-of-occurrence, providing the self-attention mechanism direct access to precise inter-modal temporal relationships (Harzig et al., 2021).
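As a worked illustration with hypothetical numbers, assume a clip of duration $T = 2$ s containing $N_v = 4$ uniformly sampled video frames and $N_a = 8$ uniformly sampled audio features, with positions measured in seconds from the start of the clip. None of these values come from the cited paper, and uniform sampling from time zero is an assumption of this example.

```latex
\begin{aligned}
t^{v}_{i} &= i \cdot \tfrac{T}{N_v} = 0.5\,i,  & i &= 0,\ldots,3 &&\Rightarrow\; \{0,\ 0.5,\ 1.0,\ 1.5\} \\
t^{a}_{j} &= j \cdot \tfrac{T}{N_a} = 0.25\,j, & j &= 0,\ldots,7 &&\Rightarrow\; \{0,\ 0.25,\ \ldots,\ 1.75\} \\
t^{v}_{2} &= t^{a}_{4} = 1.0 &&&&\text{(same instant, hence the same positional encoding)}
\end{aligned}
```

Under integer indexing, video frame 2 would instead share a position with audio feature 2, which occurs at 0.5 s rather than 1.0 s.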

3. Core Algorithms and Integration into Transformer Architectures

Insertion Transformer with FPE

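The FPE insertion procedure modifies only the input-embedding stage and remains compatible with conventional multi-head attention and Transformer layers. A decoding iteration (Zhang et al., 2021) proceeds as follows: the model selects the slots and tokens to insert; each new token receives a positional vector computed by the learned affine map from the vectors of its left and right neighbours; the token embedding and positional vector are summed to form the input; and the new tokens are encoded, with the states of all existing tokens taken from the cache.

Only newly inserted tokens therefore need to be encoded; all prior token states are cached and reused.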
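The caching behaviour can be sketched with a toy loop that counts encoder calls. Everything here is a stand-in: `encode_token` is a placeholder for running one token through the Transformer layers, the `midpoint` rule replaces the learned affine map with one fixed affine choice, and the insertion plan is hard-coded rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    token: str
    position: tuple  # fixed once assigned
    state: str       # cached hidden state (placeholder)


encode_calls = 0


def encode_token(token, position):
    """Placeholder for a Transformer pass over one new token."""
    global encode_calls
    encode_calls += 1
    return f"h({token})"


def midpoint(p_left, p_right):
    """Stand-in for the learned affine map: the average of the two neighbours."""
    return tuple((a + b) / 2 for a, b in zip(p_left, p_right))


def insert_step(hypothesis, insertions):
    """Apply (slot, token) insertions; slot k means between entries k and k+1."""
    # Process right-to-left so lower slot indices stay valid within the step.
    for slot, token in sorted(insertions, reverse=True):
        pos = midpoint(hypothesis[slot].position, hypothesis[slot + 1].position)
        hypothesis.insert(slot + 1, Entry(token, pos, encode_token(token, pos)))


hyp = [Entry("<BOS>", (0.0, 0.0), "h(<BOS>)"), Entry("<EOS>", (1.0, 1.0), "h(<EOS>)")]
insert_step(hyp, [(0, "world")])
insert_step(hyp, [(0, "hello"), (1, "!")])

print([e.token for e in hyp])  # ['<BOS>', 'hello', 'world', '!', '<EOS>']
print(encode_calls)            # 3: one call per inserted token
```

The encoder is called once per inserted token, and the position and cached state of "world" are never recomputed even though tokens are later inserted on both sides of it.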

Audio-Visual Synchronization

The FPE approach for audio-visual fusion computes each feature’s position from its actual time of occurrence, constructs the corresponding sinusoidal vectors, applies linear projections, and concatenates the two streams into one sequence before Transformer encoding (Harzig et al., 2021). This preserves temporal relationships both across and within modalities.
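The pipeline described above can be sketched in plain Python under several stated assumptions: both streams are sampled uniformly starting at time zero, positions are measured in seconds, and the sinusoidal vector is added to the projected feature. The function names `sinusoidal`, `project`, and `fuse`, as well as all feature values and weights, are hypothetical and chosen only to keep the example small.

```python
import math


def sinusoidal(t, d):
    """Sinusoidal encoding of a real-valued position t into d dimensions (d even)."""
    vec = []
    for k in range(d // 2):
        angle = t / (10000 ** (2 * k / d))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


def project(x, W):
    """Linear projection W @ x for a feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fuse(video, audio, duration, W_v, W_a, d):
    """Project each stream, add time-based encodings, and concatenate."""
    fused = []
    for feats, W in ((video, W_v), (audio, W_a)):
        step = duration / len(feats)  # assumes uniform sampling from t = 0
        for i, x in enumerate(feats):
            pe = sinusoidal(i * step, d)
            fused.append([h + p for h, p in zip(project(x, W), pe)])
    return fused


d = 4
video = [[1.0, 0.0]] * 2  # 2 video frames, feature size 2 (toy values)
audio = [[0.5]] * 4       # 4 audio features, feature size 1 (toy values)
W_v = [[0.1, 0.2]] * d    # d x 2 projection (toy weights)
W_a = [[0.3]] * d         # d x 1 projection (toy weights)

sequence = fuse(video, audio, duration=2.0, W_v=W_v, W_a=W_a, d=d)
print(len(sequence), len(sequence[0]))  # 6 4
```

In this toy setup the second video frame and the third audio feature both occur at 1.0 s, so they receive identical positional vectors despite having different integer indices within their streams.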

4. Computational and Empirical Impact

Adopting FPE in insertion-based decoders eliminates the $O(n)$ per-step re-encoding overhead of absolute PE: each of the $n$ total tokens passes through the $L$ layers of dimension $d$ only once, so the overall generation cost matches that of a single left-to-right pass. For example, on WMT14 En→De (Zhang et al., 2021):

| Model | FLOPs/Instance | BLEU | Latency (ms) |
|---|---|---|---|
| ABS Insertion | 8.69 B | 27.45 | 100.3 |
| REL Insertion | 4.68 B | 27.40 | 105.0 |
| FPE Insertion | 4.65 B | 27.47 | 97.2 |
| L2R | n/a | 27.72 | 230.1 |

FPE achieves BLEU comparable to the absolute (ABS) and relative (REL) encoding baselines. Relative to ABS it cuts FLOPs per instance by about 46% (8.69 B to 4.65 B), while its FLOP count is essentially the same as that of REL (4.68 B). Single-instance latency in the table above is about 3% lower than ABS and about 7% lower than REL. The advantage persists as batch size increases: at batch size 6K, FPE takes 2.5 ms/instance, compared with 2.8 ms for REL and 4.6 ms for ABS (Zhang et al., 2021).
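The relative differences implied by the table can be checked directly. The helper `relative_reduction` is an illustrative name; the numbers are copied from the table above.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


flops = {"ABS": 8.69, "REL": 4.68, "FPE": 4.65}      # billions per instance
latency = {"ABS": 100.3, "REL": 105.0, "FPE": 97.2}  # ms, single instance

for name in ("ABS", "REL"):
    print(
        name,
        round(relative_reduction(flops[name], flops["FPE"]), 1),
        round(relative_reduction(latency[name], latency["FPE"]), 1),
    )
# ABS 46.5 3.1
# REL 0.6 7.4
```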

For video-to-text translation, introducing FPE for both video and audio streams leads to measurable improvements on the VATEX dataset: adding FPE boosts CIDEr from 56.92 to 61.80 (+8.6%), and BLEU-4 from 32.16 to 32.43. Further gains are observed after self-critical sequence training; for example, CIDEr increases from 68.62 to 70.85, and BLEU-4 from 33.92 to 36.30, providing evidence that explicit real-valued temporal encodings support more accurate multimodal alignment and sequence generation (Harzig et al., 2021).

5. Architectural and Implementation Considerations

FPE modifies only the input-encoding stage and does not necessitate changes to the self-attention kernel, Transformer layer structure, or attention-relative indexing. The only new parameters are the boundary vectors ($p_B$, $p_E$) and a single affine transformation $f$; this overhead is negligible compared to the computational demands of multi-head attention. In insertion Transformers, this enables the reuse of cached hidden states as in left-to-right models, even when tokens are inserted mid-sequence.

Batching strategies remain unchanged: all insertions across a batch can be processed by assembling the new positional vectors $p_\text{new}$ via a matrix multiply and then encoding new tokens only, regardless of slot position or step; this allows FPE-equipped models to exploit the throughput benefits of parallel token insertion (Zhang et al., 2021).

For multi-stream input, such as in video-to-text, the FPE framework provides an explicit and reliable mechanism for cross-modal synchronization at the input level, allowing transformer self-attention to operate on features with globally meaningful and modality-compatible positions (Harzig et al., 2021).

6. Empirical Advancements and Application Domains

FPE has been empirically validated across various text generation and sequence modeling domains. In insertion-based text generation, FPE restores computational efficiency, allowing insertion Transformers to match or exceed the throughput of both absolute and relative positional encoding baselines while avoiding their re-encoding cost. The empirical gains in wall-clock latency and FLOP efficiency translate into practical advantages for large-scale or real-time deployment (Zhang et al., 2021).

In video-to-text translation, integrating FPE for fine-grained synchronization of audio-visual streams leads to new state-of-the-art CIDEr and BLEU-4 scores on VATEX and on the MSR-VTT and MSVD benchmark datasets, as well as improved robustness to unseen data (Harzig et al., 2021). Precisely encoding feature timestamps proves critical for cross-modal understanding when feature rates and counts do not match.

A plausible implication is that FPE or related continuous positional schemes will be increasingly useful in applications requiring flexible, parallel, or multi-rate sequence modeling, including non-autoregressive or mixed-modality neural architectures.

FPE is conceptually related to learned absolute positional vectors and relative positional encodings, but is distinguished by its ability to avoid index reassignment and provide reusable, fixed embeddings in highly dynamic or asynchronous generation settings. Absolute PEs work well in simple monotonic decoders and relative PEs encode pairwise offsets, but in insertion-based or time-synchronized multi-stream tasks both either require expensive recomputation or lose expressivity.

Research to date has primarily focused on text and video/audio domains. Future developments may involve learning nonlinear or adaptive positional functions $f$, integrating uncertainty in temporal positions, or extending fractional encoding to hierarchical or continuous event domains. The principle of continuous, context-grounded position representation embodied by FPE offers a foundation for further innovations in flexible sequence modeling.


References:

  • "Towards More Efficient Insertion Transformer with Fractional Positional Encoding" (Zhang et al., 2021)
  • "Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation" (Harzig et al., 2021)
