
Input-Dependent Positional Embeddings

Updated 5 January 2026
  • Input-dependent positional embeddings are computed as functions of the input, allowing models to dynamically adjust position representations based on content and context.
  • They improve generalization and expressivity in diverse tasks such as language modeling, graph classification, and medical image analysis by adapting to structural nuances.
  • Empirical results show significant performance gains over static methods, including reduced perplexity in NLP and enhanced accuracy in navigation and organ localization tasks.

Input-dependent positional embeddings are a class of representations in which the encoding of position or location is modulated by the current input or context. Unlike static or fixed positional embeddings—such as sinusoidal or learned absolute encodings—input-dependent schemas compute position vectors as explicit functions of the input (sequence tokens, node features, image intensities, or actions). This design paradigm aims to enable models—spanning the domains of sequence modeling, graph learning, partial differential equations, and image analysis—to adapt position representations dynamically to content, thereby capturing context-sensitive or structural relationships and improving generalization, expressivity, and inductive bias.

1. Mathematical Formulations and Core Principles

The archetypal fixed positional embedding in Transformer models is a deterministic function of the index, such as the sinusoidal encoding $p_t[k] = \sin(t / 10000^{k/d})$ of Vaswani et al. In contrast, input-dependent positional embeddings generate position codes via functions $p_t = g(x_{1:t})$ which depend not only on the time/index $t$ but also on the (possibly local) input content or structural signal.
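To make the contrast concrete, the sketch below computes the fixed sinusoidal code from the formula above next to a toy input-dependent code $p_t = g(x_{1:t})$ that accumulates content-driven displacements. The projection `W` and the cumulative-sum form are illustrative assumptions, not taken from any of the cited methods.

```python
import numpy as np

def sinusoidal_pe(t, d):
    # Fixed encoding from the text, p_t[k] = sin(t / 10000^(k/d)): depends only on the index t.
    k = np.arange(d)
    return np.sin(t / 10000 ** (k / d))

def toy_input_dependent_pe(x_prefix, W):
    # Toy instance of p_t = g(x_{1:t}): project each token and accumulate the
    # displacements. W is an illustrative projection, not from any cited paper.
    steps = x_prefix @ W                 # (t, d_pos) per-token displacement
    return steps.cumsum(axis=0)[-1]      # position code for the last index t

rng = np.random.default_rng(0)
x_prefix = rng.normal(size=(5, 8))       # prefix of 5 token embeddings, dim 8
W = 0.1 * rng.normal(size=(8, 16))
print(sinusoidal_pe(t=4, d=16).shape, toy_input_dependent_pe(x_prefix, W).shape)
```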

Several technical archetypes exist:

  • Context-aware rotary embeddings (CARoPE): Replace the static phase increment in classic RoPE with a per-token, per-head frequency $f_h(x_t)$ computed from the token embedding via a bounded transformation network. The accumulated phase for dimension pair $i$ and head $h$ at position $m$ is then

$$\varphi_i^{(h)}(m) = \sum_{t=1}^{m} \left[ f_h(x_t) \right]^{i}$$

These phases parameterize cosines and sines for rotating each attention head's query/key vectors (Veisi et al., 30 Jul 2025).

  • Matrix-based path integration: As in the MapFormer architecture, a sequence of input-conditioned action matrices $M_t = \exp\left(\sum_{s=1}^{t} \Delta_s A\right)$ acts on a fixed seed vector $p_*$ to generate $p_t = M_t p_*$. Each $\Delta_s$ is a function of token $x_s$, and $A$ is a block-diagonal skew-symmetric generator, allowing the embedding to accumulate spatial or structural displacements (Rambaud et al., 24 Nov 2025).
  • Dynamic Position Encoding (DPE): The position of each token is refined by a shallow Transformer network conditioned on word and context, with auxiliary supervision to align the output with the position of the token's translation in the target sequence (Zheng et al., 2022).

A unifying trait is that the position code is not a lookup but instead computed, adapted, or accumulated as a (possibly nonlinear) function of input, prior positions, and side information. A minimal numeric sketch of the matrix-based path-integration archetype is given below.
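The following sketch illustrates input-conditioned path integration in the spirit of the MapFormer description above. It assumes scalar per-token displacements $\Delta_s$ and a generator built from illustrative frequencies; because a block-diagonal skew-symmetric generator exponentiates to block rotations, the matrix exponential can be computed in closed form. It is not the MapFormer parameterization itself.

```python
import numpy as np

def rotation_from_phases(phases):
    # exp(theta * A) for a block-diagonal skew-symmetric A is a block-diagonal
    # rotation matrix; phases[i] is the accumulated angle of the i-th 2x2 block.
    d = 2 * len(phases)
    R = np.zeros((d, d))
    c, s = np.cos(phases), np.sin(phases)
    for i in range(len(phases)):
        R[2*i:2*i+2, 2*i:2*i+2] = [[c[i], -s[i]], [s[i], c[i]]]
    return R

def path_integrated_positions(x_tokens, delta_fn, omegas, p_seed):
    # p_t = exp( (sum_{s<=t} delta(x_s)) * A ) @ p_seed, with A built from omegas.
    # delta_fn and omegas are illustrative stand-ins for the input-conditioned
    # displacements and generator, not the MapFormer parameterization.
    positions = []
    theta = 0.0
    for x in x_tokens:
        theta += delta_fn(x)
        positions.append(rotation_from_phases(theta * omegas) @ p_seed)
    return np.stack(positions)

# Toy usage: scalar "step sizes" as tokens, two frequency blocks.
omegas = np.array([1.0, 0.1])
p_seed = np.array([1.0, 0.0, 1.0, 0.0])
p = path_integrated_positions([0.5, 0.5, -1.0], lambda x: x, omegas, p_seed)
print(p.shape)  # (3, 4)
```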

2. Architectural Instantiations Across Domains

Various model architectures integrate input-dependent positional embeddings:

| Domain | Model / Approach | Mechanism |
| --- | --- | --- |
| NLP (LM/NMT) | CARoPE, DPE, MapFormer | Per-token and per-head frequency networks; input-dependent rotation matrices |
| Graphs | GAT-POS | Per-node positional vectors refined by an auxiliary neural network via graph topology |
| Imaging | Anatomical PE (APE) | Voxel-wise 3D CNN outputs embedding local radiological context |
| PDEs | Laplace Eigen PE | Coordinates embedded as functions of domain, boundary conditions, and geometric structure |
| Coordinate MLPs | Graph-Laplacian PE | Learned per-instance scales/widths via Laplacian regularization over the embedding |

Transformers: In CARoPE, the input-dependent computation is efficiently realized by an extra linear projection per token plus softplus activation and normalization, with cumulative phases enabling token- and head-specific rotations while retaining RoPE's architectural efficiency.
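As a concrete illustration of this mechanism, the sketch below projects token embeddings to bounded per-pair frequencies, accumulates them with a prefix sum, and applies the resulting input-dependent angles as standard rotary rotations (a single head is shown). The projection, the bounding transform, and the use of one frequency per dimension pair rather than powers of a single per-head frequency are simplifying assumptions, not the exact CARoPE network.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus.
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def carope_like_phases(X, W_freq):
    # X: (T, d_model) token embeddings; W_freq: (d_model, n_pairs) projection.
    # Per-token frequencies via a bounded transform, then a prefix sum over time
    # gives the accumulated phase per dimension pair. The bounding choice is an
    # assumption made for illustration.
    freqs = softplus(X @ W_freq)            # (T, n_pairs), positive
    freqs = freqs / (1.0 + freqs)           # bounded to (0, 1)
    return np.cumsum(freqs, axis=0)         # accumulated phases, shape (T, n_pairs)

def apply_rotary(Q, phases):
    # Rotate consecutive (even, odd) channel pairs of Q by the given phases,
    # exactly as in standard RoPE but with input-dependent angles.
    c, s = np.cos(phases), np.sin(phases)
    q_even, q_odd = Q[..., 0::2], Q[..., 1::2]
    out = np.empty_like(Q)
    out[..., 0::2] = q_even * c - q_odd * s
    out[..., 1::2] = q_even * s + q_odd * c
    return out

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 32, 8
X = rng.normal(size=(T, d_model))
W_freq = 0.05 * rng.normal(size=(d_model, d_head // 2))
Q = rng.normal(size=(T, d_head))
print(apply_rotary(Q, carope_like_phases(X, W_freq)).shape)  # (6, 8)
```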

Graph Neural Networks: GAT-POS computes per-node codes with a shallow MLP, initialized randomly and optimized both via context prediction (skip-gram) and the main task loss. These codes are injected as additive features to the attention mechanism, allowing the network to capture structural and functional locality beyond vanilla GAT (Ma et al., 2021).
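A hedged sketch of how per-node positional codes can enter a GAT-style attention layer is shown below; the additive injection point before the shared projection and the single-head formulation are illustrative choices, not the exact GAT-POS architecture.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_pos_layer(H, P, adj, W, a_src, a_dst):
    # H: (N, d) node features; P: (N, d) per-node positional codes (e.g. from a
    # shallow MLP); adj: (N, N) adjacency with self-loops; W: (d, d_out);
    # a_src, a_dst: (d_out,) attention parameters.
    Z = (H + P) @ W                                        # positional codes added to features
    scores = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])  # (N, N)
    scores = np.where(adj > 0, scores, -1e9)               # attend only over neighbors
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Z                                       # aggregated node representations

rng = np.random.default_rng(0)
N, d, d_out = 5, 8, 4
H, P = rng.normal(size=(N, d)), 0.1 * rng.normal(size=(N, d))
adj = np.eye(N) + (rng.random((N, N)) > 0.6)
out = gat_pos_layer(H, P, adj, rng.normal(size=(d, d_out)),
                    rng.normal(size=d_out), rng.normal(size=d_out))
print(out.shape)  # (5, 4)
```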

Medical Imaging: APE leverages a fully convolutional 3D U-Net trained self-supervisedly to enforce local isometry between embedding and anatomical space. The encoder's receptive field ensures each voxel's code depends on both its spatial coordinate and local intensity context, enabling rapid, continuous, and efficient embedding of anatomical location (Goncharov et al., 2024).
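One way to write the local-isometry objective described above is sketched below: for sampled pairs of nearby voxels, the distance between their predicted embeddings is pushed toward their physical distance. The pairing scheme and the squared-error form are assumptions for illustration; the actual APE training loss may differ in detail.

```python
import numpy as np

def local_isometry_loss(emb, coords, pair_idx, scale=1.0):
    # emb: (V, 3) predicted per-voxel embeddings; coords: (V, 3) physical voxel
    # coordinates (e.g. in mm); pair_idx: (P, 2) indices of nearby voxel pairs.
    # Penalizes mismatch between embedding-space and anatomical-space distances.
    i, j = pair_idx[:, 0], pair_idx[:, 1]
    d_emb = np.linalg.norm(emb[i] - emb[j], axis=1)
    d_phys = np.linalg.norm(coords[i] - coords[j], axis=1)
    return np.mean((d_emb - scale * d_phys) ** 2)

rng = np.random.default_rng(0)
V = 100
coords = rng.uniform(0, 50, size=(V, 3))
emb = coords + 0.5 * rng.normal(size=(V, 3))   # stand-in for U-Net outputs
pairs = rng.integers(0, V, size=(200, 2))
print(local_isometry_loss(emb, coords, pairs))
```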

Coordinate MLPs / PDEs: Position is encoded via learned, input-specific radial basis features or eigenfunctions of the Laplace–Beltrami operator, finely tuning the embedding width/scales as a function of local gradient structure or domain geometry (Kast et al., 2023, Ramasinghe et al., 2021).
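The sketch below shows the shape of such a radial-basis positional encoding for a coordinate MLP, with per-center widths left as free parameters that would be tuned per instance; the Laplacian-regularized width learning itself is omitted, and the center/width values are illustrative.

```python
import numpy as np

def gaussian_rbf_features(x, centers, widths):
    # x: (N, 1) input coordinates; centers: (K,) RBF centers; widths: (K,)
    # per-center widths (the learned, input-specific scales mentioned above).
    # Returns (N, K) positional features to feed a coordinate MLP.
    diff = x - centers[None, :]                  # (N, K)
    return np.exp(-0.5 * (diff / widths[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 64)[:, None]
centers = np.linspace(0.0, 1.0, 16)
widths = np.full(16, 0.05)                       # would be learned per signal/instance
print(gaussian_rbf_features(x, centers, widths).shape)  # (64, 16)
```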

3. Advantages and Empirical Impact

Input-dependent positional embeddings enable several benefits demonstrated empirically:

  • Context and content adaptation: By coupling position coding to input, models can natively represent context-sensitive relationships, such as action-induced displacement, structural variation, or patient-specific anatomy (Rambaud et al., 24 Nov 2025, Goncharov et al., 2024).
  • Long-context and out-of-distribution (OOD) generalization: CARoPE achieves a 54.8–62.2% reduction in perplexity over RoPE at doubled context lengths, and MapFormer obtains near-perfect OOD accuracy in navigation tasks where static relative encodings fail (Veisi et al., 30 Jul 2025, Rambaud et al., 24 Nov 2025).
  • Improved gradient stability and function representation: For coordinate-MLPs, local super-Gaussians with learned width yield smoother gradients and better function approximation than RFF, with test PSNR improvements of ~5dB on 1D signal reconstruction, and generalize across signals without parameter retuning (Ramasinghe et al., 2021).
  • Enhanced task performance: GAT-POS notably improves node classification accuracy on non-homophilic graphs, with up to +11% on Actor/Squirrel compared to standard GAT (Ma et al., 2021). APE sets new state-of-the-art for voxelwise few-shot organ localization with high data efficiency and throughput (Goncharov et al., 2024).

4. Training Methodologies and Computational Considerations

Input-dependent positional embeddings introduce additional (often lightweight) modules and some added complexity, but remain scalable.

  • CARoPE: Adds a $T \times d \times H$ linear projection (plus elementwise softplus and reciprocal) and a prefix-sum computation for phase accumulation. Cost per token scales as $O(d \cdot H)$, with rotation cost identical to RoPE. Vectorization and precomputation can mitigate overhead (Veisi et al., 30 Jul 2025).
  • DPE: Two standard Transformer encoder blocks act as the DPE module, with a parameter overhead of +7M on Transformer-Base. An auxiliary alignment loss is required in training, but not at inference. The loss function is a convex combination of the translation and order losses with $\lambda \in [0,1]$ (Zheng et al., 2022); a minimal sketch of this combination appears after this list.
  • Graph PE: GAT-POS's per-node codes are small (typically $d = 64$), and training proceeds end-to-end with joint losses: a skip-gram context loss and the main supervised classification loss (Ma et al., 2021).
  • APE: The APE U-Net infers entire voxel grids in ~0.07s per CT scan (three output channels for the whole map), with batchnorm ensuring architectural regularity (Goncharov et al., 2024).
  • Laplace eigenfunctions: For PDEs, input-dependent coding via harmonic basis provides boundary-condition satisfaction and improved convergence without explicit BC loss terms. Matrix-free Krylov solvers and active collocation sampling are used to keep time/memory practical (Kast et al., 2023).
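For concreteness, a minimal sketch of the convex loss combination mentioned in the DPE bullet is given below; which term receives the weight $\lambda$ is an assumption made for illustration, and Zheng et al. (2022) define the exact form.

```python
def dpe_style_loss(translation_loss, order_loss, lam=0.5):
    # Convex combination of the main translation loss and the auxiliary
    # position/order alignment loss, with lam in [0, 1].
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * translation_loss + lam * order_loss

print(dpe_style_loss(2.31, 0.87, lam=0.3))
```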

5. Comparison to Static and Alternative Methods

Distinctive aspects of input-dependent approaches include:

  • Versus sinusoidal/fixed encodings: Static encodings cannot break symmetries, represent dynamic interactions, or adapt to input variation. They are unable to model reordering (DPE), anatomy (APE), or spatial displacement (MapFormer) (Zheng et al., 2022, Goncharov et al., 2024, Rambaud et al., 24 Nov 2025).
  • Versus learned but input-independent encodings: Learned absolute or relative encodings, while offering flexibility, remain fixed after training and cannot reflect per-instance structural changes or context gating (Rambaud et al., 24 Nov 2025).
  • Versus random feature maps (RFF): Gaussian/Laplacian input-dependent embeddings can be tuned to gradient structure via Laplacian regularization, offering better test error and smoother gradients than RFF, which is sensitive to sampling/frequency grid selection (Ramasinghe et al., 2021); a sketch of the RFF baseline appears after this list.
  • Relative encodings (RoPE, CoPE): RoPE's input-independent rotation induces deterministic structure; CARoPE and MapFormer generalize this to context-aware, input-driven rotations (Veisi et al., 30 Jul 2025, Rambaud et al., 24 Nov 2025).
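For reference, the input-independent random Fourier feature baseline discussed in the RFF bullet can be written as below; the frequency scale (here an arbitrary $\sigma = 10$) is exactly the sampling choice the text notes RFF is sensitive to.

```python
import numpy as np

def random_fourier_features(x, B):
    # Standard RFF positional encoding: gamma(x) = [cos(2*pi*x@B), sin(2*pi*x@B)].
    # B is a fixed random frequency matrix (e.g. Gaussian with scale sigma);
    # unlike the input-dependent embeddings above, it is not adapted per instance.
    proj = 2.0 * np.pi * x @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64)[:, None]          # (N, 1) coordinates
B = rng.normal(scale=10.0, size=(1, 32))        # sigma = 10 is an arbitrary choice
print(random_fourier_features(x, B).shape)      # (64, 64)
```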

6. Limitations, Open Challenges, and Extensions

  • Computational and memory cost: Dynamic computation of embeddings can introduce moderate extra overhead, especially with large dd or deep frequency networks; vectorization partly alleviates this (Veisi et al., 30 Jul 2025).
  • Expressivity and regularization: CARoPE's one-layer frequency network is deliberately shallow to limit overfitting, but deeper MLPs, attention/batch normalization, or mixture-of-experts can further improve head-specific code balance and expressivity (Veisi et al., 30 Jul 2025).
  • Supervision requirements: DPE needs alignment data or proxy reordering targets during training, which may not be universal (Zheng et al., 2022).
  • Head/channel imbalance: For architectures with many attention heads or large channel counts, normalizing or regularizing the input-dependent scale is needed to avoid dimensional collapse (Veisi et al., 30 Jul 2025).
  • Potential directions: Bidirectional coding, dynamic mixture-of-experts, relative-bias integration, and tighter coupling to structural operators are under active research (Veisi et al., 30 Jul 2025, Rambaud et al., 24 Nov 2025).

7. Applications and Empirical Results Overview

Empirical validation spans large-scale language modeling, machine translation, graph classification, medical imaging, PDE surrogates, and beyond.

  • LLMs: CARoPE consistently reduces perplexity by 0.4–62.2% relative to RoPE, particularly for extrapolated context windows. Throughput also increases (~0.76M tokens/s vs. 0.63M for RoPE) (Veisi et al., 30 Jul 2025).
  • Cognitive navigation and path integration: MapFormer achieves >0.99 accuracy on OOD navigation tasks, learning and generalizing the group structure of spatial sequences (Rambaud et al., 24 Nov 2025).
  • Graph benchmarks: GAT-POS improves classification by up to 11% on non-homophilic graphs (Ma et al., 2021).
  • Medical image analysis: APE sets state-of-the-art for few-shot organ localization, offering 0.99 recall and up to 100-fold volume reduction (Goncharov et al., 2024).
  • Coordinate regression: Learned Laplacian embeddings yield smooth plug-in layers for coordinate MLPs, outperforming random Fourier basis on regression and stability (Ramasinghe et al., 2021).
  • PDE solving: Intrinsic harmonic embeddings automatically enforce boundary conditions and yield one to two orders of magnitude lower error on static problems compared to non-input-dependent features (Kast et al., 2023).

A plausible implication is that input-dependent positional embeddings, when carefully structured, both preserve the computational efficiency of their static predecessors and tightly couple spatial, sequential, or structural information to content—providing new inductive biases and robustness in models across a diverse set of modalities and problem settings.
