
Input-Dependent Positional Embeddings

Updated 5 January 2026
  • Input-dependent positional embeddings are computed as functions of the input, allowing models to dynamically adjust position representations based on content and context.
  • They improve generalization and expressivity in diverse tasks such as language modeling, graph classification, and medical image analysis by adapting to structural nuances.
  • Empirical results show significant performance gains over static methods, including reduced perplexity in NLP and enhanced accuracy in navigation and organ localization tasks.

Input-dependent positional embeddings are a class of representations in which the encoding of position or location is modulated by the current input or context. Unlike static or fixed positional embeddings—such as sinusoidal or learned absolute encodings—input-dependent schemas compute position vectors as explicit functions of the input (sequence tokens, node features, image intensities, or actions). This design paradigm aims to enable models—spanning the domains of sequence modeling, graph learning, partial differential equations, and image analysis—to adapt position representations dynamically to content, thereby capturing context-sensitive or structural relationships and improving generalization, expressivity, and inductive bias.

1. Mathematical Formulations and Core Principles

The archetypal fixed positional embedding in Transformer models is a deterministic function of the index, such as the sinusoidal encoding $p_t[k] = \sin(t / 10000^{k/d})$ of Vaswani et al. In contrast, input-dependent positional embeddings generate position codes via functions $p_t = g(x_{1:t})$ which depend not only on the time/index $t$ but also on the (possibly local) input content or structural signal.
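To make the contrast concrete, the sketch below computes the fixed sinusoidal code from the formula above next to a toy input-dependent code $p_t = g(x_{1:t})$ that accumulates content-driven displacements. The projection `W` and the cumulative-sum form are illustrative assumptions, not taken from any of the cited methods.

```python
import numpy as np

def sinusoidal_pe(t, d):
    # Fixed encoding from the text, p_t[k] = sin(t / 10000^(k/d)): depends only on the index t.
    k = np.arange(d)
    return np.sin(t / 10000 ** (k / d))

def toy_input_dependent_pe(x_prefix, W):
    # Toy instance of p_t = g(x_{1:t}): project each token and accumulate the
    # displacements. W is an illustrative projection, not from any cited paper.
    steps = x_prefix @ W                 # (t, d_pos) per-token displacement
    return steps.cumsum(axis=0)[-1]      # position code for the last index t

rng = np.random.default_rng(0)
x_prefix = rng.normal(size=(5, 8))       # prefix of 5 token embeddings, dim 8
W = 0.1 * rng.normal(size=(8, 16))
print(sinusoidal_pe(t=4, d=16).shape, toy_input_dependent_pe(x_prefix, W).shape)
```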

Several technical archetypes exist:

  • Context-aware rotary embeddings (CARoPE): Replace the static phase increment in classic RoPE with a per-token, per-head frequency $f_h(x_t)$ computed from the token embedding via a bounded transformation network. The accumulated phase for dimension pair $i$ and head $h$ at position $m$ is then

$$\varphi_i^{(h)}(m) = \sum_{t=1}^{m} \left[ f_h(x_t) \right]^{i}$$

These phases parameterize cosines and sines for rotating each attention head's query/key vectors (Veisi et al., 30 Jul 2025).

  • Matrix-based path integration: As in the MapFormer architecture, a sequence of input-conditioned action matrices $M_t = \exp\left(\sum_{s=1}^{t} \Delta_s A\right)$ acts on a fixed seed vector $p_*$ to generate $p_t = M_t p_*$. Each $\Delta_s$ is a function of token $x_s$, and $A$ is a block-diagonal skew-symmetric generator, allowing the embedding to accumulate spatial or structural displacements (Rambaud et al., 24 Nov 2025).
  • Dynamic Position Encoding (DPE): The position of each token is refined by a shallow Transformer network conditioned on word and context, with auxiliary supervision to align the output with the position of the token's translation in the target sequence (Zheng et al., 2022).

A unifying trait is that the position code is not a lookup but instead computed, adapted, or accumulated as a (possibly nonlinear) function of input, prior positions, and side information. A minimal numeric sketch of the matrix-based path-integration archetype is given below.
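The following sketch illustrates input-conditioned path integration in the spirit of the MapFormer description above. It assumes scalar per-token displacements $\Delta_s$ and a generator built from illustrative frequencies; because a block-diagonal skew-symmetric generator exponentiates to block rotations, the matrix exponential can be computed in closed form. It is not the MapFormer parameterization itself.

```python
import numpy as np

def rotation_from_phases(phases):
    # exp(theta * A) for a block-diagonal skew-symmetric A is a block-diagonal
    # rotation matrix; phases[i] is the accumulated angle of the i-th 2x2 block.
    d = 2 * len(phases)
    R = np.zeros((d, d))
    c, s = np.cos(phases), np.sin(phases)
    for i in range(len(phases)):
        R[2*i:2*i+2, 2*i:2*i+2] = [[c[i], -s[i]], [s[i], c[i]]]
    return R

def path_integrated_positions(x_tokens, delta_fn, omegas, p_seed):
    # p_t = exp( (sum_{s<=t} delta(x_s)) * A ) @ p_seed, with A built from omegas.
    # delta_fn and omegas are illustrative stand-ins for the input-conditioned
    # displacements and generator, not the MapFormer parameterization.
    positions = []
    theta = 0.0
    for x in x_tokens:
        theta += delta_fn(x)
        positions.append(rotation_from_phases(theta * omegas) @ p_seed)
    return np.stack(positions)

# Toy usage: scalar "step sizes" as tokens, two frequency blocks.
omegas = np.array([1.0, 0.1])
p_seed = np.array([1.0, 0.0, 1.0, 0.0])
p = path_integrated_positions([0.5, 0.5, -1.0], lambda x: x, omegas, p_seed)
print(p.shape)  # (3, 4)
```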

2. Architectural Instantiations Across Domains

Various model architectures integrate input-dependent positional embeddings:

| Domain | Model / Approach | Mechanism |
| --- | --- | --- |
| NLP (LM/NMT) | CARoPE, DPE, MapFormer | Per-token and per-head frequency networks; input-dependent rotation matrices |
| Graphs | GAT-POS | Per-node positional vectors refined by an auxiliary neural network via graph topology |
| Imaging | Anatomical PE (APE) | Voxel-wise 3D CNN outputs embedding local radiological context |
| PDEs | Laplace Eigen PE | Coordinates embedded as functions of domain, boundary conditions, and geometric structure |
| Coordinate MLPs | Graph-Laplacian PE | Learned per-instance scales/widths via Laplacian regularization over the embedding |

Transformers: In CARoPE, the input-dependent computation is efficiently realized by an extra linear projection per token plus softplus activation and normalization, with cumulative phases enabling token- and head-specific rotations while retaining RoPE's architectural efficiency.
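As a concrete illustration of this mechanism, the sketch below projects token embeddings to bounded per-pair frequencies, accumulates them with a prefix sum, and applies the resulting input-dependent angles as standard rotary rotations (a single head is shown). The projection, the bounding transform, and the use of one frequency per dimension pair rather than powers of a single per-head frequency are simplifying assumptions, not the exact CARoPE network.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus.
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def carope_like_phases(X, W_freq):
    # X: (T, d_model) token embeddings; W_freq: (d_model, n_pairs) projection.
    # Per-token frequencies via a bounded transform, then a prefix sum over time
    # gives the accumulated phase per dimension pair. The bounding choice is an
    # assumption made for illustration.
    freqs = softplus(X @ W_freq)            # (T, n_pairs), positive
    freqs = freqs / (1.0 + freqs)           # bounded to (0, 1)
    return np.cumsum(freqs, axis=0)         # accumulated phases, shape (T, n_pairs)

def apply_rotary(Q, phases):
    # Rotate consecutive (even, odd) channel pairs of Q by the given phases,
    # exactly as in standard RoPE but with input-dependent angles.
    c, s = np.cos(phases), np.sin(phases)
    q_even, q_odd = Q[..., 0::2], Q[..., 1::2]
    out = np.empty_like(Q)
    out[..., 0::2] = q_even * c - q_odd * s
    out[..., 1::2] = q_even * s + q_odd * c
    return out

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 32, 8
X = rng.normal(size=(T, d_model))
W_freq = 0.05 * rng.normal(size=(d_model, d_head // 2))
Q = rng.normal(size=(T, d_head))
print(apply_rotary(Q, carope_like_phases(X, W_freq)).shape)  # (6, 8)
```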

Graph Neural Networks: GAT-POS computes per-node codes with a shallow MLP, initialized randomly and optimized both via context prediction (skip-gram) and the main task loss. These codes are injected as additive features to the attention mechanism, allowing the network to capture structural and functional locality beyond vanilla GAT (Ma et al., 2021).
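A hedged sketch of how per-node positional codes can enter a GAT-style attention layer is shown below; the additive injection point before the shared projection and the single-head formulation are illustrative choices, not the exact GAT-POS architecture.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_pos_layer(H, P, adj, W, a_src, a_dst):
    # H: (N, d) node features; P: (N, d) per-node positional codes (e.g. from a
    # shallow MLP); adj: (N, N) adjacency with self-loops; W: (d, d_out);
    # a_src, a_dst: (d_out,) attention parameters.
    Z = (H + P) @ W                                        # positional codes added to features
    scores = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])  # (N, N)
    scores = np.where(adj > 0, scores, -1e9)               # attend only over neighbors
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Z                                       # aggregated node representations

rng = np.random.default_rng(0)
N, d, d_out = 5, 8, 4
H, P = rng.normal(size=(N, d)), 0.1 * rng.normal(size=(N, d))
adj = np.eye(N) + (rng.random((N, N)) > 0.6)
out = gat_pos_layer(H, P, adj, rng.normal(size=(d, d_out)),
                    rng.normal(size=d_out), rng.normal(size=d_out))
print(out.shape)  # (5, 4)
```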

Medical Imaging: APE leverages a fully convolutional 3D U-Net trained self-supervisedly to enforce local isometry between embedding and anatomical space. The encoder's receptive field ensures each voxel's code depends on both its spatial coordinate and local intensity context, enabling rapid, continuous, and efficient embedding of anatomical location (Goncharov et al., 2024).
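One way to write the local-isometry objective described above is sketched below: for sampled pairs of nearby voxels, the distance between their predicted embeddings is pushed toward their physical distance. The pairing scheme and the squared-error form are assumptions for illustration; the actual APE training loss may differ in detail.

```python
import numpy as np

def local_isometry_loss(emb, coords, pair_idx, scale=1.0):
    # emb: (V, 3) predicted per-voxel embeddings; coords: (V, 3) physical voxel
    # coordinates (e.g. in mm); pair_idx: (P, 2) indices of nearby voxel pairs.
    # Penalizes mismatch between embedding-space and anatomical-space distances.
    i, j = pair_idx[:, 0], pair_idx[:, 1]
    d_emb = np.linalg.norm(emb[i] - emb[j], axis=1)
    d_phys = np.linalg.norm(coords[i] - coords[j], axis=1)
    return np.mean((d_emb - scale * d_phys) ** 2)

rng = np.random.default_rng(0)
V = 100
coords = rng.uniform(0, 50, size=(V, 3))
emb = coords + 0.5 * rng.normal(size=(V, 3))   # stand-in for U-Net outputs
pairs = rng.integers(0, V, size=(200, 2))
print(local_isometry_loss(emb, coords, pairs))
```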

Coordinate MLPs / PDEs: Position is encoded via learned, input-specific radial basis features or eigenfunctions of the Laplace–Beltrami operator, finely tuning the embedding width/scales as a function of local gradient structure or domain geometry (Kast et al., 2023, Ramasinghe et al., 2021).
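The sketch below shows the shape of such a radial-basis positional encoding for a coordinate MLP, with per-center widths left as free parameters that would be tuned per instance; the Laplacian-regularized width learning itself is omitted, and the center/width values are illustrative.

```python
import numpy as np

def gaussian_rbf_features(x, centers, widths):
    # x: (N, 1) input coordinates; centers: (K,) RBF centers; widths: (K,)
    # per-center widths (the learned, input-specific scales mentioned above).
    # Returns (N, K) positional features to feed a coordinate MLP.
    diff = x - centers[None, :]                  # (N, K)
    return np.exp(-0.5 * (diff / widths[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 64)[:, None]
centers = np.linspace(0.0, 1.0, 16)
widths = np.full(16, 0.05)                       # would be learned per signal/instance
print(gaussian_rbf_features(x, centers, widths).shape)  # (64, 16)
```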

3. Advantages and Empirical Impact

Input-dependent positional embeddings enable several benefits demonstrated empirically:

  • Context and content adaptation: By coupling position coding to input, models can natively represent context-sensitive relationships, such as action-induced displacement, structural variation, or patient-specific anatomy (Rambaud et al., 24 Nov 2025, Goncharov et al., 2024).
  • Long-context and out-of-distribution (OOD) generalization: CARoPE achieves a 54.8–62.2% reduction in perplexity over RoPE at doubled context lengths, and MapFormer obtains near-perfect OOD accuracy in navigation tasks where static relative encodings fail (Veisi et al., 30 Jul 2025, Rambaud et al., 24 Nov 2025).
  • Improved gradient stability and function representation: For coordinate-MLPs, local super-Gaussians with learned width yield smoother gradients and better function approximation than RFF, with test PSNR improvements of ~5dB on 1D signal reconstruction, and generalize across signals without parameter retuning (Ramasinghe et al., 2021).
  • Enhanced task performance: GAT-POS notably improves node classification accuracy on non-homophilic graphs, with up to +11% on Actor/Squirrel compared to standard GAT (Ma et al., 2021). APE sets new state-of-the-art for voxelwise few-shot organ localization with high data efficiency and throughput (Goncharov et al., 2024).

4. Training Methodologies and Computational Considerations

Input-dependent positional embeddings introduce additional (often lightweight) modules and some added complexity, but remain scalable.

  • CARoPE: Adds a $T \times d \times H$ linear projection (plus elementwise softplus and reciprocal) and a prefix-sum computation for phase accumulation. Cost per token scales as $O(d \cdot H)$, with rotation cost identical to RoPE. Vectorization and precomputation can mitigate overhead (Veisi et al., 30 Jul 2025).
  • DPE: Two standard Transformer encoder blocks act as the DPE module, with a parameter overhead of +7M on Transformer-Base. An auxiliary alignment loss is required in training, but not at inference. The loss function is a convex combination of the translation and order losses with $\lambda \in [0,1]$ (Zheng et al., 2022); a minimal sketch of this combination appears after this list.
  • Graph PE: GAT-POS's per-node codes are small (typically $d = 64$), and training proceeds end-to-end with joint losses: a skip-gram context loss and the main supervised classification loss (Ma et al., 2021).
  • APE: The APE U-Net infers entire voxel grids in ~0.07s per CT scan (three output channels for the whole map), with batchnorm ensuring architectural regularity (Goncharov et al., 2024).
  • Laplace eigenfunctions: For PDEs, input-dependent coding via harmonic basis provides boundary-condition satisfaction and improved convergence without explicit BC loss terms. Matrix-free Krylov solvers and active collocation sampling are used to keep time/memory practical (Kast et al., 2023).
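For concreteness, a minimal sketch of the convex loss combination mentioned in the DPE bullet is given below; which term receives the weight $\lambda$ is an assumption made for illustration, and Zheng et al. (2022) define the exact form.

```python
def dpe_style_loss(translation_loss, order_loss, lam=0.5):
    # Convex combination of the main translation loss and the auxiliary
    # position/order alignment loss, with lam in [0, 1].
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * translation_loss + lam * order_loss

print(dpe_style_loss(2.31, 0.87, lam=0.3))
```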

5. Comparison to Static and Alternative Methods

Distinctive aspects of input-dependent approaches include:

  • Versus sinusoidal/fixed encodings: Static encodings cannot break symmetries, represent dynamic interactions, or adapt to input variation. They are unable to model reordering (DPE), anatomy (APE), or spatial displacement (MapFormer) (Zheng et al., 2022, Goncharov et al., 2024, Rambaud et al., 24 Nov 2025).
  • Versus learned but input-independent encodings: Learned absolute or relative encodings, while offering flexibility, remain fixed after training and cannot reflect per-instance structural changes or context gating (Rambaud et al., 24 Nov 2025).
  • Versus random feature maps (RFF): Gaussian/Laplacian input-dependent embeddings can be tuned to gradient structure via Laplacian regularization, offering better test error and smoother gradients than RFF, which is sensitive to sampling/frequency grid selection (Ramasinghe et al., 2021); a sketch of the RFF baseline appears after this list.
  • Relative encodings (RoPE, CoPE): RoPE's input-independent rotation induces deterministic structure; CARoPE and MapFormer generalize this to context-aware, input-driven rotations (Veisi et al., 30 Jul 2025, Rambaud et al., 24 Nov 2025).
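For reference, the input-independent random Fourier feature baseline discussed in the RFF bullet can be written as below; the frequency scale (here an arbitrary $\sigma = 10$) is exactly the sampling choice the text notes RFF is sensitive to.

```python
import numpy as np

def random_fourier_features(x, B):
    # Standard RFF positional encoding: gamma(x) = [cos(2*pi*x@B), sin(2*pi*x@B)].
    # B is a fixed random frequency matrix (e.g. Gaussian with scale sigma);
    # unlike the input-dependent embeddings above, it is not adapted per instance.
    proj = 2.0 * np.pi * x @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64)[:, None]          # (N, 1) coordinates
B = rng.normal(scale=10.0, size=(1, 32))        # sigma = 10 is an arbitrary choice
print(random_fourier_features(x, B).shape)      # (64, 64)
```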

6. Limitations, Open Challenges, and Extensions

  • Computational and memory cost: Dynamic computation of embeddings can introduce moderate extra overhead, especially with large dd or deep frequency networks; vectorization partly alleviates this (Veisi et al., 30 Jul 2025).
  • Expressivity and regularization: CARoPE's one-layer frequency network is deliberately shallow to limit overfitting, but deeper MLPs, attention/batch normalization, or mixture-of-experts can further improve head-specific code balance and expressivity (Veisi et al., 30 Jul 2025).
  • Supervision requirements: DPE needs alignment data or proxy reordering targets during training, which may not be universal (Zheng et al., 2022).
  • Head/channel imbalance: For architectures with many attention heads or large channel counts, normalizing or regularizing the input-dependent scale is needed to avoid dimensional collapse (Veisi et al., 30 Jul 2025).
  • Potential directions: Bidirectional coding, dynamic mixture-of-experts, relative-bias integration, and tighter coupling to structural operators are under active research (Veisi et al., 30 Jul 2025, Rambaud et al., 24 Nov 2025).

7. Applications and Empirical Results Overview

Empirical validation spans large-scale language modeling, machine translation, graph classification, medical imaging, PDE surrogates, and beyond.

  • LLMs: CARoPE consistently reduces perplexity by 0.4–62.2% relative to RoPE, particularly for extrapolated context windows. Throughput also increases (~0.76M tokens/s vs. 0.63M for RoPE) (Veisi et al., 30 Jul 2025).
  • Cognitive navigation and path integration: MapFormer achieves >0.99 accuracy on OOD navigation tasks, learning and generalizing the group structure of spatial sequences (Rambaud et al., 24 Nov 2025).
  • Graph benchmarks: GAT-POS improves classification by up to 11% on non-homophilic graphs (Ma et al., 2021).
  • Medical image analysis: APE sets state-of-the-art for few-shot organ localization, offering 0.99 recall and up to 100-fold volume reduction (Goncharov et al., 2024).
  • Coordinate regression: Learned Laplacian embeddings yield smooth plug-in layers for coordinate MLPs, outperforming random Fourier basis on regression and stability (Ramasinghe et al., 2021).
  • PDE solving: Intrinsic harmonic embeddings automatically enforce boundary conditions and yield one to two orders of magnitude lower error on static problems compared to non-input-dependent features (Kast et al., 2023).

A plausible implication is that input-dependent positional embeddings, when carefully structured, both preserve the computational efficiency of their static predecessors and tightly couple spatial, sequential, or structural information to content—providing new inductive biases and robustness in models across a diverse set of modalities and problem settings.
