Auto-Scaled 2D Rotational Positional Encoding

Updated 17 December 2025
  • AS2DRoPE is a positional encoding framework that generalizes 1D rotary embeddings to two spatial dimensions by incorporating learnable per-axis scaling.
  • It employs adaptive techniques such as per-axis amplitude scaling, Fourier/MLP modulation, and dynamic frequency modulation to maintain meaningful relative geometry.
  • The approach leverages Lie algebra and group-representational theory to ensure rotational symmetry, relativity, and reversibility, leading to improved empirical performance.

Auto-Scaled 2D Rotational Positional Encoding (AS2DRoPE) is a positional encoding framework that generalizes traditional Rotary Position Embedding (RoPE) to two spatial dimensions and augments it with adaptive, learned scaling mechanisms. AS2DRoPE preserves critical algebraic properties necessary for meaningful relative positional geometry in structured input domains while introducing end-to-end learnable parameters for per-axis frequencies or amplitudes. This approach is realized in both vision (as a geometric extension of quaternion-based rotation) and other structured tensor contexts, with practical design variants explicitly grounded in Lie algebra theory, MASA constructions, and group-representational frameworks. AS2DRoPE improves manifold-awareness, geometric symmetry, and shape bias while allowing the network to discover optimal frequency or amplitude modulations for the spatial dimensions.

1. Mathematical Formulation and Algebraic Foundations

AS2DRoPE extends RoPE from 1D sequence modeling to two spatial dimensions $(p_x, p_y)$, encoding each coordinate via rotational actions in a real vector space. Its core construction requires the identification of two commuting skew-symmetric generators $B_x, B_y \in \mathfrak{so}(d)$ forming a maximal Abelian subalgebra (MASA). The foundational requirements are:

  • Relativity: The composition of rotations at two positions is equivalent to a single rotation at their relative shift:

$$R_{(p_x^1, p_y^1)}^\top \, R_{(p_x^2, p_y^2)} = R_{(p_x^2 - p_x^1,\; p_y^2 - p_y^1)}$$

  • Reversibility: Only identical spatial positions map to identical encodings:

$$R_{(p_x^1, p_y^1)} = R_{(p_x^2, p_y^2)} \implies (p_x^1, p_y^1) = (p_x^2, p_y^2)$$

Given these, AS2DRoPE's central rotation is
$$R_{(p_x, p_y)} = \exp\bigl(\alpha_x(p_x)\, E_x + \alpha_y(p_y)\, E_y\bigr),$$
where $E_x, E_y$ generate planar rotations and $\alpha_x, \alpha_y$ are learnable, position-dependent scalings. The block-diagonal structure allows independent rotation in the two subspaces, but orthogonal basis parameterization (e.g., via $Q \in \mathrm{SO}(2N)$) can introduce controlled cross-axis interactions while preserving commutativity (Liu et al., 7 Apr 2025).
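
The construction above can be realized as block-diagonal rotations over channel pairs. The following NumPy sketch is illustrative only (the function name rope_2d, the frequency base, and the split of channels into an x-half and a y-half are assumptions, not code from the cited papers); it applies auto-scaled planar rotations to a feature vector and numerically checks the relativity property.

```python
import numpy as np

def rope_2d(x, px, py, alpha_x=1.0, alpha_y=1.0, base=100.0):
    """Illustrative 2D rotary encoding with per-axis amplitude scaling (a sketch).

    x : (d,) feature vector with d divisible by 4; the first d/2 channels are
        rotated according to the x-coordinate, the last d/2 according to y.
    """
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(0, half, 2) / half)      # static RoPE-style frequency grid
    out = x.astype(float).copy()
    for coord, scale, sl in ((px, alpha_x, slice(0, half)),
                             (py, alpha_y, slice(half, d))):
        theta = scale * coord * freqs                    # auto-scaled rotation angles
        pairs = out[sl].reshape(-1, 2)
        c, s = np.cos(theta)[:, None], np.sin(theta)[:, None]
        rotated = np.concatenate([pairs[:, :1] * c - pairs[:, 1:] * s,
                                  pairs[:, :1] * s + pairs[:, 1:] * c], axis=1)
        out[sl] = rotated.reshape(-1)
    return out

# Relativity check: <R_{p1} q, R_{p2} k> depends only on the shift (p2 - p1).
rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
s1 = rope_2d(q, 3, 5) @ rope_2d(k, 7, 2)    # relative shift (4, -3)
s2 = rope_2d(q, 0, 0) @ rope_2d(k, 4, -3)   # same relative shift
assert np.isclose(s1, s2)
```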

Quaternion-based geometric interpretation, as in GeoPE (Yao et al., 4 Dec 2025), lifts the encoding to genuine 3D rotations. Spatial positions $(p_h, p_w)$ are mapped to quaternions via
$$r_h(\theta_h) = \cos(\theta_h/2) + \sin(\theta_h/2)\,\mathbf{j}, \qquad r_w(\theta_w) = \cos(\theta_w/2) + \sin(\theta_w/2)\,\mathbf{k},$$
and their Lie-algebraic geometric mean yields an axis-symmetric, commutative rotational bias.
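
For a concrete numerical illustration of the quaternion machinery (a sketch under assumptions: the helper names quat_exp, quat_log, and geope_rotation are hypothetical, and the geometric mean is taken here as the exponential of the averaged quaternion logarithms; the exact per-axis scaling and conversion back to a group element follow the GeoPE paper), the per-axis rotations about $\mathbf{j}$ and $\mathbf{k}$ can be composed as follows.

```python
import numpy as np

def quat_exp(v):
    """Exponential map: pure-imaginary quaternion (3-vector) -> unit quaternion (w, x, y, z)."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(n)], np.sin(n) * v / n])

def quat_log(q):
    """Logarithm: unit quaternion (w, x, y, z) -> pure-imaginary part (3-vector)."""
    w, xyz = q[0], q[1:]
    n = np.linalg.norm(xyz)
    if n < 1e-12:
        return np.zeros(3)
    return np.arctan2(n, w) * xyz / n

def geope_rotation(theta_h, theta_w):
    """Sketch: combine the two axis rotations by a Lie-algebraic geometric mean."""
    r_h = quat_exp(np.array([0.0, theta_h / 2, 0.0]))   # cos(th/2) + sin(th/2) j
    r_w = quat_exp(np.array([0.0, 0.0, theta_w / 2]))   # cos(tw/2) + sin(tw/2) k
    # Average in the tangent space (Lie algebra), then map back to the group.
    return quat_exp(0.5 * (quat_log(r_h) + quat_log(r_w)))

print(geope_rotation(np.pi / 4, np.pi / 3))             # a unit quaternion mixing the two axes
```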

2. Auto-Scaling Mechanisms

AS2DRoPE incorporates auto-scaling by allowing the effective rotational phase or amplitude of each axis to be learned or adaptively modulated. Three principal approaches are described:

  1. Per-axis amplitude scaling:

$$\theta_x' = \alpha_x\,\theta_x, \quad \theta_y' = \alpha_y\,\theta_y$$

with $\alpha_x, \alpha_y$ being scalar, vector, or neural-network outputs.

  2. Learnable Fourier or MLP scaling:

$$\alpha_x(p_x) = w_x\, p_x + \sum_{k} \bigl[a_{x,k} \sin(\omega_{x,k}\, p_x) + b_{x,k} \cos(\omega_{x,k}\, p_x)\bigr]$$

generalizes the static frequency grid, introducing nonstationary, data-dependent scaling (Liu et al., 7 Apr 2025); a minimal code sketch of this scaling appears after this list.

  3. Dynamic frequency modulation: Replace static RoPE frequencies with a learned vector $\phi_i$ or parameterized functions:

$$\theta_x = p_x\, \phi_i, \quad \theta_y = p_y\, \phi_i$$

The adaptation preserves relativity and reversibility as long as the parameterization is injective and the Lie algebra generators remain linearly independent.
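
A minimal sketch of the learnable Fourier scaling from item 2 is shown below; it is illustrative only (the class name FourierScale, the initialization values, and the use of plain NumPy rather than a trainable framework are assumptions). In practice the coefficients $w_x$, $a_{x,k}$, $b_{x,k}$, and $\omega_{x,k}$ would be registered as trainable parameters and learned end to end, with the output feeding the rotation angles, e.g. $\theta_x' = \alpha_x(p_x)\,\theta_x$.

```python
import numpy as np

class FourierScale:
    """Sketch of alpha(p) = w*p + sum_k [a_k sin(omega_k p) + b_k cos(omega_k p)] (illustrative)."""

    def __init__(self, num_terms=4, seed=0):
        rng = np.random.default_rng(seed)
        self.w = 1.0                                     # linear term: identity-like scaling at init
        self.a = 0.01 * rng.normal(size=num_terms)       # small Fourier coefficients at init
        self.b = 0.01 * rng.normal(size=num_terms)
        self.omega = np.geomspace(0.1, 10.0, num_terms)  # fixed spread of modulation frequencies

    def __call__(self, p):
        return (self.w * p
                + np.sum(self.a * np.sin(self.omega * p))
                + np.sum(self.b * np.cos(self.omega * p)))

alpha_x, alpha_y = FourierScale(seed=0), FourierScale(seed=1)
print(alpha_x(3.0), alpha_y(5.0))   # position-dependent scalings for the two axes
```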

3. Group-Representational and Lie-Theoretic Perspective

AS2DRoPE is further unified in GRAPE’s group-representational framework, which models each spatial coordinate as acting through an element of $\mathrm{SO}(d)$ obtained by exponentiating learned skew-symmetric generators:

$$G_{2D}(u, v) = \exp(u\,\omega_x L^{(x)})\, \exp(v\,\omega_y L^{(y)})$$

Each plane's frequency $\omega_k = \exp(\varphi_k)$ is initialized (e.g., on a log-uniform RoPE grid) and updated end-to-end by standard gradient descent (Zhang et al., 8 Dec 2025). Applied to two commuting generators (the $x$ and $y$ planes), this construction recovers 2D RoPE or its auto-scaled variants, which can be implemented efficiently with block-diagonal rotation actions, incurring only $O(d)$ per-token computational cost and retaining streaming cacheability.
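
The block-diagonal action and its $O(d)$ per-token cost can be sketched as follows (an illustrative reading of the construction, not reference code from the cited work; the function name grape_2d and the log-uniform initialization via np.geomspace are assumptions).

```python
import numpy as np

def grape_2d(x, u, v, phi_x, phi_y):
    """Apply G_2D(u, v) to x via commuting block rotations (illustrative sketch).

    phi_x, phi_y : learned log-frequencies, one per 2D plane; omega = exp(phi).
    x            : feature vector with d = 2 * (len(phi_x) + len(phi_y)) channels.
    """
    def rotate(block, angles):
        pairs = block.reshape(-1, 2)
        c, s = np.cos(angles)[:, None], np.sin(angles)[:, None]
        return np.concatenate([pairs[:, :1] * c - pairs[:, 1:] * s,
                               pairs[:, :1] * s + pairs[:, 1:] * c], axis=1).reshape(-1)

    dx = 2 * len(phi_x)
    out = x.astype(float).copy()
    out[:dx] = rotate(x[:dx], u * np.exp(phi_x))   # exp(u * omega_x * L^(x)) on the x-planes
    out[dx:] = rotate(x[dx:], v * np.exp(phi_y))   # exp(v * omega_y * L^(y)) on the y-planes
    return out                                     # O(d) work per token

phi = np.log(np.geomspace(1.0, 1e-2, 4))           # log-uniform, RoPE-style initialization
x = np.random.default_rng(0).normal(size=16)
print(grape_2d(x, u=2.0, v=3.0, phi_x=phi, phi_y=phi))
```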

4. Empirical Evaluation and Benchmarking

Empirical studies across multiple domains confirm that AS2DRoPE outperforms or matches previously established positional encoding schemes.

  • ImageNet-1K (ViT/Swin): GeoPE achieves higher top-1 accuracy than APE, CPE, and axial RoPE variants:
    • ViT-Small: APE 79.9%, CPE 80.7%, GeoPE 81.2%
    • ViT-Base: APE 81.3%, CPE 82.2%, GeoPE 82.5%
    • ViT-Large: APE 83.3%, CPE 83.6%, GeoPE 83.9%
    • Swin-Small: RPB 83.0%, RoPE-Mixed 83.4%, GeoPE 83.5%
  • COCO detection (ViT-Base): mAP: APE 49.4, RoPE-Mixed 51.2, GeoPE 51.3.
  • S3DIS (PointTransformer): OA 90.2 → 90.5, mAcc 81.9 → 82.1, mIoU 73.5 → 74.4.
  • Language modeling (GRAPE-M, 355M, FineWeb-Edu): Learned-frequency GRAPE-M yields lower validation loss and improved zero-shot performance versus classic RoPE.
  • Trajectory modeling (DRoPE): DRoPE and AS2DRoPE maintain $O(NH(2d_k + d_v))$ complexity and provide small but consistent minADE/REALISM improvements (Zhao et al., 19 Mar 2025).

Shape bias tests indicate a notable shift toward human-like shape decisions under manifold-aware rotational encodings (Yao et al., 4 Dec 2025).

5. Implementation Variants and Practical Aspects

Several practical instantiations of AS2DRoPE are documented:

  • Quaternion-Lie averaging (GeoPE): Employ per-axis scaling before geometric mean and conversion back to group element (see explicit formulas above).
  • MASA + auto-scaling (Lie-algebraic): Learn both frequencies and basis transforms. Orthogonal change-of-basis via Cayley transform, exponentiation, or Givens rotations allows axis mixing without loss of commutativity (Liu et al., 7 Apr 2025).
  • Group-representational GRAPE-M: Define learned $L^{(x)}, L^{(y)}$, apply them via fast 2D block-rotations, and scale via $\omega_x, \omega_y$.
  • DRoPE derivatives: Scale the 2D rotary angle by an adaptive factor $s$ per agent, per dimension, or per head (Zhao et al., 19 Mar 2025).

Efficient implementation is facilitated by blockwise vector partitioning and streaming updates. Pseudocode and computational cost analyses confirm negligible overhead relative to baseline RoPE architectures.
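
As an example of the commutativity-preserving change of basis mentioned for the Lie-algebraic variant, the short check below (a sketch; the specific generator layout and the random skew-symmetric seed are assumptions) conjugates two commuting block-diagonal generators by a Cayley-transform orthogonal matrix and verifies that they still commute after axis mixing.

```python
import numpy as np

def cayley(A):
    """Cayley transform: skew-symmetric A -> orthogonal Q = (I - A)(I + A)^{-1}."""
    I = np.eye(A.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
d = 8
# Two commuting block-diagonal generators (MASA-style): x-planes (0,1),(2,3); y-planes (4,5),(6,7).
Bx, By = np.zeros((d, d)), np.zeros((d, d))
for i in (0, 2):
    Bx[i, i + 1], Bx[i + 1, i] = -1.0, 1.0
for i in (4, 6):
    By[i, i + 1], By[i + 1, i] = -1.0, 1.0

M = rng.normal(size=(d, d))
Q = cayley(M - M.T)                                 # random orthogonal change of basis
Bx_mixed, By_mixed = Q @ Bx @ Q.T, Q @ By @ Q.T     # conjugated generators mix the axes...
assert np.allclose(Q.T @ Q, np.eye(d))              # ...Q is orthogonal...
assert np.allclose(Bx_mixed @ By_mixed, By_mixed @ Bx_mixed)  # ...and they still commute
```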

6. Scope, Limitations, and Prospects

AS2DRoPE generalizes 2D positional encoding by marrying symmetry, relativity, and manifold-awareness with adaptivity in rotational frequency and amplitude. This is particularly advantageous in domains where spatial axes are coupled (vision, trajectory modeling, structured tensors) and channel-, head-, or coordinate-specific modulations are beneficial. AS2DRoPE’s design principles are supported by strict Lie-algebraic theory (MASA) (Liu et al., 7 Apr 2025), geometric averaging (Yao et al., 4 Dec 2025), and group-representational foundations (Zhang et al., 8 Dec 2025).

Limitations are primarily those of the underlying rotational positional encoding mechanisms: injectivity or monotonicity of the learned scalings must be preserved, and full mixing of spatial axes via non-commutative generators incurs additional computational cost. Empirical ablations indicate that most practical gains are realized with two commuting planes, with further cross-plane couplings providing only incremental flexibility (Zhang et al., 8 Dec 2025).

A plausible implication is that further extensions toward non-commuting mixtures, higher-dimensional structured domains, or dynamic spatial vocabularies will continue to benefit from the principled group/Lie-algebraic framework established in AS2DRoPE literature.
