2D Rotary Positional Encodings (AS2DRoPE)
- The paper introduces AS2DRoPE as an extension of 1D RoPE to efficiently encode 2D spatial relationships using block-diagonal rotations.
- It employs parameter-efficient rotation matrices that preserve relative positional invariance, periodicity, and memory efficiency in Transformers.
- Empirical benchmarks demonstrate that AS2DRoPE achieves lower FLOPs and state-of-the-art accuracy in trajectory prediction and vision tasks compared to traditional RPE methods.
Two-dimensional rotary positional encodings (AS2DRoPE) are a class of positional encoding strategies for Transformers that generalize the one-dimensional Rotary Position Embedding (RoPE) technique to structured, two-dimensional spatial domains. Originally motivated by the need for efficient agent interaction modeling in trajectory prediction, AS2DRoPE and its variants are mathematically grounded in the representation theory of the special orthogonal group and its Lie algebra, and are designed to preserve critical properties such as relative positional invariance, periodicity, and memory efficiency. Unlike explicit relative position embeddings, which incur quadratic memory overhead, these encodings apply parameter-efficient rotations to queries and keys, enabling space and time efficiency at scale while maintaining sensitivity to 2D geometric relationships.
1. Mathematical Formulation and Foundations
AS2DRoPE extends RoPE's formulation from sequences to 2D spatial settings, so that dot-product attention depends only on relative spatial or angular displacement. In one-dimensional RoPE, channel pairs are rotated by frequency-modulated phases (fixed or learnable); the key observation is that such rotations are elements of $SO(2)$ applied across embedding channel pairs. For a scalar position $p$ and a per-pair frequency $\omega$, the rotation block is

$$R(\omega p) = \begin{pmatrix} \cos(\omega p) & -\sin(\omega p) \\ \sin(\omega p) & \cos(\omega p) \end{pmatrix}.$$

This extends to tokens in $\mathbb{R}^2$. If $p_i$ and $p_j$ are 2D coordinates, define the relative offset $\Delta p_{ij} = p_j - p_i$ with angle $\theta_{ij}$. The core transformation in AS2DRoPE introduces a uniform identity scalar $s$ (typically 1) within each rotation:

$$R(\phi, s) = \begin{pmatrix} \cos(s\phi) & -\sin(s\phi) \\ \sin(s\phi) & \cos(s\phi) \end{pmatrix}.$$

By tiling identical rotation blocks across all embedding channel pairs, the entire feature is rotated in block-diagonal fashion. For angular encoding, this rotation is parameterized by the agent headings $\phi_i$:

$$\bar{q}_i = \mathrm{BlockDiag}\big(R(\phi_i, s), \dots, R(\phi_i, s)\big)\, q_i, \qquad \bar{k}_j = \mathrm{BlockDiag}\big(R(\phi_j, s), \dots, R(\phi_j, s)\big)\, k_j,$$

yielding the dot product

$$\bar{q}_i^\top \bar{k}_j = q_i^\top\, \mathrm{BlockDiag}\big(R(\phi_j - \phi_i, s), \dots, R(\phi_j - \phi_i, s)\big)\, k_j,$$

ensuring that only the periodic relative angle $\phi_j - \phi_i$ influences the interaction. For general 2D RoPE, spatial distances are encoded by composing independent 1D RoPEs or, under a Lie-algebraic (maximal toral subalgebra) construction, by exponentiating independent rotations on orthogonal planes for the $x$ and $y$ coordinates. Extensions via orthogonal basis changes allow for coupled rotations and richer geometric structure (Zhao et al., 19 Mar 2025, Liu et al., 7 Apr 2025).
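The relative-angle property described above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation; the helper names `rotate_pairs` and `dot`, the example vectors, and the headings are illustrative assumptions.

```python
import math

def rotate_pairs(vec, phi, s=1.0):
    """Rotate every consecutive channel pair of `vec` by the angle s * phi."""
    c, sn = math.cos(s * phi), math.sin(s * phi)
    out = []
    for a, b in zip(vec[0::2], vec[1::2]):
        out += [c * a - sn * b, sn * a + c * b]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Illustrative query/key features with two channel pairs each (assumed values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
phi_i, phi_j = 0.4, 1.5

# Score with the query rotated by phi_i and the key rotated by phi_j.
base = dot(rotate_pairs(q, phi_i), rotate_pairs(k, phi_j))
# Turning both agents by the same amount keeps the relative heading fixed.
shifted = dot(rotate_pairs(q, phi_i + 2.0), rotate_pairs(k, phi_j + 2.0))
# Adding a full turn to one heading exercises the 2*pi wraparound (s = 1).
wrapped = dot(rotate_pairs(q, phi_i), rotate_pairs(k, phi_j + 2 * math.pi))
# Rotating only the key by the relative heading reproduces the same score.
relative = dot(q, rotate_pairs(k, phi_j - phi_i))
```

All four scores agree up to floating-point error, which is the sense in which only the periodic relative heading enters the attention logit.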
2. Implementation and Computational Properties
The AS2DRoPE mechanism is implemented as follows. For $N$ entities with queries $Q_i$, keys $K_i$, and headings $\phi_i$:
```
for i in 1..N:
    barQ[i] = BlockDiag(R(φ[i], s), ..., R(φ[i], s)) @ Q[i]
    barK[i] = BlockDiag(R(φ[i], s), ..., R(φ[i], s)) @ K[i]
for i in 1..N:
    for j in 1..N:
        score[i,j] = (barQ[i] @ barK[j]) / sqrt(d_k)
    α[i] = softmax(score[i, 1..N])
    O[i] = sum_j α[i,j] * V[j]
```
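The pseudocode above can be turned into a small runnable sketch. The version below uses only the Python standard library and single-head attention. The function names (`rotate_pairs`, `softmax`, `as2drope_attention`) and the toy inputs are assumptions made for illustration, and a practical implementation would use a vectorized tensor library instead.

```python
import math

def rotate_pairs(vec, phi, s=1.0):
    """Apply BlockDiag(R(phi, s), ..., R(phi, s)) to `vec` (even length)."""
    c, sn = math.cos(s * phi), math.sin(s * phi)
    out = []
    for a, b in zip(vec[0::2], vec[1::2]):
        out += [c * a - sn * b, sn * a + c * b]
    return out

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def as2drope_attention(Q, K, V, headings, s=1.0):
    """Single-head attention with heading-rotated queries and keys."""
    d_k = len(Q[0])
    bar_q = [rotate_pairs(q, phi, s) for q, phi in zip(Q, headings)]
    bar_k = [rotate_pairs(k, phi, s) for k, phi in zip(K, headings)]
    outputs = []
    for q_i in bar_q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
                  for k_j in bar_k]
        alpha = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append([sum(w * v[c] for w, v in zip(alpha, V))
                        for c in range(len(V[0]))])
    return outputs

# Toy inputs for three agents (assumed values, not taken from the paper).
Q = [[0.2, 1.0, -0.5, 0.3], [0.9, -0.1, 0.4, 0.8], [-0.7, 0.6, 0.1, -0.2]]
K = [[1.0, 0.0, 0.5, -0.5], [0.3, 0.3, -0.9, 0.2], [-0.4, 0.8, 0.6, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
headings = [0.0, 0.8, -2.1]

out = as2drope_attention(Q, K, V, headings)
# Turning the whole scene by a common angle should not change the outputs.
out_turned = as2drope_attention(Q, K, V, [h + 1.3 for h in headings])
```

Because the rotation is applied to queries and keys independently, no pairwise tensor is materialized before the score computation, which is where the memory advantage over explicit relative position embeddings comes from.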
3. Theoretical Properties and Generalization
AS2DRoPE is constructed to guarantee three critical properties:
- Relativity: The attention between positions $p_i$ and $p_j$ depends only on $p_j - p_i$ via the construction $R(p_i)^\top R(p_j) = R(p_j - p_i)$, ensuring relative positional encoding under translation and rotation (Liu et al., 7 Apr 2025).
- Reversibility: The mapping from spatial position to rotation is (locally) injective. For angular inputs, injectivity modulo the $2\pi$ period preserves information content.
- Periodicity: By using uniform scalar rotations, the encoding naturally handles angular wraparound, preserving the periodicity needed for orientation in physical domains.
Together, these properties allow the encoding to be evaluated at unseen positions (arbitrary points in $\mathbb{R}^2$) and make the attention matrices shift-invariant.
Lie-algebraic formulations clarify that, for standard 2D RoPE, the generators $B_x$ and $B_y$, with $[B_x, B_y] = 0$, span an abelian subalgebra, and optional orthogonal basis mixing enables richer, symmetric coupling of the $x$ and $y$ dimensions (Liu et al., 7 Apr 2025).
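The commuting-generator picture can be illustrated on a four-channel example. This is a sketch under stated assumptions: the generator names `B_X` and `B_Y`, the unit frequencies, and the helper functions are illustrative choices and are not taken from the cited papers.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Skew-symmetric generators on disjoint planes:
# B_X acts on channels (0, 1), B_Y acts on channels (2, 3).
B_X = [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
B_Y = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]]

def rope2d(x, y):
    """Closed form of exp(x * B_X + y * B_Y): one planar rotation per coordinate."""
    cx, sx = math.cos(x), math.sin(x)
    cy, sy = math.cos(y), math.sin(y)
    return [[cx, -sx, 0, 0], [sx, cx, 0, 0], [0, 0, cy, -sy], [0, 0, sy, cy]]

# The commutator [B_X, B_Y] vanishes, so the generators span an abelian subalgebra.
commutator_norm = max_abs_diff(matmul(B_X, B_Y), matmul(B_Y, B_X))

# Relativity: R(p1)^T R(p2) equals R(p2 - p1) for arbitrary 2D positions.
p1, p2 = (0.7, -1.9), (2.4, 0.3)
lhs = matmul(transpose(rope2d(*p1)), rope2d(*p2))
rhs = rope2d(p2[0] - p1[0], p2[1] - p1[1])
relativity_error = max_abs_diff(lhs, rhs)
```

The closed form in `rope2d` is valid precisely because the two generators commute; with non-commuting generators the exponential of a sum no longer factors into independent planar rotations.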
4. Comparison to Other 2D Rotary Encoding Schemes
| Method | Rotational Structure | Learnable Coupling | Parameter Count Growth | Core Limitation |
|---|---|---|---|---|
| AS2DRoPE | Block-diagonal, fixed angle | No | | Encodes only 2D angle, not distance |
| LieRE | Blockwise, learnable linear | Yes | | Requires trig per-token (Ostmeier et al., 14 Jun 2024) |
| GeoPE | Quaternion-based, 3D coupled | Implied (quaternion axis) | | Non-commutativity in 3D; extra overhead |
AS2DRoPE closely parallels the standard 2D RoPE approach, but focuses on angular relations, which are critical for agent interactions. LieRE generalizes this to learnable linear maps from positions into the rotation Lie algebra for each embedding block, giving higher expressive capacity; its authors report improvements in accuracy and sample/data efficiency (Ostmeier et al., 14 Jun 2024). GeoPE couples the axes rotationally via quaternion interpolation and a mean taken in the Lie algebra, which improves preservation of 2D geometric structure, particularly for vision tasks (Yao et al., 4 Dec 2025).
5. Empirical Results and Benchmark Performance
In agent-centric, query-centric trajectory prediction for autonomous driving, AS2DRoPE achieves the leading minimum average displacement error (minADE) and competitive realism scores on the Waymo SimAgent leaderboard:
- Trajectory Accuracy: DRoPE-Traj (AS2DRoPE) minADE = 1.2626 (best), UniMM = 1.2947.
- Realism Score: DRoPE-Traj = 0.7625 (close to best 0.7702).
- Memory/FLOPs: Unlike RPE, whose memory and FLOPs grow quadratically with the number of agents $N$, AS2DRoPE matches scene-centric models in scaling, with 4–6× lower FLOPs than RPE at large dimensions (Zhao et al., 19 Mar 2025).
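The scaling contrast in the last bullet can be made concrete with simple counting. This is a back-of-the-envelope sketch only: it assumes an explicit RPE scheme stores one embedding vector per ordered agent pair, while the rotary scheme stores one heading per agent. The function names and the embedding width of 64 are hypothetical, and the counts are not measurements from the paper.

```python
def rpe_pairwise_floats(n_agents, d_embed):
    """Floats stored if one d_embed-dim embedding is kept per ordered agent pair."""
    return n_agents * n_agents * d_embed

def rotary_heading_floats(n_agents):
    """Floats stored if only one heading angle is kept per agent."""
    return n_agents

# Print the extra positional storage for a few scene sizes (assumed d_embed = 64).
for n in (16, 64, 256):
    print(n, rpe_pairwise_floats(n, 64), rotary_heading_floats(n))
```

Doubling the number of agents quadruples the pairwise count but only doubles the per-agent count, which is the quadratic-versus-linear gap the bullet refers to.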
In image classification and object detection benchmarks (ImageNet-1K, COCO), more expressive 2D rotary schemes (LieRE, GeoPE) further improve accuracy, data efficiency, and shape bias, demonstrating that 2D rotary position encodings are beneficial across a range of high-dimensional structured domains (Yao et al., 4 Dec 2025, Ostmeier et al., 14 Jun 2024).
6. Limitations, Extensions, and Future Directions
While AS2DRoPE is highly parameter- and compute-efficient, its key limitation is that it encodes only angular (directional) differences and not the full 2D displacement vector. To recover general 2D structure, it is standard to combine angle encodings (e.g., agent headings) with separate spatial RoPE or explicit distance encodings. Potential enhancements include:
- Learning block-dependent scalars $s_b$ instead of a fixed uniform $s$.
- Joint 2D polar encoding: representing distance and angle $(r, \theta)$ jointly in a complex-valued embedding.
- Extension to 3D agent orientation by $SO(3)$ block-diagonal representations, as in the generalizations to higher dimensions discussed in (Liu et al., 7 Apr 2025).
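The first enhancement, block-dependent scalars, can be sketched as follows. This is a hypothetical illustration, not a published design: the helper `rotate_pairs_blockwise` and the chosen scalar values are assumptions. It shows one consequence that follows directly from the trigonometry: the $2\pi$ wraparound is preserved when every scalar is an integer and is broken otherwise.

```python
import math

def rotate_pairs_blockwise(vec, phi, scalars):
    """Rotate channel pair b of `vec` by the angle scalars[b] * phi."""
    out = []
    for a, b, s in zip(vec[0::2], vec[1::2], scalars):
        c, sn = math.cos(s * phi), math.sin(s * phi)
        out += [c * a - sn * b, sn * a + c * b]
    return out

def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

v = [1.0, 0.0, 0.0, 1.0]
phi = 0.9

# Integer scalars: adding a full turn to the heading leaves the feature unchanged.
integer_gap = max_abs_diff(
    rotate_pairs_blockwise(v, phi, [1, 2]),
    rotate_pairs_blockwise(v, phi + 2 * math.pi, [1, 2]),
)
# A non-integer scalar (0.5) turns the second pair by an extra half turn instead.
fractional_gap = max_abs_diff(
    rotate_pairs_blockwise(v, phi, [1, 0.5]),
    rotate_pairs_blockwise(v, phi + 2 * math.pi, [1, 0.5]),
)
```

If the scalars were learned without constraint, the periodicity property would therefore have to be enforced separately, for example by restricting them to integers.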
Schemes such as LieRE and GeoPE introduce learnable or geometrically coupled bases, offering increased expressiveness to capture complex interactions among spatial dimensions (Ostmeier et al., 14 Jun 2024, Yao et al., 4 Dec 2025).
7. Significance and Broader Context
AS2DRoPE lets parameter- and memory-efficient Transformer architectures model relative spatial relationships, particularly angular differences, in settings where spatial invariance and locality are paramount (e.g., autonomous driving, multi-agent systems, images). Its Lie-theoretic foundation provides mathematical tractability and the ability to evaluate the encoding at unseen positions, while the benchmarks above report leading trajectory-prediction accuracy for AS2DRoPE and accuracy gains in vision tasks for related rotary schemes. The continued development of richer, higher-dimensional rotary encodings and their integration with learnable geometric structures highlight the ongoing convergence between geometric deep learning and positional encoding research (Zhao et al., 19 Mar 2025, Liu et al., 7 Apr 2025, Yao et al., 4 Dec 2025, Ostmeier et al., 14 Jun 2024).