2D Rotary Position Embeddings (RoPE)
- 2D RoPE is a technique that generalizes rotary position encoding to spatial domains by embedding continuous, relative multi-axis position information into transformer attention.
- It leverages block-diagonal rotations and commuting skew-symmetric matrices to ensure translation invariance and robust extrapolation across varied resolutions.
- Empirical evaluations show that 2D RoPE variants improve performance in image classification, segmentation, and multimodal tasks, offering efficient and scalable positional encoding.
Two-dimensional Rotary Position Embeddings (2D RoPE) generalize the foundational concept of rotary position encoding from 1D sequences to spatial or higher-dimensional domains, enabling Transformer-based architectures to encode continuous, relative, multi-axis position information directly and efficiently into their attention mechanisms. Originating from the block-diagonal rotation formulation introduced in RoFormer, recent advances have both formalized and substantially extended the expressive capacity, robustness, and geometric fidelity of RoPE for image, multimodal, and geometric reasoning tasks.
1. Mathematical Foundations and General RoPE Equation
2D RoPE seeks a position-dependent transformation $R(p)$ such that for all positions $p_1, p_2$:

$R(p_1)^\top R(p_2) = R(p_2 - p_1)$.
This “RoPE Equation” ensures that the dot product between position-encoded queries and keys in self-attention depends only on the relative offset, yielding translation-invariant logits and seamless extrapolation to unseen resolutions or fields of view. The most general solution constructs $R(p)$ via the matrix exponential of a linear map generated by pairwise commuting skew-symmetric matrices acting on each coordinate axis:

$R(p) = \exp\!\left(\sum_k p_k B_k\right)$, with $B_k B_l = B_l B_k$ for all $k, l$ (Yu et al., 4 Jun 2025, Schenck et al., 4 Feb 2025, Liu et al., 7 Apr 2025).
In the block-diagonal case (maximal toral subalgebra), each generator $B_k$ acts on a disjoint set of $2 \times 2$ blocks, and $R(p)$ factors as independent planar rotations per axis, corresponding to standard axis-aligned 2D RoPE (Su et al., 2021, Heo et al., 2024, Schenck et al., 4 Feb 2025). If the commutativity requirement is violated, as in generic learned Lie rotations, relative-position dependence is lost and robustness deteriorates (Yu et al., 4 Jun 2025).
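As a concrete check of this structure, the following minimal NumPy sketch (positions, frequencies, and dimensions chosen arbitrarily for illustration) builds a block-diagonal $R(p)$ from per-axis planar rotations and verifies that the RoPE Equation holds exactly, because all blocks act on disjoint coordinates and therefore commute:

```python
import numpy as np

def rot2(theta):
    """A single 2x2 planar rotation block."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def block_rope(p, freqs_x, freqs_y):
    """Block-diagonal R(p): one 2x2 rotation per (axis, frequency) pair.

    All blocks act on disjoint coordinates, so the per-axis generators
    commute and R(p1)^T R(p2) = R(p2 - p1) holds exactly.
    """
    x, y = p
    blocks = [rot2(x * f) for f in freqs_x] + [rot2(y * f) for f in freqs_y]
    d = 2 * len(blocks)
    R = np.zeros((d, d))
    for i, b in enumerate(blocks):
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = b
    return R

freqs = [1.0, 0.1]                       # two illustrative frequencies per axis
p1, p2 = np.array([1.5, -2.0]), np.array([0.25, 3.0])
lhs = block_rope(p1, freqs, freqs).T @ block_rope(p2, freqs, freqs)
rhs = block_rope(p2 - p1, freqs, freqs)
assert np.allclose(lhs, rhs)             # relative law holds
```

The same check fails if the per-axis blocks are replaced by generic non-commuting rotations, which is the failure mode noted above for unconstrained learned Lie rotations.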
2. Constructive Parameterizations and Model Variants
Several parameterizations have been developed to instantiate or generalize 2D RoPE while maintaining the critical commutativity condition:
- Axis-Aligned RoPE (Block-Diagonal): Each input feature vector is split into halves, with independent 1D RoPE applied along the $x$ and $y$ axes using block-diagonal $2 \times 2$ rotations at log-uniform frequencies. This construction is computationally efficient, parameter-free, and translation-invariant, but cannot express diagonal or off-axis interactions (Su et al., 2021, Heo et al., 2024).
- ComRoPE (“Commuting RoPE”): Generalizes 2D RoPE by introducing trainable block-diagonal angle matrices $A_k$ for each coordinate, with enforced pairwise commutativity. ComRoPE-AP (axial partition) uses mutually exclusive block supports; ComRoPE-LD (linearly dependent) defines each $A_k$ as a scalar multiple of a single base generator. Both maintain relative-position robustness and empirical performance beyond fixed-frequency RoPE (Yu et al., 4 Jun 2025).
- STRING (Universal Lie Exponential): Extends RoPE to arbitrary coordinate dimensionality by exponentiating a sum of commuting skew-symmetric generators, one per coordinate axis, providing a universal construction for all translation-invariant, separable position encodings. This formalism unifies axial, diagonal, and block-structured RoPE as special cases (Schenck et al., 4 Feb 2025).
- Learned Diagonal/LieRE: LieRE introduces a fully learned linear map from displacement vectors to the Lie algebra of skew-symmetric matrices, allowing free generator weights but sacrificing the commutativity guarantee. This increases flexibility but can degrade large-offset generalization in higher-dimensional domains or under random coordinate perturbations (Ostmeier et al., 2024, Yu et al., 4 Jun 2025).
- Spiral RoPE: Overcomes axis-aligned limitations by partitioning embedding channels into multiple directional groups, each rotated according to the patch’s projection onto a uniformly distributed set of directions. This approach covers the entire frequency plane and better encodes oblique and curved spatial relationships without added parameters or computational cost (Liu et al., 3 Feb 2026).
- GeoPE (Quaternionic RoPE): Lifts 2D positions to 3D using quaternionic representations and constructs a symmetric (commuting in Lie algebra) joint rotation by averaging log-maps of axis-rotations, thus capturing true 2D spatial topology and shape bias. This avoids the “false neighbor” problem induced by flattening (Yao et al., 4 Dec 2025).
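To make the baseline axis-aligned construction from the list above concrete, here is a hedged NumPy sketch (function names, the frequency base, and the half-split convention are illustrative choices, not taken from any cited implementation). It applies 1D RoPE to each channel half and checks that attention logits depend only on the relative offset:

```python
import numpy as np

def rope_1d(v, pos, base=100.0):
    """Rotate consecutive channel pairs of v by angle pos * freq."""
    d = v.shape[-1]
    freqs = base ** (-np.arange(d // 2) / (d // 2))   # log-uniform frequencies
    ang = pos * freqs
    c, s = np.cos(ang), np.sin(ang)
    v1, v2 = v[..., 0::2], v[..., 1::2]
    out = np.empty_like(v)
    out[..., 0::2] = v1 * c - v2 * s
    out[..., 1::2] = v1 * s + v2 * c
    return out

def rope_2d(v, x, y):
    """Axis-aligned 2D RoPE: first channel half encodes x, second half y."""
    h = v.shape[-1] // 2
    return np.concatenate([rope_1d(v[..., :h], x), rope_1d(v[..., h:], y)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(16), rng.standard_normal(16)
logit = rope_2d(q, 2.0, 1.0) @ rope_2d(k, 5.0, 4.0)
# Shifting both positions by the same (7, 3) offset leaves the logit unchanged.
shifted = rope_2d(q, 2.0 + 7.0, 1.0 + 3.0) @ rope_2d(k, 5.0 + 7.0, 4.0 + 3.0)
assert np.isclose(logit, shifted)
```

The oblique-direction limitation mentioned above is visible here: every rotation angle is a function of $x$ alone or $y$ alone, so no channel responds to a diagonal displacement directly.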
3. Implementation and Computational Complexity
The practical implementation of 2D RoPE retains the linear complexity and negligible memory overhead characteristic of the 1D variant, with computational cost dominated by the final matrix multiplication in attention:
- Block-diagonal axis-aligned and ComRoPE: Per-token computational overhead is $O(d)$, for $O(Nd)$ total extra time per layer ($N$ tokens, $d$ head dimension). Additional parameters per layer scale as $O(d)$ for both ComRoPE-AP and ComRoPE-LD, modest compared to vanilla transformer parameter budgets (Yu et al., 4 Jun 2025).
- STRING, LieRE: Arbitrarily complex generator families require at worst $O(d^3)$ per token for a naïve matrix exponential, but efficient reductions (sparse/FFT/Cayley basis) bring the cost down to near the block-diagonal case (Schenck et al., 4 Feb 2025, Liu et al., 7 Apr 2025).
- Spiral RoPE: Parameter-free and FLOPs-equivalent to block-diagonal RoPE (Liu et al., 3 Feb 2026).
No variant above materially increases attention memory footprint relative to absolute position embeddings or $O(N^2)$ RPE tables.
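The linear-cost claim can be illustrated with a short sketch (grid size, head dimension, and frequency base are arbitrary): the rotation tables are precomputed once per patch grid, occupy $O(Nd)$ memory, and are applied elementwise, so no $N \times N$ positional table is ever materialized:

```python
import numpy as np

H, W, d = 14, 14, 64          # 14x14 patch grid, head dimension 64
N = H * W

# Per-axis frequencies; each axis owns d/4 rotation pairs.
freqs = 100.0 ** (-np.arange(d // 4) / (d // 4))
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
ang = np.concatenate([xs.reshape(N, 1) * freqs,    # x-axis angles
                      ys.reshape(N, 1) * freqs],   # y-axis angles
                     axis=-1)                      # (N, d/2)
cos, sin = np.cos(ang), np.sin(ang)                # O(N*d) memory in total

def apply_rope(v):
    """Elementwise O(N*d) application to a (N, d) token batch."""
    v1, v2 = v[..., 0::2], v[..., 1::2]
    return np.stack([v1 * cos - v2 * sin,
                     v1 * sin + v2 * cos], axis=-1).reshape(v.shape)
```

Because `cos`/`sin` depend only on the grid, they can be cached across layers and reused at inference, which is where the negligible-overhead property comes from.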
4. Applications and Empirical Performance
2D RoPE and its generalizations have achieved state-of-the-art or superior performance across a spectrum of modalities and tasks, particularly where translation-invariance, spatial extrapolation, and multi-scale structure are essential. Key empirical findings include:
- Image Classification (ImageNet-1K): ComRoPE-LD achieves 65.49% top-1 at 224x224 and 55.29% at 512x512, outperforming standard RoPE (+2.4% at train, +4.2% at extrapolated resolution) and LieRE (+1.6% train, +2.9% high-res) (Yu et al., 4 Jun 2025). Spiral RoPE yields consistent gains up to +0.88% top-1 (ViT-B, 384x384) over axis-aligned, and semantically crisper attention (Liu et al., 3 Feb 2026).
- Semantic Segmentation and Object Detection: Spiral RoPE (+2.21% mIoU at 512x512 compared to APE on ADE20k) and GeoPE (+0.3%–1.1% absolute) show measurable, robust improvements (Liu et al., 3 Feb 2026, Yao et al., 4 Dec 2025).
- Vision-Language and Multimodal Tasks: Both STRING-based and quaternion-based encodings enhance recall/IoU, especially in geometric and retrieval scenarios (Schenck et al., 4 Feb 2025, Yao et al., 4 Dec 2025).
- Trajectory and Agent-Centric Modeling: Directional RoPE (DRoPE) for agent heading breaks the periodicity/ambiguity of 1D RoPE and achieves state-of-the-art accuracy/efficiency trade-off in autonomous driving benchmarks, without quadratic space overhead (Zhao et al., 19 Mar 2025).
5. Geometric and Theoretical Properties
A foundational property of 2D RoPE is the strict dependence of attention scores on relative position, guaranteed by the mathematical structure: $R(p_1)^\top R(p_2) = R(p_2 - p_1)$. The underlying requirements are:
- Commutativity of Generators: Required for scalability to high-dimensional or multidimensional input; encodes the “translation invariance” at the heart of robust, resolution-agnostic vision transformers (Yu et al., 4 Jun 2025, Schenck et al., 4 Feb 2025).
- Embedding in Maximal Abelian Subalgebras (MASA) of $\mathfrak{so}(d)$: All valid 2D RoPEs correspond to choosing bases in these subalgebras; block-diagonal (axis-aligned) forms are the maximal toral case (Liu et al., 7 Apr 2025).
- Norm Preservation, Compositionality: All RoPEs (when constructed via orthogonal exponentials) preserve vector norms and admit streaming/caching by virtue of their group structure (Zhang et al., 8 Dec 2025).
- Diagonal versus Mixed-Directionality: Axis-aligned RoPE is limited to frequencies lying on principal axes; Spiral RoPE, GeoPE, and diagonal-mixed variants provide richer, isotropic frequency coverage and better boundary, shape, and objectness encodings (Liu et al., 3 Feb 2026, Yao et al., 4 Dec 2025).
6. Extensions and Recent Directions
2D RoPE has seen multiple generalizations and practical adaptations:
- Cross-View and Cross-Dimensional Extensions (URoPE): Universal RoPE lifts features to 3D with camera-depth anchors and projects back to 2D for cross-view transformers. This construction is parameter-free, SE(3)-invariant, and compatible with RoPE-optimized kernels (Xie et al., 20 Apr 2026).
- Time-and-Order RoPE (TO-RoPE): For generative recommendation, TO-RoPE jointly encodes ordinal sequence index and wall-clock time via rotary angles, supporting early fusion and split-by-dimension strategies. Empirically, this increases retrieval quality and broadens attention span over both axes (Wei et al., 23 Oct 2025).
- Quaternionic and Non-Commutative Variants: GeoPE fuses SO(3) rotations via Lie algebra averaging, ensuring symmetric treatment of spatial axes and outperforming axial approaches in tasks requiring true geometric awareness (Yao et al., 4 Dec 2025).
- N-Dimensional and Orthogonal-Mixing Frameworks: Complete theoretical unification places all RoPE variants in the context of MASA and Lie-algebra topology, enabling learned basis changes (e.g., via the Cayley transform) for greater coordination between position axes without breaking the essential commutativity/relativity guarantees (Liu et al., 7 Apr 2025).
7. Comparative Overview and Practical Integration
A comparative summary of several prominent 2D RoPE architectures is presented below:
| Method | Generator Structure | Relative Law | Param Count | Empirical Gain |
|---|---|---|---|---|
| Axis-Aligned | Block-diag commuting | Yes | 0 | Baseline; robust, limited |
| Mixed/Flexible | Learned diagonal/mixed | Yes/Partial | $O(d)$ | +1–2% on ViT, multi-res |
| ComRoPE | Trainable commuting | Yes | $O(d)$ | SOTA; most robust |
| LieRE | Learned skew-symmetric map | No | $O(d^2)$ | Flexible, can degrade |
| Spiral RoPE | Multi-directional splits | Yes | 0 | +0.54–0.88% (ImageNet) |
| GeoPE | SO(3) quaternionic mean | Yes (linear) | 0 | +0.3–1.1% (ViT, Swin) |
| DRoPE | Uniform angular block | Yes (angle) | 0 | SOTA for agent heading |
In practical deployment, 2D RoPE variants are typically integrated immediately after the query and key projections at each self-attention layer, with frequencies and axes configured, and commutativity constraints enforced, at model initialization.
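As an integration sketch, the toy single-head attention below (plain NumPy; all shapes, names, and the half-split convention are illustrative) rotates queries and keys right after their projections and before the softmax, leaving values untouched, and inherits translation-invariant outputs from the relative law:

```python
import numpy as np

def rope_2d(v, x, y, base=100.0):
    """Axis-aligned 2D RoPE; first channel half encodes x, second half y."""
    h = v.shape[-1] // 2
    out = []
    for part, pos in ((v[..., :h], x), (v[..., h:], y)):
        freqs = base ** (-np.arange(h // 2) / (h // 2))
        ang = pos[:, None] * freqs
        c, s = np.cos(ang), np.sin(ang)
        a, b = part[..., 0::2], part[..., 1::2]
        rot = np.empty_like(part)
        rot[..., 0::2] = a * c - b * s
        rot[..., 1::2] = a * s + b * c
        out.append(rot)
    return np.concatenate(out, axis=-1)

def attention_with_rope(tokens, coords, Wq, Wk, Wv):
    x, y = coords[:, 0], coords[:, 1]
    q = rope_2d(tokens @ Wq, x, y)           # rotate right after the projections
    k = rope_2d(tokens @ Wk, x, y)
    v = tokens @ Wv                          # values are left unrotated
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
N, d = 9, 16                                 # 3x3 patch grid
tokens = rng.standard_normal((N, d))
coords = np.stack(np.meshgrid(np.arange(3), np.arange(3), indexing="ij"),
                  -1).reshape(N, 2).astype(float)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention_with_rope(tokens, coords, Wq, Wk, Wv)
```

Shifting every patch coordinate by the same offset leaves the output unchanged, which is the translation-invariance property the section describes.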
Two-dimensional Rotary Position Embeddings, through rigorous Lie-theoretic foundations and empirical validation across scales and domains, have become the primary standard for robust, scalable, translation-invariant positional encoding in transformer models for vision, geometric, and multi-modal applications, with commutativity, block-diagonal structure, and relative law preservation as the defining technical features (Yu et al., 4 Jun 2025, Schenck et al., 4 Feb 2025, Liu et al., 3 Feb 2026, Liu et al., 7 Apr 2025, Yao et al., 4 Dec 2025).