RoPE-2D: 2D Rotary Positional Encoding
- RoPE-2D is a rotary positional encoding method that generalizes 1D embeddings to two-dimensional domains using a rigorous Lie-theoretic framework.
- It encodes absolute 2D coordinates as orthogonal rotations, ensuring that attention relies solely on relative spatial offsets.
- Its efficient, parameter-free integration into Transformer architectures boosts vision task performance, as shown in semantic segmentation benchmarks.
RoPE-2D is a generalization of Rotary Position Embedding (RoPE) to two-dimensional spatial domains, formulated within a rigorous Lie-theoretic framework and implemented as an efficient, relative, and reversible positional encoding for Transformer attention mechanisms. RoPE-2D encodes absolute 2D coordinates (for instance, on image grids or spatial patches) as orthogonal transformations of input vectors, ensuring that attention dot-products depend only on relative spatial offsets. This encoding is a key component in modern 2D vision architectures, enabling strong generalization and parameter efficiency, and supports extensions to inter-dimensional interactions and higher dimensions (Liu et al., 7 Apr 2025, Hsu et al., 11 May 2025, Reid et al., 26 Sep 2025).
1. Lie-theoretic and Algebraic Basis
RoPE-2D is grounded in the theory of Lie groups and Lie algebras, specifically leveraging properties of the special orthogonal group $SO(d)$. For each absolute 2D coordinate $(x, y) \in \mathbb{R}^2$, RoPE-2D assigns an orthogonal matrix $R(x, y) \in SO(d)$ that varies continuously with $(x, y)$. The construction is:

$$R(x, y) = \exp\left(x B_1 + y B_2\right),$$

where $B_1$ and $B_2$ are commuting, linearly independent, skew-symmetric generators in $\mathfrak{so}(d)$. The choice of $\{B_1, B_2\}$ within a maximal abelian subalgebra (MASA) ensures that RoPE-2D possesses two essential properties:
- Relativity: $R(x_1, y_1)^{\top} R(x_2, y_2) = R(x_2 - x_1,\, y_2 - y_1)$: the dot-product depends only on relative position.
- Reversibility: The map $(x, y) \mapsto R(x, y)$ is (locally) injective modulo finite periodicity.
For the minimal dimension $d = 4$, $B_1$ and $B_2$ naturally span two independent 2-planes, each associated with one spatial axis (Liu et al., 7 Apr 2025).
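As a concrete check, the following minimal sketch (illustrative, not code from the cited papers; assumes NumPy and SciPy) builds the canonical $d = 4$ generators and verifies the relativity identity numerically:

```python
import numpy as np
from scipy.linalg import expm

# Canonical commuting skew-symmetric generators in so(4):
# each one rotates a single 2-plane.
J = np.array([[0.0, -1.0], [1.0, 0.0]])
Z = np.zeros((2, 2))
B1 = np.block([[J, Z], [Z, Z]])
B2 = np.block([[Z, Z], [Z, J]])

def R(x, y):
    """Orthogonal encoding of the absolute coordinate (x, y)."""
    return expm(x * B1 + y * B2)

x1, y1, x2, y2 = 0.3, -1.2, 2.0, 0.7
lhs = R(x1, y1).T @ R(x2, y2)      # the factor appearing in Q/K dot-products
rhs = R(x2 - x1, y2 - y1)          # depends only on the relative offset
assert np.allclose(lhs, rhs)       # relativity holds because [B1, B2] = 0
```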
2. Explicit Block-Diagonal and Complex Representations
In the canonical case $d = 4$, the generators are chosen as:

$$B_1 = \begin{pmatrix} J & 0_2 \\ 0_2 & 0_2 \end{pmatrix}, \qquad B_2 = \begin{pmatrix} 0_2 & 0_2 \\ 0_2 & J \end{pmatrix},$$

where $J = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ and $0_2$ is the $2 \times 2$ zero matrix. This yields a block-diagonal rotation:

$$R(x, y) = \begin{pmatrix} \cos x & -\sin x & 0 & 0 \\ \sin x & \cos x & 0 & 0 \\ 0 & 0 & \cos y & -\sin y \\ 0 & 0 & \sin y & \cos y \end{pmatrix}.$$

This can be equivalently implemented using complex pairs, treating the 4-dimensional vector $(v_1, v_2, v_3, v_4)$ as two complex numbers $z_1 = v_1 + i v_2$ and $z_2 = v_3 + i v_4$ and applying $e^{ix}$ and $e^{iy}$ to each, respectively (Liu et al., 7 Apr 2025, Hsu et al., 11 May 2025).
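A minimal sketch of the equivalence between the block-diagonal matrix form and the complex-pair form for $d = 4$ (function names here are illustrative, not from the cited implementations):

```python
import numpy as np

def rot2(phi):
    """2x2 rotation by angle phi."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s], [s, c]])

def rope2d_matrix(v, x, y):
    """Block-diagonal form: angle x on the first 2-plane, y on the second."""
    R = np.zeros((4, 4))
    R[:2, :2] = rot2(x)
    R[2:, 2:] = rot2(y)
    return R @ v

def rope2d_complex(v, x, y):
    """Equivalent complex form: two complex pairs rotated by e^{ix}, e^{iy}."""
    z1 = complex(v[0], v[1]) * np.exp(1j * x)
    z2 = complex(v[2], v[3]) * np.exp(1j * y)
    return np.array([z1.real, z1.imag, z2.real, z2.imag])

v = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(rope2d_matrix(v, 0.5, -0.25), rope2d_complex(v, 0.5, -0.25))
```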
3. General 2D Rotary Embedding Construction and Practical Formulation
RoPE-2D extends sinusoidal rotation-based encoding to both axes of a spatial grid (e.g., patch grids in vision). Given model dimension $d$ (divisible by 4), two sets of base frequencies are defined, one per spatial axis:

$$\theta^{(x)}_k = \theta^{(y)}_k = b^{-4k/d}, \qquad k = 0, 1, \dots, d/4 - 1,$$

with a shared base $b$ (commonly $10000$). For a patch at row $i$ and column $j$, the rotation angles per channel are $\alpha_k = i\,\theta^{(y)}_k$ and $\beta_k = j\,\theta^{(x)}_k$, and the embedding is constructed as a sequence of $2 \times 2$ block rotations:

$$R(i, j) = \operatorname{diag}\big(R_2(\alpha_0), R_2(\beta_0), \dots, R_2(\alpha_{d/4-1}), R_2(\beta_{d/4-1})\big), \qquad R_2(\varphi) = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}.$$

In closed form, this reduces to a single rotation angle $\varphi_k \in \{\alpha_k, \beta_k\}$ per embedding pair, applied elementwise via precomputed $\cos$/$\sin$ tables (Hsu et al., 11 May 2025).
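The per-channel construction can be sketched as follows for general $d$, assuming the standard RoPE base $b = 10000$ and an interleaved row/column channel layout (both are assumptions; the cited implementations may order channels differently):

```python
import numpy as np

def rope2d_angles(i, j, d, base=10000.0):
    """Per-pair rotation angles for a patch at row i, column j (d % 4 == 0)."""
    k = np.arange(d // 4)
    theta = base ** (-4.0 * k / d)      # shared base frequencies
    alpha = i * theta                   # row-axis angles
    beta = j * theta                    # column-axis angles
    # Interleave so consecutive pairs alternate between the two axes.
    return np.stack([alpha, beta], axis=1).reshape(-1)  # shape (d // 2,)

def apply_rope2d(v, i, j, base=10000.0):
    """Closed form: rotate each consecutive pair of v by its channel angle."""
    phi = rope2d_angles(i, j, v.shape[-1], base)
    v2 = v.reshape(-1, 2)
    c, s = np.cos(phi)[:, None], np.sin(phi)[:, None]
    rotated = np.concatenate(
        [v2[:, :1] * c - v2[:, 1:] * s,   # cos * v0 - sin * v1
         v2[:, :1] * s + v2[:, 1:] * c],  # sin * v0 + cos * v1
        axis=1)
    return rotated.reshape(v.shape)
```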
4. Integration into Transformer Architectures
RoPE-2D is typically incorporated in the computation of self-attention within transformers operating over 2D grids (such as Swin Transformers). For each patch or token, after query/key projections, the RoPE-2D rotation is applied to query and key vectors before the dot-product attention computation:
- Partition the $d$-dimensional features into $d/2$ 2D blocks.
- For each block, compute the rotation angle $\varphi_k$ using the patch row and column indices.
- Rotate each query/key block by the corresponding angle using a rotation matrix.
- Compute standard dot-product attention with the rotated queries and keys.
This approach introduces no additional parameters and negligible computational overhead ($O(d)$ per token for image grids), with efficient batch implementation via elementwise $\cos$/$\sin$ routines (Liu et al., 7 Apr 2025, Hsu et al., 11 May 2025).
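Putting the steps together, a minimal single-head sketch of RoPE-2D attention (reusing `apply_rope2d` from the previous sketch; an illustration of the recipe above, not the cited works' exact code):

```python
import numpy as np
# Assumes apply_rope2d from the previous sketch is in scope.

def rope2d_attention(Q, K, V, rows, cols):
    """Q, K, V: (n_tokens, d); rows, cols: (n_tokens,) patch grid indices."""
    d = Q.shape[-1]
    Qr = np.stack([apply_rope2d(q, i, j) for q, i, j in zip(Q, rows, cols)])
    Kr = np.stack([apply_rope2d(k, i, j) for k, i, j in zip(K, rows, cols)])
    logits = Qr @ Kr.T / np.sqrt(d)     # depends only on relative offsets
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ V

# Usage on a 2x2 patch grid with d = 8:
n, d = 4, 8
rows, cols = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = rope2d_attention(Q, K, V, rows, cols)   # shape (4, 8)
```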
5. Extensions: Inter-dimensional Mixing and Non-Euclidean Domains
RoPE-2D supports generalization in two significant directions:
- Inter-dimensional Interactions: By conjugating the block-diagonal generators with a learned orthogonal matrix $Q$, inter-dimensional mixing can be performed while preserving relativity and reversibility. The new generators $\tilde{B}_i = Q B_i Q^{\top}$ yield rotated encodings that support diagonal or more complex axis interactions. Parameterizations of $Q$ (via Cayley transforms or Givens rotations) are learnable and integrated end-to-end (Liu et al., 7 Apr 2025); see the sketch after this list.
- Graph and Manifold Extensions: Wavelet-Induced Rotary Encodings (WIRE) demonstrate that RoPE-2D is a special case of a broader family of rotary encodings indexed by Laplacian spectral coordinates. On a grid graph, the WIRE reduction yields exactly standard 2D RoPE; on general graphs, arbitrary node spectral features define the rotation angles, enabling RoPE-like relative encoding on non-Euclidean domains (Reid et al., 26 Sep 2025).
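A sketch of the inter-dimensional mixing described above, assuming a Cayley-transform parameterization of $Q$ and the canonical $\mathfrak{so}(4)$ generators (illustrative choices, not the papers' exact parameterization):

```python
import numpy as np
from scipy.linalg import expm

J = np.array([[0.0, -1.0], [1.0, 0.0]])
Z = np.zeros((2, 2))
B1, B2 = np.block([[J, Z], [Z, Z]]), np.block([[Z, Z], [Z, J]])

def cayley(A):
    """Orthogonal Q = (I - A)^{-1} (I + A) for skew-symmetric A."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)

A = np.triu(np.random.randn(4, 4), k=1)
A = A - A.T                               # learnable skew-symmetric parameters
Q = cayley(A)

# Conjugated generators remain skew-symmetric and still commute,
# so relativity and reversibility are preserved under mixing.
B1t, B2t = Q @ B1 @ Q.T, Q @ B2 @ Q.T
assert np.allclose(B1t @ B2t, B2t @ B1t)
R = lambda x, y: expm(x * B1t + y * B2t)  # mixed-axis rotary encoding
assert np.allclose(R(1.0, 2.0).T @ R(3.0, 5.0), R(2.0, 3.0))
```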
6. Empirical Effectiveness and Generalization
Empirical studies confirm the efficacy of RoPE-2D for 2D vision tasks, notably semantic segmentation. For instance, on the ICRA 2025 GOOSE 2D Semantic Segmentation Challenge, insertion of RoPE-2D into a Swin backbone improved the mean Intersection-over-Union (mIoU) by 0.8 points over a MaskDINO baseline, and further gains were observed when RoPE-2D was combined with color shift correction and label denoising (Hsu et al., 11 May 2025). The approach supports stable training and does not alter convergence or loss landscapes compared to vanilla parameterizations.
RoPE-2D's core strength is its ability to inject explicit, relative 2D position into Q/K dot-products, thereby stabilizing recognition under translation, scale, and grid shifts, a property empirically verified in vision benchmarks with minimal impact on computational cost. Extrapolation to unseen image sizes follows from the relativity property: because attention depends only on relative offsets, models operate effectively on images at resolutions or crops not seen during training (Liu et al., 7 Apr 2025, Hsu et al., 11 May 2025).
7. Theoretical Guarantees and Computational Properties
The two core requirements (relativity and reversibility) uniquely specify the algebraic form of RoPE-2D. The use of block-diagonal (or complex) rotations ensures $O(d)$ FLOPs per token and a parameter-efficient implementation, without explicit matrix exponentiation in runtime code. When applied in complex or inter-dimensional form, all algebraic invariants are preserved. The parameterization extends gracefully to $n > 2$ dimensions for spatiotemporal or higher-order data (Liu et al., 7 Apr 2025).
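For intuition, a hedged sketch of one possible $n$-dimensional channel allocation (one block of rotation pairs per coordinate axis; this layout is an assumption for illustration, not a layout specified by the papers):

```python
import numpy as np

def ropend_angles(coords, d, base=10000.0):
    """coords: length-n absolute position; returns (d // 2,) pair angles."""
    n = len(coords)
    pairs_per_axis = d // (2 * n)         # requires d % (2 * n) == 0
    k = np.arange(pairs_per_axis)
    theta = base ** (-2.0 * n * k / d)    # shared frequency schedule
    return np.concatenate([c * theta for c in coords])

# Spatiotemporal example: (t, y, x) position with d = 12 -> 2 pairs per axis.
phi = ropend_angles((2.0, 5.0, 7.0), d=12)
```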
A summary table delineating RoPE-2D characteristics:
| Property | Guarantee | Source |
|---|---|---|
| Relativity | Dot-product encodes only relative positions | (Liu et al., 7 Apr 2025) |
| Reversibility | (Local) injectivity with linearly independent generators | (Liu et al., 7 Apr 2025) |
| Efficiency | $O(d)$ cost per token and head | (Liu et al., 7 Apr 2025) |
| Parameter count | No extra parameters, except the optional learned mixing matrix $Q$ | (Liu et al., 7 Apr 2025) |
| Empirical effect | Robust performance gains in 2D segmentation, stable training | (Hsu et al., 11 May 2025) |
RoPE-2D thus establishes a mathematically principled, empirically robust, and computationally practical positional encoding paradigm for spatial transformers, with generalization capabilities extending to both structured (grids) and non-Euclidean (graphs) domains.