
2D Rotary Positional Embedding for Transformers

Updated 3 December 2025
  • 2D RoPE is a positional encoding method that uses Lie-theoretic rotation matrices to inject relative 2D spatial information into token embeddings.
  • It features both axial and mixed-frequency constructions to enable efficient integration into Vision Transformers and robust extrapolation to unseen resolutions.
  • Empirical studies show that 2D RoPE improves accuracy in classification, segmentation, and dense prediction tasks with negligible computational overhead.

A 2D Rotary Positional Embedding (RoPE) is a positional encoding mechanism for Transformer architectures that injects relative two-dimensional location information into token representations via axis-wise or mixed-frequency rotations. Developed to generalize the block-diagonal rotation of 1D RoPE—well-established in LLMs—to vision and multimodal domains, 2D RoPE enables precise and efficient spatial encoding, systematic extrapolation to unseen resolutions, and seamless integration into multi-head self-attention modules. The construction of 2D RoPE is grounded in Lie-theoretic principles, ensures relativity and reversibility, and is supported by both empirical and mathematical research across visual transformers, multimodal systems, agent modeling, and robotic perception.

1. Theoretical Foundations and Properties

2D RoPE is formalized through the lens of Lie group and Lie algebra theory, providing a principled basis for rotational positional encoding in higher (e.g., 2D, 3D) input spaces. Two core properties are central to all valid 2D RoPE constructs:

  • Relativity: For all positions $(i_1, j_1)$ and $(i_2, j_2)$, the rotation matrices satisfy $R_{(i_1, j_1)}^\top R_{(i_2, j_2)} = R_{(i_2 - i_1,\, j_2 - j_1)}$. This ensures that the query-key attention scores depend only on relative spatial offsets, not absolute positions.
  • Reversibility (Injectivity): $R_{(i_1, j_1)} = R_{(i_2, j_2)}$ only if $(i_1, j_1) = (i_2, j_2)$, preserving unique encodings for unique positions (Liu et al., 7 Apr 2025).

These properties require that, within the $\mathfrak{so}(4)$ Lie algebra, the generators $B_1$ and $B_2$ associated with the $x$ and $y$ axes commute and span a maximal abelian subalgebra (MASA). The canonical instantiation block-diagonalizes into two independent $2 \times 2$ rotation planes; more expressive inter-axis coupling is achieved by a learned orthogonal basis transformation $Q$, which preserves all group-theoretic properties (Liu et al., 7 Apr 2025, Ostmeier et al., 14 Jun 2024, Schenck et al., 4 Feb 2025).
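
As an illustration, the following numpy sketch (an illustrative construction, not reference code from the cited papers) builds the canonical block-diagonal generators, applies a random orthogonal basis change $Q$, and checks relativity and commutativity numerically:

```python
# Numerical check of the relativity property for a 2D RoPE built from
# commuting so(4) generators. Frequencies and Q here are arbitrary/illustrative.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
theta1, theta2 = 0.7, 0.2  # per-plane frequencies (arbitrary for this check)

# Canonical generators: B1 rotates the (0,1) plane, B2 rotates the (2,3) plane.
B1 = np.zeros((4, 4)); B1[0, 1], B1[1, 0] = -theta1, theta1
B2 = np.zeros((4, 4)); B2[2, 3], B2[3, 2] = -theta2, theta2

# Optional learned orthogonal basis change Q couples the two axes while
# preserving commutativity (the conjugated generators still commute).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B1, B2 = Q @ B1 @ Q.T, Q @ B2 @ Q.T

def R(i, j):
    """Rotation for grid position (i, j): exp(i*B1 + j*B2)."""
    return expm(i * B1 + j * B2)

# Relativity: R(p1)^T R(p2) depends only on the offset p2 - p1.
p1, p2 = (3.0, 5.0), (7.0, 2.0)
lhs = R(*p1).T @ R(*p2)
rhs = R(p2[0] - p1[0], p2[1] - p1[1])
assert np.allclose(lhs, rhs), "relativity violated"

# Commutativity of the generators is what makes the identity above hold.
assert np.allclose(B1 @ B2, B2 @ B1)
print("relativity and commutativity verified")
```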

2. Mathematical Formulation

Let a query/key embedding have head dimension $d$, divisible by four. A typical 2D RoPE constructs the positional rotation matrix as follows:

  • Axial 2D RoPE (Block-Diagonal, “Pure”): Divide the $d/2$ complex channels between $x$ and $y$. For $t = 0, \dotsc, d/4 - 1$,

$$R_{\mathrm{axial}}(n)_{2t} = e^{i \theta_t p^x_n}, \qquad R_{\mathrm{axial}}(n)_{2t+1} = e^{i \theta_t p^y_n}$$

with frequencies $\theta_t = 100^{-t/(d/4)}$ (Heo et al., 20 Mar 2024, Liu et al., 7 Apr 2025).

  • Mixed-Frequency 2D RoPE (RoPE-Mixed): Introduce learnable per-head frequency vectors $\theta^x, \theta^y \in \mathbb{R}^{d/2}$,

$$R_{\mathrm{mixed}}(n)_t = \exp\!\left[i\left(\theta_t^x p_n^x + \theta_t^y p_n^y\right)\right]$$

This variant enables encoding of all possible offset directions (including diagonals) (Heo et al., 20 Mar 2024).

Both variations ultimately perform an elementwise complex (or real-valued) multiplication of each query/key embedding with the rotation for its $(x, y)$ location, transforming $q_n$, $k_m$ into $q'_n = q_n \circ R(n)$, $k'_m = k_m \circ R(m)$, and computing attention scores via $\operatorname{Re}\left[q'_n\, k'^{*}_m\right]$.
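
A minimal PyTorch sketch of both variants follows; the function names, shapes, and complex-valued layout are illustrative assumptions rather than reference code from the cited papers.

```python
# Sketch of axial and mixed-frequency 2D RoPE in complex form.
import torch

def axial_rope_angles(px, py, d):
    """Per-channel rotation angles for axial 2D RoPE.
    px, py: (N,) token coordinates; d: head dim (divisible by 4).
    Returns (N, d//2) angles: even channels rotate with x, odd channels with y."""
    t = torch.arange(d // 4, dtype=torch.float32)
    theta = 100.0 ** (-t / (d // 4))                     # frequencies theta_t
    ang = torch.empty(px.shape[0], d // 2)
    ang[:, 0::2] = px[:, None] * theta                   # x-axis planes
    ang[:, 1::2] = py[:, None] * theta                   # y-axis planes
    return ang

def mixed_rope_angles(px, py, theta_x, theta_y):
    """RoPE-Mixed: learnable frequency vectors theta_x, theta_y of shape (d//2,)."""
    return px[:, None] * theta_x + py[:, None] * theta_y

def apply_rope(x, ang):
    """Rotate the (..., d) embedding x channelwise by e^{i*ang} (complex mult)."""
    xc = torch.view_as_complex(x.float().contiguous().reshape(*x.shape[:-1], -1, 2))
    return xc * torch.polar(torch.ones_like(ang), ang)

def rope_attention_scores(q, k, ang_q, ang_k):
    """Attention logits Re[q'_n conj(k'_m)] summed over complex channels."""
    qc, kc = apply_rope(q, ang_q), apply_rope(k, ang_k)
    return (qc @ kc.conj().transpose(-1, -2)).real

# Example: two tokens at grid positions (0, 0) and (3, 5), head dim 64.
d = 64
px, py = torch.tensor([0., 3.]), torch.tensor([0., 5.])
q, k = torch.randn(2, d), torch.randn(2, d)
ang = axial_rope_angles(px, py, d)
scores = rope_attention_scores(q, k, ang, ang)           # (2, 2) logits
```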

In the Lie-theoretic formalism, the most general 2D RoPE is expressed as

$$R_{(i, j)} = \exp(i B_1 + j B_2), \qquad [B_1, B_2] = 0$$

where $B_1$ and $B_2$ are commuting $4 \times 4$ skew-symmetric generators, potentially transformed as $Q\,\mathrm{diag}(J(i\theta), 0)\,Q^\top$ and $Q\,\mathrm{diag}(0, J(j\theta))\,Q^\top$ to model axis interaction (Liu et al., 7 Apr 2025, Schenck et al., 4 Feb 2025).
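
The commutation constraint is precisely what yields the relativity property: because $B_1$ and $B_2$ are skew-symmetric and commute, one line of algebra gives

$$R_{(i_1, j_1)}^\top R_{(i_2, j_2)} = \exp\!\bigl(-(i_1 B_1 + j_1 B_2)\bigr)\exp\!\bigl(i_2 B_1 + j_2 B_2\bigr) = \exp\!\bigl((i_2 - i_1) B_1 + (j_2 - j_1) B_2\bigr) = R_{(i_2 - i_1,\, j_2 - j_1)},$$

where the first step uses $\exp(B)^\top = \exp(-B)$ for skew-symmetric $B$ and the second merges the exponentials, which is valid only because $[B_1, B_2] = 0$.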

The infinitesimal generators can be arranged as block-diagonal matrices, allowing arbitrary allocation of axes, blocks, and frequencies across the embedding dimensions (Ostmeier et al., 14 Jun 2024, Schenck et al., 4 Feb 2025).

3. Integration in Vision Transformer Architectures

2D RoPE directly replaces or augments the positional encoding step in Vision Transformers (ViT), Swin Transformers, and related image or spatially-structured models:

  • Indexing: Tokens are assigned 2D coordinates $(i, j)$ on a grid (typically flattened in row-major order, $n = iW + j$).
  • Embeddings: Each token's query/key is rotated according to its $(i, j)$ position and the chosen frequencies.
  • Attention: Attention computation proceeds using dot products of the rotated queries and keys, guaranteeing that attention weights are modulated by relative spatial offsets (Heo et al., 20 Mar 2024, Hsu et al., 11 May 2025).

In practical implementations, rotation matrices are cached or constructed per head/layer/resolution, and the per-token cost is negligible ($<0.1\%$ extra FLOPs, $<0.01\%$ additional parameters for ViT-S/B). The RoPE module does not mandate changes to the backbone training pipeline and is compatible with dense prediction, segmentation, detection, and windowed attention (Heo et al., 20 Mar 2024, Hsu et al., 11 May 2025).
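
The following sketch shows one way to drop a RoPE-Mixed-style rotation into a ViT attention block; the module name, initialization, and layout are assumptions for illustration, not the implementation of any cited paper.

```python
# Hypothetical ViT attention block with 2D RoPE applied to queries and keys.
import torch
import torch.nn as nn

class RoPE2DAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head frequency vectors (RoPE-Mixed style).
        self.theta_x = nn.Parameter(torch.randn(num_heads, self.head_dim // 2) * 0.02)
        self.theta_y = nn.Parameter(torch.randn(num_heads, self.head_dim // 2) * 0.02)

    def rotations(self, H, W, device):
        # Grid coordinates in row-major order, n = i*W + j.
        i = torch.arange(H, device=device).repeat_interleave(W).float()
        j = torch.arange(W, device=device).repeat(H).float()
        ang = i[None, :, None] * self.theta_x[:, None, :] \
            + j[None, :, None] * self.theta_y[:, None, :]   # (heads, N, head_dim//2)
        return torch.polar(torch.ones_like(ang), ang)        # complex e^{i*ang}

    def forward(self, x, H, W):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each (B, heads, N, head_dim)
        rot = self.rotations(H, W, x.device)

        def rotate(t):
            t = t.float().contiguous()
            tc = torch.view_as_complex(t.reshape(*t.shape[:-1], -1, 2))
            return torch.view_as_real(tc * rot).flatten(-2)  # back to real layout

        q, k = rotate(q), rotate(k)                          # values are left unrotated
        attn = (q @ k.transpose(-1, -2)) * self.head_dim ** -0.5
        out = attn.softmax(-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```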

4. Generalizations and Modal Extensions

The explicit Lie-algebraic construction admits further generalization and flexible adaptation:

  • STRING: Provides universal, separable, translationally invariant PEs for $d_c$-dimensional coordinates, showing that all block-diagonal RoPEs are special cases of exponential maps generated by commuting skew-symmetric matrices (Schenck et al., 4 Feb 2025).
  • LieRE: Parametrizes and learns more general (potentially non-separable) 2D/3D rotations, achieving higher capacity and extension to arbitrary input dimensionality; block structure is employed for computational efficiency (Ostmeier et al., 14 Jun 2024).
  • Directional RoPE (DRoPE): Extends 2D RoPE to encode agent heading via a $2\pi$-periodic block-diagonal rotation, crucial for maintaining invariance under angular wrap-around in trajectory modeling (Zhao et al., 19 Mar 2025).
  • Spherical RoPE: Adapts RoPE to spherical coordinates by constructing $3 \times 3$ blocks corresponding to latitude and longitude, avoiding frequency scaling as angles are taken in natural units (radians), with utility in geographic transformer architectures ("geotokens") (Unlu, 23 Mar 2024).
  • VideoRoPE: Further extends the block-diagonal concept to 3D (spatio-temporal) settings, carefully allocating frequency bands to spatial and temporal axes, introducing diagonal layout and adjustable temporal spacing. The 2D special case remains directly compatible for image-only models (Wei et al., 7 Feb 2025).
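
To make the multi-axis generalization concrete, a minimal sketch of the naive three-axis split is shown below; this simply divides the complex channels evenly across (t, x, y) and is not the specific frequency allocation, diagonal layout, or temporal spacing used by VideoRoPE.

```python
# Naive 3-axis extension of the axial construction (illustrative only).
import torch

def axial_rope3d_angles(pt, px, py, d):
    """pt, px, py: (N,) coordinates; d: head dim, assumed divisible by 6 here.
    Returns (N, d//2) angles, usable with the apply_rope helper sketched earlier."""
    n_planes = d // 6                                    # complex planes per axis
    t = torch.arange(n_planes, dtype=torch.float32)
    theta = 100.0 ** (-t / n_planes)
    ang = torch.cat([pt[:, None] * theta,                # temporal planes
                     px[:, None] * theta,                # horizontal planes
                     py[:, None] * theta], dim=-1)       # vertical planes
    return ang
```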

5. Empirical Performance and Ablation Evidence

Quantitative evaluations establish that 2D RoPE consistently outperforms absolute sine/cosine PEs, learnable absolute PEs, and additive/relative bias in diverse vision and dense prediction tasks:

| Model / Task | Baseline PE | RoPE-Mixed | Gain |
| --- | --- | --- | --- |
| ViT-B/224, ImageNet-1k top-1 | APE = 83.4% | 83.8% | +0.4 pp |
| Swin-B/224, ImageNet-1k top-1 | RPB = 83.3% | 83.7% | +0.4 pp |
| COCO AP, ViT-B | APE = 49.4 | 51.2 | +1.8 pp |
| ADE20k mIoU, ViT-B | APE = 47.7 | 49.6 | +1.9 pp |
| ADE20k mIoU, Swin-S | RPB = 50.2 | 51.1 | +0.9 pp |

These improvements persist under distribution shift and unseen input resolutions due to RoPE's translation and extrapolation guarantees (Heo et al., 20 Mar 2024). Empirical results from the GOOSE segmentation challenge show a $+0.71$ pp mIoU improvement due to RoPE alone (Hsu et al., 11 May 2025). Ablations indicate that RoPE-Mixed (with learnable frequency vectors) offers systematic advantages over both axial variants and standard additive biases (Heo et al., 20 Mar 2024, Ostmeier et al., 14 Jun 2024).

Gains in data and compute efficiency are also observed: LieRE reaches the accuracy of absolute-PE baselines in approximately $3.5\times$ fewer training steps, and the marginal computational overhead per forward pass remains negligible in all tested settings (Ostmeier et al., 14 Jun 2024).

6. Practical Recommendations and Implementation Trade-offs

Best practices supported by systematic ablation studies include:

  • Default to RoPE-Mixed, as it delivers the strongest classification, detection, and segmentation performance across multiple ViT and Swin backbones (Heo et al., 20 Mar 2024).
  • Store per-head, per-layer frequency vectors and recompute the rotation matrices only when the grid shape changes; this amortizes cost (see the caching sketch after this list).
  • Optionally add an absolute PE (APE) when the primary deployment regime is close to the training distribution, at the cost of some out-of-distribution extrapolation.
  • For sequence lengths or resolutions well outside the training regime, leverage RoPE's inherent extrapolation capability without retraining or fine-tuning (Heo et al., 20 Mar 2024).
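
A minimal inference-time sketch of the caching recommendation above, with hypothetical global frequency tensors; during training the table is a cheap function of the learnable frequencies and would typically be recomputed rather than cached.

```python
# Rotation tables rebuilt only when the token-grid shape changes.
from functools import lru_cache
import torch

THETA_X = torch.randn(12, 32) * 0.02   # per-head frequency vectors: 12 heads,
THETA_Y = torch.randn(12, 32) * 0.02   # head_dim // 2 = 32 (illustrative shapes)

@lru_cache(maxsize=8)                  # keep tables for a few recent resolutions
def rope_table(H, W):
    i = torch.arange(H).repeat_interleave(W).float()
    j = torch.arange(W).repeat(H).float()
    ang = i[None, :, None] * THETA_X[:, None, :] + j[None, :, None] * THETA_Y[:, None, :]
    return torch.polar(torch.ones_like(ang), ang)   # (heads, H*W, head_dim//2)

rot_224 = rope_table(14, 14)   # e.g. 224px input with 16px patches
rot_384 = rope_table(24, 24)   # new resolution: computed once, then reused
```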

7. Theoretical Guarantees and Universality

The mathematical justification for RoPE's design space is established via universality results: every continuously differentiable, translationally invariant positional encoding into an orthogonal group arises as

$$R(r) = \exp\!\left(\sum_{k=1}^{d_c} L_k\, [r]_k\right)$$

for commuting skew-symmetric $L_k$. The block-diagonal construction of 2D RoPE is thus not only computationally optimal but also maximally expressive among translationally invariant position encodings under reasonable smoothness conditions. This underpins the robustness, flexibility, and generality observed empirically and makes RoPE an extensible foundation for further research in high-dimensional and structured positional encoding (Schenck et al., 4 Feb 2025, Liu et al., 7 Apr 2025).


