2D-RoPE: Rotational Encoding for 2D Data
- 2D-RoPE is a positional encoding method that generalizes rotary encoding to two-dimensional and spherical data by applying block-diagonal rotation matrices to token embeddings.
- It leverages axial decomposition and compositional rotations to precisely encode relative positions, ensuring the attention mechanism captures spatial relationships with minimal computational overhead.
- Empirical studies show that 2D-RoPE improves performance in vision, geospatial modeling, and agent-based tasks, while variants like Spiral RoPE and GeoPE further enhance directional and geometric expressiveness.
2D Rotary Positional Embeddings (2D-RoPE) generalize the rotary positional encoding paradigm from one-dimensional token sequences to inputs with intrinsic two-dimensional or even spherical geometry. This extension enables Transformers to represent relative positions or angular separations directly within their key/query projections, thereby aligning the neural attention mechanism with the spatial or geospatial structure of input data such as images, spatial agent states, or geo-located observations.
1. Mathematical Foundations and Construction
The canonical 2D-RoPE formulates positional encoding as a series of axis-aligned and/or jointly parameterized planar rotations in embedding space. For vision or spatial tasks, the most common implementation assigns each input patch, agent, or geotoken a coordinate $(x, y)$ (Cartesian grid or, for geospatial data, latitude $\phi$ and longitude $\lambda$). The token embedding is partitioned into $2$ (or, in general, one per coordinate axis) equal subspaces, each of which is operated on by a position-dependent block-diagonal rotation matrix.
- Axial decomposition (Cartesian grids): For even $d$, the embedding is split into an $x$-half and a $y$-half. Each half is further subdivided into 2D pairs, and for patch coordinates $(x, y)$, each subvector pair $(q_{2k}, q_{2k+1})$ is rotated by angle $x\theta_k$ or $y\theta_k$, with frequency schedule $\theta_k = 10000^{-2k/(d/2)}$:
$$
\begin{pmatrix} q'_{2k} \\ q'_{2k+1} \end{pmatrix}
= \begin{pmatrix} \cos(x\theta_k) & -\sin(x\theta_k) \\ \sin(x\theta_k) & \cos(x\theta_k) \end{pmatrix}
\begin{pmatrix} q_{2k} \\ q_{2k+1} \end{pmatrix},
$$
and similarly for the $y$-half (Heo et al., 2024, Zivanovic et al., 26 May 2025, Ostmeier et al., 2024).
- Composition: The full rotation is the product $R_{2D}(x, y) = R_x(x)\,R_y(y)$, leveraging the fact that these block-diagonal operators commute.
- Spherical/geographic (Geotransformers): For spherical data, each geotoken position $(\phi, \lambda)$ is mapped by composition of a latitude $y$-axis tilt and longitude $z$-axis sweep:
$$
R(\phi, \lambda) = R_z(\lambda)\, R_y(\phi),
$$
where $R_y$ and $R_z$ are $3 \times 3$ rotation matrices as per SO(3) conventions; $d$-dimensional embeddings are formed by block-diagonal stacking over $d/3$ blocks (Unlu, 2024).
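This spherical construction can be sketched in NumPy as follows. The helper names and the $R_z(\lambda)\,R_y(\phi)$ ordering are illustrative assumptions based on the composition above, not the reference Geotransformer implementation:

```python
import numpy as np

def rot_y(phi):
    """3x3 rotation about the y-axis (latitude tilt)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rot_z(lam):
    """3x3 rotation about the z-axis (longitude sweep)."""
    c, s = np.cos(lam), np.sin(lam)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def spherical_rope(x, phi, lam):
    """Rotate each consecutive 3D block of embedding x by R_z(lam) @ R_y(phi)."""
    assert x.shape[-1] % 3 == 0, "embedding dim must be divisible by 3"
    R = rot_z(lam) @ rot_y(phi)
    blocks = x.reshape(-1, 3)           # view as (d/3, 3) blocks
    return (blocks @ R.T).reshape(x.shape)
```

Because the same orthogonal $3 \times 3$ block is stacked block-diagonally, the transform preserves embedding norms, as any rotary encoding must.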
2. Implementation in Transformer Architectures
Integrating 2D-RoPE requires minimal architectural change. At each layer, token representations are projected to queries and keys. Rotary position encoding is applied by multiplying these vectors by the appropriate block-diagonal rotation as determined by the token's coordinates.
- Pseudocode (Canonical Linear Grid):
- Split each embedding into axis-wise subspaces.
- For each token, precompute sine/cosine phases for each axis and frequency.
- Apply blockwise rotations to each subvector pair as dictated by its positional scalar.
- Use the rotated Q/K in the attention computation: $\mathrm{softmax}\!\left(Q'K'^\top/\sqrt{d}\right)V$.
- Spherical RoPE (Geotransformer): Each $d$-vector is split into $3$-dimensional blocks, each transformed by $R(\phi, \lambda)$ (Unlu, 2024).
- Continuous and Non-Integral Coordinates: 2D-RoPE extends directly to continuous-valued positions, essential for applications in irregular grids or agent-based modeling (Zivanovic et al., 26 May 2025, Zhao et al., 19 Mar 2025).
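The grid pseudocode above can be sketched concretely in NumPy. The function name `axial_rope_2d` and the base-10000 frequency schedule are illustrative assumptions (mirroring the 1D RoPE convention), not a reference implementation from any of the cited works:

```python
import numpy as np

def axial_rope_2d(q, x, y, base=10000.0):
    """Apply axial 2D-RoPE to a (d,) query/key vector at patch coords (x, y).

    The first half of the embedding encodes x, the second half encodes y;
    each half is rotated pairwise with a standard RoPE frequency schedule.
    Coordinates may be continuous (non-integral), as in irregular grids.
    """
    d = q.shape[-1]
    assert d % 4 == 0, "need d divisible by 4 (two halves of 2D pairs)"
    half = d // 2
    k = np.arange(half // 2)
    theta = base ** (-2.0 * k / half)        # per-pair frequency schedule

    out = np.empty_like(q, dtype=float)
    for pos, start in ((x, 0), (y, half)):   # x-half, then y-half
        ang = pos * theta
        c, s = np.cos(ang), np.sin(ang)
        even = q[start:start + half:2]
        odd = q[start + 1:start + half:2]
        out[start:start + half:2] = c * even - s * odd
        out[start + 1:start + half:2] = s * even + c * odd
    return out
```

Both queries and keys are passed through the same function with their respective coordinates before computing attention scores; position $(0, 0)$ leaves the vector unchanged.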
3. Theoretical Properties and Relative Position Encoding
2D-RoPE ensures that, due to the group-structure of rotation matrices, the attention dot-product depends only on relative positional differences:
$Q'_i^\top K'_j = Q_i^\top R_{2D}(x_i, y_i)^T R_{2D}(x_j, y_j) K_j$
Given the commutative structure, $R_{2D}(x_i, y_i)^\top R_{2D}(x_j, y_j) = R_{2D}(x_j - x_i,\, y_j - y_i)$, guaranteeing strict relative positional encoding.
For spherical data, the inner product's dependence on geodesic separation follows from the orthonormal property of $R(\phi, \lambda)$ (Unlu, 2024).
- No Parameter/Memory Blowup: The per-token overhead is $O(d)$; total memory grows linearly in sequence length, matching the vanilla Transformer.
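The relative-position identity can be verified numerically on a toy 4-dimensional embedding (one $x$-pair and one $y$-pair; the frequencies `fx` and `fy` are arbitrary illustrative choices):

```python
import numpy as np

def rot(theta):
    """2x2 planar rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def R2d(x, y, fx=1.0, fy=0.3):
    """Block-diagonal 2D-RoPE rotation: one x-pair, one y-pair."""
    R = np.zeros((4, 4))
    R[:2, :2] = rot(x * fx)
    R[2:, 2:] = rot(y * fy)
    return R

qi, kj = np.random.default_rng(0).normal(size=(2, 4))
xi, yi, xj, yj = 1.5, -2.0, 0.25, 3.0

# (R q_i) . (R k_j) equals q_i . R(relative offset) k_j
lhs = (R2d(xi, yi) @ qi) @ (R2d(xj, yj) @ kj)
rhs = qi @ (R2d(xj - xi, yj - yi) @ kj)
assert np.allclose(lhs, rhs)
```

The check works for arbitrary (including continuous) coordinates, since $R(a)^\top R(b) = R(b - a)$ holds for each commuting planar block.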
4. Extensions and Variants
2D-RoPE admits several generalizations to enhance geometric fidelity or directional expressiveness:
| Variant | Embedding Block | Geometry Captured |
|---|---|---|
| Axial RoPE | $2 \times 2$ planar rotations (per axis) | Axis-aligned displacements |
| RoPE-Mixed | $2 \times 2$ planar rotations (per-axis, learnable rates) | Oblique directions, via learnable frequency pairs (Heo et al., 2024) |
| Spiral RoPE | Multi-directional $2 \times 2$ rotations | Oblique directions, via G-way directional split and projection (Liu et al., 3 Feb 2026) |
| GeoPE | $4 \times 4$ quaternionic | Symmetric Euclidean 2D rotations (commutative, shape-aware) (Yao et al., 4 Dec 2025) |
| SO(3) RoPE | $3 \times 3$ (spherical) | Spherical geometry (e.g., Earth's surface) (Unlu, 2024) |
| DRoPE | $2 \times 2$ rotation by heading angle | Periodic angular information for headings (Zhao et al., 19 Mar 2025) |
| LieRE | Higher-dimensional $\mathfrak{so}(n)$ blocks | Arbitrary dimension/algebraic coupling (Ostmeier et al., 2024) |
Spiral RoPE partitions the embedding into $G$ groups, each encoding displacements along a direction uniformly sampled on the circle, thus covering all spatial Fourier directions and resolving the axis-alignment limitation of plain 2D-RoPE. GeoPE constructs a symmetric rotation in quaternion space, using the Lie-algebraic mean to ensure isotropy with respect to height/width, eliminating sequential proximity artifacts (Yao et al., 4 Dec 2025).
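The directional split underlying Spiral RoPE can be sketched as follows. The helper name, group count, and frequency schedule are illustrative assumptions; directions are sampled over the half circle, since opposite directions differ only by a sign of the phase:

```python
import numpy as np

def spiral_phases(x, y, num_groups=8, num_freqs=4, base=10000.0):
    """Rotation phases for a G-way directional split of 2D-RoPE.

    Projects the coordinate (x, y) onto G unit directions uniformly
    sampled on the half circle, then applies a RoPE-style frequency
    schedule within each directional group.
    """
    g = np.arange(num_groups)
    dirs = np.stack([np.cos(np.pi * g / num_groups),
                     np.sin(np.pi * g / num_groups)], axis=-1)  # (G, 2)
    proj = dirs @ np.array([x, y])       # scalar position along each direction
    freqs = base ** (-np.arange(num_freqs) / num_freqs)
    return proj[:, None] * freqs[None, :]  # (G, F) rotation angles
```

Each of the resulting $G \times F$ phases drives one $2 \times 2$ planar rotation of an embedding pair, exactly as in the axial scheme, but now oblique displacements also produce distinct phase patterns.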
5. Empirical Evaluation and Application Domains
2D-RoPE is widely adopted in computer vision, geospatial modeling, time-series, and agent interaction tasks. Standard axial 2D-RoPE consistently outperforms 1D RoPE or absolute positional encoding in image classification (ImageNet-1k), object detection (COCO), and segmentation (ADE20K), with observed gains of 1–2% top-1 accuracy and 1–2 mIoU or AP points (Heo et al., 2024, Liu et al., 3 Feb 2026, Yao et al., 4 Dec 2025). Spiral RoPE and GeoPE further enhance performance, particularly at high resolution and for tasks requiring geometric locality or orientation awareness, with additional improvements up to ~2–3% absolute (Liu et al., 3 Feb 2026, Yao et al., 4 Dec 2025).
In geospatial transformers, spherical RoPE supports predictive learning of real-world great-circle distances, with models converging 2–3× faster and to lower loss when true coordinates are encoded (Unlu, 2024). For agent-centric trajectory models, DRoPE offers competitive minADE and realism metrics without incurring quadratic memory growth (Zhao et al., 19 Mar 2025).
6. Practical Considerations, Computational Overhead, and Limitations
The computational overhead of 2D-RoPE is modest—approximately double that of 1D-RoPE, as independent or joint rotations must be computed per coordinate axis or direction. In practice, this cost is negligible in standard Transformer workloads. No intermediate pairwise tensors are required, unlike classical relative position encoding (RPE) approaches.
Accurate geometric coupling is nontrivial: naive axis decomposition cannot distinguish between spatially distant tokens on adjacent rows (a "false neighbor" effect). Geometric-coupled embeddings (GeoPE, Spiral RoPE) are superior in preserving the 2D manifold and resolving such artifacts (Yao et al., 4 Dec 2025, Liu et al., 3 Feb 2026).
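The false-neighbor effect is easy to quantify: on a raster-ordered patch grid, 1D position treats the last patch of one row and the first patch of the next as adjacent, while their true 2D separation spans the grid width. A minimal illustration (the 16-wide grid is an arbitrary choice):

```python
import numpy as np

# A 16-wide patch grid: the last patch of row 0 and the first patch of
# row 1 are adjacent in 1D raster order but 15 columns apart in 2D.
W = 16
a_1d, b_1d = 15, 16                    # raster (1D) indices
ax, ay = a_1d % W, a_1d // W           # -> (15, 0)
bx, by = b_1d % W, b_1d // W           # -> (0, 1)

dist_1d = abs(b_1d - a_1d)             # 1: looks like a neighbor to 1D RoPE
dist_2d = np.hypot(bx - ax, by - ay)   # ~15.03: actually far apart
```

Any encoding driven by the 1D index assigns these two patches nearly identical phases, which is precisely the artifact that coordinate-aware 2D schemes remove.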
Limitations include reduced expressivity for non-Euclidean domains unless the appropriate generalization (e.g., quaternionic or manifold-based rotations) is used. In spherical applications, metric scaling (radian vs. distance) remains an open area for fine-tuning (Unlu, 2024).
7. Summary and Future Directions
2D-RoPE constitutes a principled and empirically validated approach to injecting spatial or geometric inductive bias into Transformer models. Explicitly using block-diagonal rotation matrices parameterized by spatial, spherical, or directional variables allows for exact relative positional encoding, superior extrapolation, and enhanced geometric faithfulness compared to both absolute and 1D positional encodings.
Recent trends include leveraging Lie-theoretic constructions for full high-dimensional coupling (LieRE), extending to continuous and irregular domains, and increasing directionality/flexibility (Spiral RoPE, GeoPE). Empirical results underline the value of geometric positional encoding in both standard vision and emerging spatial/temporal applications. Ongoing developments focus on better integration with manifold data, learnable frequency parameterizations, and further reducing edge effects and artificial locality biases (Yao et al., 4 Dec 2025, Ostmeier et al., 2024, Liu et al., 3 Feb 2026, Unlu, 2024, Heo et al., 2024, Zivanovic et al., 26 May 2025, Zhao et al., 19 Mar 2025).