Rotary-based Unified Encoding
- Rotary-based unified encoding is a framework that parametrizes positions using rotation matrices in embedding spaces, ensuring relative-position awareness.
- It unifies diverse modalities such as text, video, and graphs by applying block-diagonal (or skew-symmetric) rotations across different structural axes.
- The method integrates theoretical guarantees like translation equivariance with empirical performance gains, optimizing transformer architectures.
A rotary-based unified encoding method refers to a parametrization of position (or analogous structural labels) via rotations in the embedding space, aligning all axes, modalities, or domains through a coherent, mathematically grounded framework. Rather than limiting positional information to absolute or relative biases, rotary-based unified methods use block-diagonal (or more generally, skew-symmetric) rotation matrices applied to embeddings, inducing desirable properties such as relative-position awareness, distance-dependent attenuation, and geometric alignment across heterogeneous domains. This approach subsumes and extends traditional RoPE, enables unification across modalities (temporal, spatial, semantic, etc.), and supports both fixed and learned, as well as input- or context-dependent, parametrizations.
1. Conceptual Foundations of Rotary-Based Unified Encoding
At the core, rotary-based unified encoding leverages the algebra of rotation groups (typically $\mathrm{SO}(2)$ blocks inside $\mathrm{SO}(d)$) over the embedding space. In the standard form, originally established in Rotary Position Embedding (RoPE), absolute token indices are mapped to block-diagonal rotation matrices acting on projected features. This yields self-attention logits that depend on relative position via phase differences, ensuring (a) translation equivariance for sequences, and (b) the ability to generalize to unseen context lengths without explicit lookup tables (Su et al., 2021).
Beyond 1D RoPE for text, unified rotary-based variants generalize these principles to joint spatiotemporal axes (Weng et al., 27 Dec 2025, Wang et al., 17 Jun 2025, Liu et al., 17 Feb 2025), spherical/geo encoding (Hu et al., 14 Jan 2026), graphs with arbitrary spectral topologies (Reid et al., 26 Sep 2025), and multi-modal or multi-head adaptive structures (Li et al., 12 Oct 2025). The key mathematical invariance is that inner products after rotary transforms depend only on relative position (or meaningful structural difference), which is critical for domains lacking canonical sequence order (e.g., graphs, causal sets, geospatial data).
2. Mathematical Framework and Core Procedures
All rotary-based unified encoding methods share a defining algebraic procedure: represent positions/labels/structural coordinates as “angles” or tuples thereof, then rotate pairs of embedding dimensions via blockwise orthogonal (or, in generalized cases, skew-symmetric) matrices. Specifically:
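- Index-parametric rotation:
  - For each token with position $p$, assign a frequency vector $\omega = (\omega_1, \dots, \omega_{d/2})$ (fixed or learned per subspace).
  - Form rotation angles for each $2$-dim subspace: $\theta_i(p) = p\,\omega_i$.
  - Apply $R(\theta_i(p)) = \begin{pmatrix} \cos\theta_i(p) & -\sin\theta_i(p) \\ \sin\theta_i(p) & \cos\theta_i(p) \end{pmatrix}$ on $(x_{2i-1}, x_{2i})$.
- Relative-position invariance arises from the compositional identity
  $$R(\theta_i(p))^\top R(\theta_i(q)) = R(\theta_i(q) - \theta_i(p)) = R((q - p)\,\omega_i),$$
  which generalizes to multivariate or matrix-based rotations under commutativity constraints (Yu et al., 4 Jun 2025).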
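- Fusion across axes proceeds via concatenation or summation of rotation angles for each spatiotemporal/structural axis, respecting geometry (e.g., circular, cylindrical, spherical). Writing $p^{(a)}$ for the coordinate along axis $a$ and $\omega_i^{(a)}$ for its frequencies, the two options take, e.g., the form
  $$\theta_i = \sum_{a} \omega_i^{(a)}\, p^{(a)} \quad \text{(summation)}, \qquad R = \bigoplus_{a} R^{(a)}\big(p^{(a)}\big) \quad \text{(concatenation)},$$
  as in CyRoPE (Weng et al., 27 Dec 2025) or joint spatiotemporal RoPE (Wang et al., 17 Jun 2025).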
- Higher-order/learnable/generalized variants: introduce trainable, commuting angle matrices (ComRoPE) (Yu et al., 4 Jun 2025) or input/context-mediated rotation schedules (CARoPE, Selective RoPE) (Veisi et al., 30 Jul 2025, Movahedi et al., 21 Nov 2025).
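The core procedure above can be illustrated with a minimal sketch in plain Python. It assumes the classical geometric frequency schedule with base 10000 from RoPE; the function names (`rope_frequencies`, `rotate`, `score`) and the example vectors are illustrative and not taken from any of the cited implementations. The sketch rotates consecutive pairs of a query and a key and checks that the resulting logit depends only on the offset between positions.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """One frequency per 2-dim subspace (assumed geometric schedule)."""
    assert dim % 2 == 0
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate(x, position, freqs):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by position * freqs[i]."""
    out = []
    for i, w in enumerate(freqs):
        c, s = math.cos(position * w), math.sin(position * w)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def score(q, k, m, n, freqs):
    """Attention logit between a query at position m and a key at position n."""
    qm, kn = rotate(q, m, freqs), rotate(k, n, freqs)
    return sum(a * b for a, b in zip(qm, kn))

freqs = rope_frequencies(8)
q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.3, -0.2, 0.7, 1.5]

# Same offset n - m = 4 at two different absolute locations.
s_near = score(q, k, 3, 7, freqs)
s_far = score(q, k, 103, 107, freqs)
print(abs(s_near - s_far) < 1e-9)
```

Shifting both positions by the same amount leaves the logit unchanged up to floating-point error, which is the translation equivariance that the compositional identity predicts.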
3. Domain Extensions: Spatiotemporal, Manifold, and Graph Topologies
Rotary-based unified encoding has been systematically extended beyond linear or grid sequences:
- Cylindrical/annular topology: In SPECTRE’s CyRoPE (Weng et al., 27 Dec 2025), spatial positions (e.g., sEMG electrode channels) are modeled as points on a circle, with rotary phases increasing proportionally to physical angular separation. The time/channel subspaces are each embedded by respective rotary blocks, factorized and concatenated, yielding a position-augmented embedding that respects the cylindrical sensor topology.
- Video and multimodal settings: VRoPE (Liu et al., 17 Feb 2025) and joint spatiotemporal RoPE (Wang et al., 17 Jun 2025) implement rotation-based position encoding that fuses spatial and temporal dimensions, corrects for attention bias, and enables smooth modality transitions (e.g., video to text). Symmetric pairing of positive/negative directions cancels asymmetric decay in long-range attention (Liu et al., 17 Feb 2025).
- Graphs and arbitrary metric spaces: WIRE (Reid et al., 26 Sep 2025) constructs node-wise rotation angles from the top eigenvectors of the graph Laplacian, applying the same block-diagonal rotary structure. On grid-like graphs, this reduces to Cartesian RoPE; on general topologies, it induces resistance-distance-aware attention decay.
- Geometric and physical manifolds: SpatCode (Hu et al., 14 Jan 2026) encodes spatial (geographic) positions as $3$-vector points on the unit sphere, time as a point on the unit circle, and semantic features as real vectors, concatenating these into a single rotary-encoded feature for unified cosine-based retrieval.
- Hyperbolic/spherical manifolds: 3D-RPE (Ma et al., 2024) extends encoding to tokens viewed as qubit states on the Bloch sphere, decoupling inter- and intra-chunk angles for long-context modeling with controlled decay and higher position resolution.
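As a rough illustration of the cylindrical case described above, the sketch below concatenates a time block and a ring block of rotation angles. It is not the CyRoPE implementation: the use of integer harmonics of the electrode angle, the frequency values, and all names (`cylindrical_angles`, `logit`, `harmonics`) are assumptions chosen so that the phase is proportional to angular separation and wraps after a full turn.

```python
import math

def rotate_pairs(x, angles):
    """Rotate consecutive pairs of x by the given angles (one angle per pair)."""
    out = []
    for i, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def cylindrical_angles(t, channel, num_channels, time_freqs, harmonics):
    """Angles for the time block followed by angles for the ring block."""
    time_angles = [t * w for w in time_freqs]
    # Assumed choice: integer harmonics of the channel's angle on the ring.
    phi = 2.0 * math.pi * channel / num_channels
    ring_angles = [h * phi for h in harmonics]
    return time_angles + ring_angles

def logit(q, k, pos_q, pos_k, num_channels, time_freqs, harmonics):
    """Dot product of a query and key rotated at (time, channel) positions."""
    aq = cylindrical_angles(*pos_q, num_channels, time_freqs, harmonics)
    ak = cylindrical_angles(*pos_k, num_channels, time_freqs, harmonics)
    return sum(a * b for a, b in zip(rotate_pairs(q, aq), rotate_pairs(k, ak)))

time_freqs, harmonics, C = [1.0, 0.1], [1, 2], 8
q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.3, -0.2, 0.7, 1.5]

# Time offset 4 and ring offset +2 channels, once across the seam (7 -> 1)
# and once in the interior (2 -> 4).
across_seam = logit(q, k, (5, 7), (9, 1), C, time_freqs, harmonics)
interior = logit(q, k, (20, 2), (24, 4), C, time_freqs, harmonics)
print(abs(across_seam - interior) < 1e-9)
```

Because the ring angles use whole-number multiples of the electrode angle, channels 7 and 1 are treated as two steps apart, just like channels 2 and 4, so the logit depends only on the time offset and the angular separation.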
4. Parametric, Adaptive, and Unified Rotary Generalizations
Several methodologies have generalized the rotary paradigm to address axis-coupling, adaptation, and commutativity constraints:
- ComRoPE (Yu et al., 4 Jun 2025): Generalizes RoPE by replacing fixed frequency blocks with trainable, commuting skew-symmetric matrices. The mathematical requirement is that the “RoPE Equation” $R(x)^\top R(y) = R(y - x)$ holds for all positions $x, y$ iff the generators commute, ensuring offset-aware equivariance.
- HARoPE (Li et al., 12 Oct 2025): Inserts a learnable linear transform via SVD before the rotary map in each attention head. This enables dynamic frequency reallocation, semantic alignment of rotary planes, and supports non-axis-aligned encoding, while guaranteeing attention still depends only on coordinate offsets.
- Input- and context-adaptive rotations: Selective RoPE and CARoPE (Movahedi et al., 21 Nov 2025, Veisi et al., 30 Jul 2025) generate rotation angles as explicit functions of token embeddings (per-head and/or per-token), replacing static frequencies with frequency bases parameterized by small neural networks, enabling per-head, per-token context-sensitivity within the rotary framework.
- Hybrid and manifold unified RoPE: TransXSSM (Wu et al., 11 Jun 2025) applies the same rotary encoding operator to both transformer attention and state-space model recurrences, maintaining spectral phase continuity across modules. HoPE (Dai et al., 5 Sep 2025) uses hyperbolic (Lorentzian) boosts instead of Euclidean rotations, yielding monotonic, exponentially decaying long-range attention with RoPE as the zero-curvature limit.
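The commutativity requirement noted for ComRoPE can be checked numerically on a toy case. The sketch below is not ComRoPE itself: it uses a product of two plane rotations as a stand-in for the matrix exponential of a sum of generators (the two coincide when the generators commute), and the helper names (`givens`, `rope_equation_gap`) and test positions are hypothetical. It compares $R(p)^\top R(q)$ with $R(q - p)$ for rotations acting on disjoint planes versus planes that share a coordinate.

```python
import math

def givens(n, i, j, theta):
    """n x n rotation by theta in the coordinate plane (i, j)."""
    m = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    m[i][i] = m[j][j] = math.cos(theta)
    m[i][j], m[j][i] = -math.sin(theta), math.sin(theta)
    return m

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def rope_equation_gap(planes, p, q, n=4):
    """Largest entry of |R(p)^T R(q) - R(q - p)| for a two-axis position."""
    def R(pos):
        return matmul(givens(n, *planes[0], pos[0]),
                      givens(n, *planes[1], pos[1]))
    lhs = matmul(transpose(R(p)), R(q))
    rhs = R((q[0] - p[0], q[1] - p[1]))
    return max(abs(lhs[r][c] - rhs[r][c]) for r in range(n) for c in range(n))

p, q = (1.0, 0.5), (0.3, 2.0)
commuting = rope_equation_gap([(0, 1), (2, 3)], p, q)    # disjoint planes
overlapping = rope_equation_gap([(0, 1), (1, 2)], p, q)  # shared coordinate 1
print(commuting < 1e-12, overlapping > 1e-3)
```

With disjoint planes the per-axis rotations commute and the gap is at floating-point noise level; when the planes overlap, the rotations no longer commute and the product depends on absolute positions rather than only on the offset.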
5. Practical Integration, Implementation, and Empirical Impact
Rotary-based unified encoding integrates efficiently into modern transformer-style models:
- Implementation: For a $d$-dimensional embedding, split into $d/2$ pairs and apply a $2 \times 2$ rotation to each pair. Practical implementations use precomputed sin/cos tables, vectorized “rotate every two” operations, and permit batched execution at $O(d)$ cost per token (Weng et al., 27 Dec 2025). For multi-axis settings (spatiotemporal, graph, video), angles are either concatenated or composed additively/multiplicatively.
- Training/Inference: Rotary encodings are differentiable and compatible with both full softmax and linear time attention kernels. Many variants carry no additional trainable parameters (classical RoPE), while adaptive or manifold-based extensions may include small neural networks or matrix parameters.
- Empirical Performance: Consistent accuracy gains are demonstrated across modalities:
- SPECTRE (CyRoPE): improvement of 3-4 points versus absolute PEs, with ablation showing both rotary factorization and frequency-domain pretraining are essential (Weng et al., 27 Dec 2025).
- ComRoPE: Improves top-1 accuracy on ImageNet-1K over the state-of-the-art LieRE (Yu et al., 4 Jun 2025).
- HARoPE: Outperforms multi-dimensional RoPE and STRING/RethinkRoPE on ImageNet FID and IN top-1 (Li et al., 12 Oct 2025).
- VRoPE: Achieves gains on long-video retrieval benchmarks (Liu et al., 17 Feb 2025).
- TransXSSM: Unifies SSM and attention for faster training and an accuracy gain at the 1.3B model scale (Wu et al., 11 Jun 2025).
- SpatCode: Outperforms filter-based and hybrid multi-index retrieval in both efficiency and recall by eliminating hard filtering via direct rotary encoding (Hu et al., 14 Jan 2026).
- WIRE: 1–3 point gains on graph classification/molecular property prediction, robust to graph topology (Reid et al., 26 Sep 2025).
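The implementation note in this section (precomputed sin/cos tables and a "rotate every two" operation) might look roughly like the following plain-Python sketch. The class name `RotaryTable`, the base-10000 frequency schedule, and the list-of-lists layout are assumptions for illustration; production code would typically use a tensor library and broadcast over batch and head dimensions.

```python
import math

class RotaryTable:
    """Caches cos/sin of position * frequency for positions 0..max_len-1."""

    def __init__(self, dim, max_len, base=10000.0):
        assert dim % 2 == 0
        freqs = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
        self.cos = [[math.cos(p * w) for w in freqs] for p in range(max_len)]
        self.sin = [[math.sin(p * w) for w in freqs] for p in range(max_len)]

    def apply(self, seq):
        """Rotate every token of seq (a list of dim-length vectors) by its index."""
        out = []
        for pos, x in enumerate(seq):
            cos_p, sin_p = self.cos[pos], self.sin[pos]
            even, odd = x[0::2], x[1::2]
            rotated = [0.0] * len(x)
            # "Rotate every two": each (even, odd) pair gets its own 2x2 rotation.
            rotated[0::2] = [a * c - b * s
                             for a, b, c, s in zip(even, odd, cos_p, sin_p)]
            rotated[1::2] = [a * s + b * c
                             for a, b, c, s in zip(even, odd, cos_p, sin_p)]
            out.append(rotated)
        return out

table = RotaryTable(dim=4, max_len=16)
tokens = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.9, -1.0, 0.3], [0.0, 1.0, 1.0, 1.0]]
rotated = table.apply(tokens)
```

Each token costs a constant number of multiplications per pair, so the work per token is linear in the embedding dimension, and the table is built once and reused for every sequence up to `max_len`. Since every block is a rotation, vector norms are preserved.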
6. Limitations, Theoretical Guarantees, and Generalization
- Commutativity and consistency: The RoPE Equation dictates that parametric or learned rotary operators must commute to preserve translation invariance and ensure the correct relative position dependence (Yu et al., 4 Jun 2025).
- Resolution and decay: 3D-RPE (Ma et al., 2024) shows that chunk-wise split rotary encoding yields higher resolution under position interpolation and enables controllable long-term decay, outperforming traditional RoPE on long-context NLU tasks.
- Numerical and computational considerations: Angles remain within floating-point stability for realistic sequence lengths, and batched vectorization makes rotary encoding as efficient as a small linear projection (Weng et al., 27 Dec 2025).
- Topology and geometry: Manifold-based encodings (hyperbolic, Bloch sphere, annular/cylindrical) match data geometry more accurately, yielding better synthetic and real-world generalization, e.g., resistance-distance decay on graphs (Reid et al., 26 Sep 2025), causal attenuation in biological data (Xu et al., 20 Sep 2025).
7. Representative Variants and Their Operating Regimes
| Variant | Domain/Geometry | Parametric/Adaptive | Key Features |
|---|---|---|---|
| RoPE | 1D sequences | Fixed frequencies | Relative offset encoding |
| CyRoPE | Spatiotemporal/cylinder | Axis-split, fixed | Temporal and annular spatial rotation |
| ComRoPE | Arbitrary | Trainable, commutative | Matrix exponentials, robust offset consistency |
| HARoPE | Images, N-D | Learnable per-head | SVD semantics, cross-axis coupling |
| Selective RoPE | General | Input/context adaptive | Angle gating, head-specific phase, unifies linear/softmax attention |
| Joint Spatio-temporal RoPE | Video, egocentric | Full-dim coupling | Jointly-rotated, non-axially split embeddings |
| WIRE | Graphs | Spectral basis | Laplacian wavelets, SE(3) invariance |
| HoPE | Text, long-range | Lorentz boosts | Monotonic decay, hyperbolic geometry |
| 3D-RPE | Sequences, Bloch-sphere | 2-angle, chunked | Decoupled intra-/inter-chunk, high position resolution |
| SpatCode | Spatiotemporal retrieval | Rotary circle + sphere | Distance-respecting, unit-norm, seamless plug-in |
All of the methods above conform to the defining rotary paradigm, giving a unified, geometry-aware positional encoding framework that carries across applications.
References:
SPECTRE (CyRoPE) (Weng et al., 27 Dec 2025), ComRoPE (Yu et al., 4 Jun 2025), Selective RoPE (Movahedi et al., 21 Nov 2025), VRoPE (Liu et al., 17 Feb 2025), CAPE (Xu et al., 20 Sep 2025), HoPE (Dai et al., 5 Sep 2025), SpatCode (Hu et al., 14 Jan 2026), RoFormer (Su et al., 2021), HARoPE (Li et al., 12 Oct 2025), EVA02-AT (Wang et al., 17 Jun 2025), WIRE (Reid et al., 26 Sep 2025), Of All StrIPEs (Agarwal et al., 7 Apr 2025), 3D-RPE (Ma et al., 2024).