
Unified RoPE for Hybrid Models

Updated 30 December 2025
  • The paper introduces a unified RoPE framework leveraging Lie group theory to extend rotary positional encoding in hybrid architectures.
  • It applies unified rotary operations uniformly across Transformers, state-space models, and multimodal networks, ensuring consistent relative-position modeling.
  • Empirical results demonstrate improvements such as 42.3% faster training and enhanced performance in tasks like retrieval and recommendation.

Unified Rotary Position Embedding (RoPE) for Hybrid Models

Unified Rotary Position Embedding (RoPE) frameworks address the limitations of conventional positional encoding mechanisms in hybrid models, facilitating coherent relative-position modeling across architectures such as Transformers, State Space Models (SSMs), and multimodal networks. Unified RoPE is essential for hybrid systems that integrate explicit self-attention positional encodings (RoPE) and implicit state-space representations, ensuring spectral and geometric compatibility within long-context and cross-modal tasks.

1. Mathematical Formulation and General Principles

Unified RoPE extends the canonical block-diagonal rotation construction of RoPE to multiple coordinate dimensions and modalities, ensuring properties such as relativity (dependency on relative offset only) and reversibility (distinct positions remain recoverable) (Liu et al., 7 Apr 2025). The generalized formulation is grounded in Lie group theory:

  • For a position vector $x \in \mathbb{R}^N$, select a family of $N$ commuting, independent skew-symmetric generators $B_i \in \mathfrak{so}(d)$ spanning a maximal abelian subalgebra (MASA).
  • The unified rotary operator is $R_x = \exp\left(\sum_{i=1}^N x^{(i)} B_i\right)$, which requires $\dim(\mathrm{MASA}) = \lfloor d/2 \rfloor \ge N$.
  • Standard RoPE (1D) corresponds to $N = 1$ and $d = 2$ (the maximal toral subalgebra of $\mathfrak{so}(2)$); $N$-dimensional RoPE uses $N$ independent rotation planes in $\mathrm{SO}(2N)$.

For hybrid models, both queries/keys (Transformer) and auxiliary vectors (SSMs, multimodal) receive identical rotary operations, enabling a unified relative-position spectrum. For positions $m, n$, the rotary similarity is $q_m^{\top} R_{n-m} k_n$; for SSM blocks, the update mirrors this structure with rotated state-space parameters (Wu et al., 11 Jun 2025).
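As an illustration, the construction above can be sketched in NumPy: since each generator $B_i$ acts on its own 2×2 plane, the generators commute and the matrix exponential reduces to a block-diagonal rotation, and the relativity property (the score depends only on $n - m$) can be checked numerically. The function name `rotary_matrix` and the frequency schedule below are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def rotary_matrix(x, freqs):
    """Unified rotary operator R_x = exp(sum_i x_i * freqs_i * B_i).

    Each generator B_i is a skew-symmetric 2x2 block acting on its own
    plane (a maximal abelian subalgebra), so the exponential reduces to
    a block-diagonal matrix of ordinary 2x2 rotations.
    x: (N,) position vector; freqs: (N,) one base frequency per axis.
    Returns a (2N, 2N) block-diagonal rotation.
    """
    N = len(x)
    R = np.zeros((2 * N, 2 * N))
    for i in range(N):
        a = x[i] * freqs[i]
        R[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return R

# Relativity check: (R_m q) . (R_n k) depends only on the offset n - m,
# because R_m^T R_n = R_{n-m} for commuting same-plane rotations.
rng = np.random.default_rng(0)
freqs = 1.0 / (10000.0 ** (np.arange(3) / 3))
q, k = rng.standard_normal(6), rng.standard_normal(6)
m, n = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 5.0])
lhs = (rotary_matrix(m, freqs) @ q) @ (rotary_matrix(n, freqs) @ k)
rhs = q @ (rotary_matrix(n - m, freqs) @ k)
assert np.allclose(lhs, rhs)
```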

2. Integration into Hybrid Architectures

Unified RoPE achieves architectural consistency by homogenizing positional representations throughout both self-attention and state-space modules (Wu et al., 11 Jun 2025). A typical hybrid (e.g., TransXSSM) comprises:

  • An explicit block-diagonal rotary embedding $R_{\Theta,m}^d$ applied to all tokens in both modules.
  • For attention: $f_Q(q, m) = R_{\Theta,m}^d q$ and $f_K(k, n) = R_{\Theta,n}^d k$.
  • For SSM: $f_C(c, m) = R_{\Theta,m}^d c$ and $f_B(b, n) = R_{\Theta,n}^d b$.
  • Relative-position scores are unified: $q_m^{\top} R_{n-m} k_n$ (attention) and $c_m^{\top} R_{n-m} b_n$ (SSM), with $R_{n-m}$ the same operator in both.

Integration into hybrid stacks is immediate: the rotary rotation is applied as a shared preprocessing step for all module vectors, and no additional reparameterization is required to switch between attention and SSM layers.
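A minimal sketch of this shared preprocessing, assuming a toy `apply_rotary` helper (an illustration, not TransXSSM's actual implementation): the identical rotation is applied to attention queries/keys and to SSM C/B vectors, so both scores depend only on the relative offset.

```python
import numpy as np

def apply_rotary(v, pos, freqs):
    """Rotate interleaved dimension pairs (2i, 2i+1) of v by pos * freqs[i].

    The same function serves attention (q, k) and SSM (C, B) vectors,
    so both module types share one relative-position spectrum.
    """
    a = pos * freqs                       # (d/2,) per-plane angles
    cos, sin = np.cos(a), np.sin(a)
    v2 = v.reshape(-1, 2)
    out = np.empty_like(v2)
    out[:, 0] = cos * v2[:, 0] - sin * v2[:, 1]
    out[:, 1] = sin * v2[:, 0] + cos * v2[:, 1]
    return out.reshape(v.shape)

d = 8
freqs = 1.0 / (10000.0 ** (np.arange(d // 2) / (d // 2)))
rng = np.random.default_rng(1)
q, k = rng.standard_normal(d), rng.standard_normal(d)
c, b = rng.standard_normal(d), rng.standard_normal(d)

# Attention score and SSM interaction use the identical operator;
# both therefore depend only on the relative offset n - m.
m, n = 3, 7
attn = apply_rotary(q, m, freqs) @ apply_rotary(k, n, freqs)
ssm  = apply_rotary(c, m, freqs) @ apply_rotary(b, n, freqs)
assert np.isclose(attn, q @ apply_rotary(k, n - m, freqs))
```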

3. Frequency Allocation and Block Structure

Hybrid rotary schemes for multimodal and long-context tasks require judicious frequency allocation. HoPE (Li et al., 26 May 2025) and MM-RoPE (Yuan et al., 11 Jul 2025) exemplify strategies:

  • HoPE: interprets position as $(t, x, y)$ (temporal, spatial-horizontal, spatial-vertical), splits the hidden dimension among these axes, and critically sets the temporal frequencies to zero, i.e., $R_{(t, x, y)} = \operatorname{diag}(I_{|\mathcal{I}_t|}, \mathrm{rot}_x, \mathrm{rot}_y)$, ensuring semantic preference invariance even for arbitrarily large $\Delta t$.
  • MM-RoPE: splits the visual embedding space into meta-blocks, interleaving temporal and spatial rotations within each block for balanced spectral coverage. Scaling factors $(s_t, s_h, s_w)$ are applied to coordinate differences for spectral alignment; distributed block assignment ensures comprehensive frequency coverage for all axes.
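The HoPE-style allocation can be sketched as follows; the even three-way dimension split and the frequency schedule are illustrative assumptions, not the paper's exact configuration. Zeroing the temporal frequencies makes the per-plane angle vector independent of $t$:

```python
import numpy as np

def hope_angles(t, x, y, d):
    """Per-plane rotary angles for a (t, x, y) position, HoPE-style.

    The d/2 planes are split across the three axes (an even split here,
    for illustration), and the temporal frequencies are set to zero, so
    the temporal block reduces to the identity and relative scores are
    invariant to any temporal distance.
    """
    p = d // 2
    nt, nx = p // 3, p // 3
    ny = p - nt - nx
    fx = 1.0 / (10000.0 ** (np.arange(nx) / nx))
    fy = 1.0 / (10000.0 ** (np.arange(ny) / ny))
    return np.concatenate([np.zeros(nt),        # temporal block: frequency 0
                           x * fx, y * fy])     # spatial blocks rotate as usual

# Two frames far apart in time but at the same spatial position get
# identical angles, so Delta-t alone cannot perturb the similarity.
a1 = hope_angles(t=10, x=4, y=2, d=24)
a2 = hope_angles(t=500, x=4, y=2, d=24)
assert np.allclose(a1, a2)
```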

Circle-RoPE (Wang et al., 22 May 2025) further introduces cone-like decoupling by mapping image grid positions to a circular trajectory orthogonal to text positions, minimizing cross-modal token bias and achieving zero per-token distance (PTD).

4. Unification Strategies for Temporal and Ordinal Inputs

Recent advances for generative recommendation and sequence modeling unify ordinal and temporal sources via fused rotary angle computation (Wei et al., 23 Oct 2025):

  • Early fusion: the rotary angle combines the discrete index and the timestamp, $\theta_{i,k} = i\,\omega^p_k + \tau_i\,\omega^t_k$; the resulting similarity involves $\cos[(i-j)\omega^p_k + (\tau_i - \tau_j)\omega^t_k]$, yielding both constructive and interference terms.
  • Split-by-dimension/head: assign rotary planes or entire attention heads exclusively to index or time frequencies, then mix via the output projection for hybrid dependencies. Empirical studies indicate that a 30–50% split achieves optimal temporal/ordinal capacity.
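The early-fusion rule can be verified numerically in a single rotary plane; `fused_angle`, the frequencies `wp`/`wt`, and all values below are illustrative, not from the paper. Shifting both indices and both timestamps by a constant leaves the score unchanged, confirming that the similarity depends only on $(i-j)$ and $(\tau_i - \tau_j)$:

```python
import numpy as np

def fused_angle(i, tau, wp, wt):
    """Early-fusion TO-RoPE angle for one plane: theta = i*wp + tau*wt."""
    return i * wp + tau * wt

def rot(v, theta):
    """Rotate a 2D vector by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

wp, wt = 0.3, 0.05          # illustrative index / time frequencies
q = np.array([1.0, 0.5])
k = np.array([0.2, -1.0])
i, j = 7, 3                 # ordinal indices
ti, tj = 120.0, 40.0        # timestamps

score = rot(q, fused_angle(i, ti, wp, wt)) @ rot(k, fused_angle(j, tj, wp, wt))

# Translating indices by 5 and timestamps by 9 preserves both
# differences (i - j) and (tau_i - tau_j), hence the score.
shifted = (rot(q, fused_angle(i + 5, ti + 9.0, wp, wt))
           @ rot(k, fused_angle(j + 5, tj + 9.0, wp, wt)))
assert np.isclose(score, shifted)
```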

Dynamic temporal scaling (DTS), as implemented in HoPE, randomizes temporal scale factors during training and inference, ensuring robustness to varying information density and allowing controlled extrapolation.

5. Implementation, Computational Design, and Hardware Optimization

Unified RoPE frameworks are computationally efficient and hardware-friendly (Chiang et al., 3 Dec 2025, Wu et al., 11 Jun 2025):

  • The fused rotary kernel (UniQL, Nemotron-H, Bamba-v2) collapses unpacking, matmul, gather, and 2×2 rotation into a single GPU kernel under INT4 quantization and pruning, leveraging symmetric channel sorting to preserve adjacency required by rotary operations.
  • With symmetric sorting, each dimension pair $(2i, 2i+1)$ remains contiguous post-pruning, ensuring rotary-safe projection.
  • The kernel incurs negligible memory overhead and improves per-token latency by roughly 10%. Throughput gains of 2.7×–3.4× are observed across pruned/quantized hybrid models, with no measurable accuracy loss.
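A NumPy sketch of the pair-wise, rotary-safe channel selection described above; the joint-norm scoring criterion and the `rotary_safe_prune` helper are illustrative assumptions, and the actual UniQL kernel fuses this selection with INT4 matmul on GPU rather than materializing a pruned matrix:

```python
import numpy as np

def rotary_safe_prune(W, keep_pairs):
    """Prune projection columns in (2i, 2i+1) pairs.

    Channels are scored pair-wise (here by joint L2 norm, an illustrative
    criterion) and kept or dropped together, so every surviving rotary
    plane stays contiguous and the 2x2 rotation can still be applied
    (or fused into the projection kernel) after pruning.
    """
    rows, d = W.shape
    pair_norms = np.linalg.norm(W.reshape(rows, d // 2, 2), axis=(0, 2))
    keep = np.sort(np.argsort(-pair_norms)[:keep_pairs])   # best pairs, order kept
    cols = np.stack([2 * keep, 2 * keep + 1], axis=1).ravel()
    return W[:, cols]

W = np.arange(4 * 8, dtype=float).reshape(4, 8)
Wp = rotary_safe_prune(W, keep_pairs=2)
# Half the planes are dropped, but each surviving column still sits
# next to its plane partner, so the rotary rotation remains valid.
assert Wp.shape == (4, 4)
```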

Unified rotary implementations are compatible with standard Transformer stacks (e.g., Llama3, Falcon, Qwen2), SSMs (Mamba2), and multimodal hybrids; they require only the replacement of the rotary angle computation without changes to MLPs, layer norms, or other core weights.

6. Empirical Validation and Comparative Benchmarks

Unified RoPE designs consistently yield improvements or maintain performance relative to standard positional encodings:

  • In hybrid models (TransXSSM), unified rotary yields 42.3% faster training and 29.5% faster inference than pure Transformer models at 4K context; perplexity improves by 2.4%, with scaling gains exceeding alternatives such as pure SSMs or vanilla Transformers (Wu et al., 11 Jun 2025).
  • HoPE delivers up to +22.23% absolute gain on needle-in-a-haystack video retrieval, with longer contexts and greater model size magnifying the advantage (Li et al., 26 May 2025).
  • MM-RoPE achieves faster convergence and lower cross-entropy loss than baseline RoPE variants for video generation, with strong text–image alignment and no added runtime cost (Yuan et al., 11 Jul 2025).
  • Circle-RoPE eliminates cross-modal positional bias (PTD=0), with staggered-layer application outperforming uniform grid or all-circle variants by ~0.3–0.4 points on vision–language metrics (Wang et al., 22 May 2025).
  • Split TO-RoPE instantiations outperform embedding/bias baselines in sequential recommendation (HR@10 boost from 0.3341 to 0.3406, NDCG@10 from 0.2027 to 0.2059 on ML-20M) (Wei et al., 23 Oct 2025).

Comparison Table: Empirical Results Across Unified RoPE Designs

Model / Method           Metric          Standard RoPE   Unified RoPE Variant   Gain
TransXSSM-1.3B           PPL (8k eval)   8.38            8.18                   −2.4%
HoPE (Video-MME, 32k)    Retrieval (%)   59.13           59.44                  +0.31
HoPE (V-NIAH, 32k)       Retrieval (%)   52.00           63.56                  +22.23
MM-RoPE (GenEval)        Alignment       0.580           0.601                  +0.021
TO-RoPE (ML-20M)         HR@10           0.3341          0.3406                 +0.0065

7. Limitations, Controversies, and Open Directions

Unified RoPE approaches require careful choice of frequency allocation, scaling, and block arrangement for each modality or hybrid. For extremely long sequences (≫256k tokens), sliding-window or layer-wise alternation may necessitate learnable or multi-scale masking (Yang et al., 30 Jan 2025). Theoretical work on cross-dimensional mixing via learned bases ($Q \in \mathrm{SO}(d)$) has clarified architectural tradeoffs but incurs extra parameters and computation.

Current research investigates soft interpolation between RoPE and NoPE, learnable basis rotation, and integration with alternative schemes such as Alibi or state-space duals. As hybrid models extend to multi-million-token contexts and increasingly complex multimodal fusion, new attention primitives and geometric positional frameworks are anticipated.
