ComRoPE: Scalable Rotary Position Encoding
- ComRoPE is a framework for trainable positional encoding in Transformers that replaces fixed rotations with learnable, commuting skew-symmetric matrices.
- It preserves relative offset invariance by enforcing the commutativity of rotation generators, ensuring robust performance across sequential and multidimensional data.
- Empirical results demonstrate that variants like ComRoPE-LD outperform traditional methods, achieving higher accuracy and enhanced robustness in diverse applications.
ComRoPE (Commuting Rotary Position Embedding) is a framework for positional encoding in Transformers that generalizes Rotary Positional Encoding (RoPE) by replacing fixed, hand-designed rotations with trainable, higher-dimensional rotations represented by commuting skew-symmetric matrices. This approach creates a scalable, robust, and theoretically principled method for embedding positional information in models handling sequential, spatial, or general multidimensional data. ComRoPE preserves the crucial “relative offset” property underpinning RoPE’s robustness, while enabling greater expressiveness and improved empirical performance in high-dimensional contexts.
1. Motivation and Limitations of Prior Methods
Absolute Positional Encoding (APE), such as sinusoidal encoding, is fixed after initialization and cannot be adapted during training. APE's fixed spectrum prevents generalization to longer or shifted input sequences and does not support learning of positional frequency content. Standard RoPE, as introduced in RoFormer, encodes positional information by applying a 2D rotation by angle $\theta_i m$ to each 2-dimensional slice $i$ of the query/key at position $m$. Here, the rotation angle $\theta_i m$ is a deterministic function of the position index $m$, with fixed frequencies $\theta_i = 10000^{-2i/d}$. The corresponding rotation matrix is

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$
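As a concrete illustration, here is a minimal NumPy sketch of vanilla RoPE (illustrative, not the paper's reference code): each 2D slice of the vector is rotated by $\theta_i m$, and a quick check confirms that attention scores depend only on the positional difference. The frequency schedule follows RoFormer.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: float) -> np.ndarray:
    """Vanilla RoPE: rotate each 2D slice of x by angle theta_i * pos."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # RoFormer frequency schedule
    angles = pos * theta                          # one angle per 2D slice
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation applied slice-wise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = np.random.randn(64), np.random.randn(64)
s1 = rope_rotate(q, 3) @ rope_rotate(k, 7)
s2 = rope_rotate(q, 103) @ rope_rotate(k, 107)    # same relative offset of 4
assert np.allclose(s1, s2)                        # score depends only on 7 - 3
```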
This method is efficient and robust to absolute position offsets, but it is fundamentally limited by:
- The use of 2D rotations (low expressivity in higher dimensions),
- Manually fixed, non-trainable angles,
- Restricted ability to extend to general rotation groups without losing offset-robustness.
The motivation for ComRoPE is to devise a parameterization of RoPE that is i) fully trainable, ii) robust to input offsets (shift-invariance), and iii) scalable to higher-dimensional embeddings.
2. Formalization: The RoPE Equation
ComRoPE is grounded in a formal definition of rotary positional encoding. Let $f(\mathbf{x}, \mathbf{p})$ insert position $\mathbf{p} \in \mathbb{R}^N$ into query vector $\mathbf{x} \in \mathbb{R}^d$, and let $\langle \cdot, \cdot \rangle$ be the standard dot-product similarity. The model requires a matrix-valued function $R: \mathbb{R}^N \to \mathbb{R}^{d \times d}$ such that:
- $f(\mathbf{x}, \mathbf{p}) = R(\mathbf{p})\,\mathbf{x}$,
- each $R(\mathbf{p})$ is orthogonal, so the rotation preserves vector norms,
- $\langle f(\mathbf{q}, \mathbf{p}_1), f(\mathbf{k}, \mathbf{p}_2) \rangle$ captures the relative-positional similarity.
For RoPE to provide offset-invariant attention, the following "RoPE Equation" must hold (Proposition 2.1):

$$R(\mathbf{p}_1)^\top R(\mathbf{p}_2) = R(\mathbf{p}_2 - \mathbf{p}_1) \quad \text{for all } \mathbf{p}_1, \mathbf{p}_2 \in \mathbb{R}^N.$$

This guarantees that relative attention depends only on positional differences, not on absolute positions.
3. The Commutativity Constraint
ComRoPE parameterizes $R(\mathbf{p})$ using skew-symmetric matrices $A_1, \dots, A_N$:

$$R(\mathbf{p}) = \exp\Big(\sum_{i=1}^{N} p_i A_i\Big), \qquad A_i^\top = -A_i.$$

The central result (Theorem 3.1) establishes that the RoPE Equation holds for all $\mathbf{p}_1, \mathbf{p}_2$ if and only if all $A_i$ pairwise commute:

$$A_i A_j = A_j A_i \quad \text{for all } i, j.$$

This requirement is both necessary and sufficient. Commuting generators ensure exact offset-robustness because the matrix exponential exactly factorizes without higher-order cross-terms.
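Spelling out the factorization under the parameterization above: skew-symmetry gives $R(\mathbf{p})^\top = \exp(-\sum_i p_i A_i)$, and commutativity lets the two exponentials merge:

$$R(\mathbf{p}_1)^\top R(\mathbf{p}_2) = \exp\Big(-\sum_i p_{1,i} A_i\Big) \exp\Big(\sum_i p_{2,i} A_i\Big) = \exp\Big(\sum_i (p_{2,i} - p_{1,i})\, A_i\Big) = R(\mathbf{p}_2 - \mathbf{p}_1),$$

where the middle equality uses $e^X e^Y = e^{X+Y}$, valid exactly when $XY = YX$.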
4. Parameterizations: Trainable Commuting Angle Matrices
Two distinct parameterizations are proposed to enforce the commutativity of skew-symmetric matrices:
4.1 Axial-Partition (ComRoPE-AP)
- The embedding dimension $d$ is divided into blocks of size $k$ (so there are $d/k$ blocks in total).
- Each block, together with the axis $i$ it is assigned to, is associated with a trainable skew-symmetric $k \times k$ matrix.
- For each axis $i$, the generator $A_i$ is block-diagonal: it carries nonzero skew-symmetric blocks only at the block positions assigned to axis $i$, and zeros elsewhere. Since different axes occupy disjoint block positions, all $A_i$ are block-diagonal and commute.
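A minimal construction sketch of this idea (names and shapes are illustrative, with one block per axis for simplicity, not the paper's code):

```python
import numpy as np

def make_ap_generators(num_axes: int, block: int, seed: int = 0):
    """Block-diagonal commuting generators: axis i owns the i-th diagonal block."""
    rng = np.random.default_rng(seed)
    d = num_axes * block
    gens = []
    for i in range(num_axes):
        W = rng.standard_normal((block, block))
        A = np.zeros((d, d))
        A[i*block:(i+1)*block, i*block:(i+1)*block] = W - W.T  # skew-symmetric block
        gens.append(A)
    return gens

A1, A2 = make_ap_generators(num_axes=2, block=4)
assert np.allclose(A1 @ A2, A2 @ A1)   # disjoint blocks => generators commute
```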
4.2 Linearly-Dependent (ComRoPE-LD)
- Learn a single base skew-symmetric matrix $B$ and, for each axis $i$, a scalar $\lambda_i$.
- Set $A_i = \lambda_i B$. Since all $A_i$ are scalar multiples of $B$, they trivially commute.
Both constructions solve the RoPE Equation and guarantee offset-robustness.
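The linearly-dependent construction is even simpler to sketch, and the RoPE Equation can be verified numerically with `scipy.linalg.expm` (an illustrative check, not the reference implementation):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
B = W - W.T                              # shared skew-symmetric base (trainable)
lam = np.array([0.7, 1.3])               # per-axis scalars (trainable)
gens = [l * B for l in lam]              # A_i = lambda_i * B, pairwise commuting

def R(p):
    """R(p) = exp(sum_i p_i A_i)."""
    return expm(sum(pi * Ai for pi, Ai in zip(p, gens)))

p1, p2 = np.array([0.5, 1.0]), np.array([2.0, -1.5])
assert np.allclose(R(p1).T @ R(p2), R(p2 - p1))   # RoPE Equation holds
```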
5. Theoretical Foundations
Supporting lemmas demonstrate that for commuting matrices $A$ and $B$ (i.e., $AB = BA$),

$$e^{A} e^{B} = e^{A + B}.$$

This generalizes to any set of pairwise commuting matrices. Therefore, any collection of pairwise commuting skew-symmetric matrices $\{A_i\}$ produces a position-dependent transformation $R(\mathbf{p})$ that satisfies the RoPE Equation. Standard RoPE is a special case in which all $A_i$ are built from $2 \times 2$ blocks with hand-designed frequencies $\theta_j$.
The theoretical framework further shows that if $A_i = 0$ for all $i$, attention reduces to unrotated, standard dot-product attention, and that if $k = 2$ and each block is fixed to the standard rotation generator, vanilla RoPE is recovered.
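The role of commutativity in the lemma can be checked directly: the identity $e^A e^B = e^{A+B}$ holds for commuting skew-symmetric matrices but generically fails otherwise (an illustrative check):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 6))
B0 = W - W.T
A1, A2 = 0.5 * B0, 2.0 * B0              # scalar multiples commute
assert np.allclose(expm(A1) @ expm(A2), expm(A1 + A2))

V = rng.standard_normal((6, 6))
C = V - V.T                              # generically does not commute with B0
assert not np.allclose(expm(B0) @ expm(C), expm(B0 + C))  # cross-terms appear
```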
6. Empirical Performance
ComRoPE was evaluated on various benchmarks:
| Method | ImageNet-1K top-1 @224 (ViT-B/16) | ImageNet-1K top-1 @512 | MS COCO detection AP (ViT-S) | 3D classification (UCF-101) |
|---|---|---|---|---|
| APE | ~58.8% | N/A | 44.0 | Improved robustness |
| Vanilla RoPE | ~63.1% | N/A | N/A | Improved robustness |
| LieRE | 64.4% | 61.2% | 44.5 | Improved robustness |
| ComRoPE-AP | 65.3% | N/A | N/A | Improved robustness |
| ComRoPE-LD | 65.5% | 62.6% | 44.7 (+0.2) | Improved robustness |
- As reported in the paper, ComRoPE-LD surpasses LieRE by 1.6% at training resolution and by 2.9% at higher resolution.
- For object detection (MS COCO), ComRoPE-LD yields +0.2 AP over LieRE.
- For 3D classification (UCF-101), ComRoPE variants maintain improved robustness under varying resolution.
These results establish that ComRoPE’s learnable, commuting-rotation approach produces consistent accuracy gains and stabilization as input resolution increases.
7. Generalization, Practical Recommendations, and Resources
ComRoPE unifies multiple positional encoding schemes:
- If all $A_i = 0$, ComRoPE recovers standard dot-product attention.
- Setting the block size to $k = 2$ and fixing each block to the canonical rotation generator recovers the original RoPE.
- Allows richer, learnable feature rotations in higher dimensions, which are optimized via backpropagation.
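As a quick sanity check on the second point above, exponentiating the canonical $2 \times 2$ generator reproduces the familiar RoPE rotation matrix:

```python
import numpy as np
from scipy.linalg import expm

J = np.array([[0.0, -1.0],
              [1.0,  0.0]])              # canonical 2x2 rotation generator
theta = 0.3
assert np.allclose(expm(theta * J),
                   np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]]))
```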
Practical implementation considerations include:
- For images, positional coordinates are best represented on a relative, normalized scale.
- Centering patch coordinates and introducing synthetic perturbations at training further enhance robustness.
- Block size balances rotation expressiveness and computational cost, and is chosen empirically.
- An open-source reference implementation is available at https://github.com/Longin-Yu/ComRoPE.
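Putting the pieces together, here is a toy end-to-end sketch of ComRoPE-LD attention over a ViT-style patch grid, using centered and normalized coordinates as recommended above (illustrative only; see the repository for the actual implementation):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, grid = 16, 4                          # head dim, 4x4 patch grid
W = rng.standard_normal((d, d))
B = W - W.T                              # shared skew-symmetric base (trainable)
lam = np.array([1.0, 0.5])               # per-axis scalars for (y, x)

# Centered, normalized patch coordinates.
ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
coords = np.stack([ys, xs], -1).reshape(-1, 2) / grid - 0.5

Q = rng.standard_normal((grid * grid, d))
K = rng.standard_normal((grid * grid, d))
Rs = [expm((lam @ p) * B) for p in coords]   # A_i = lam_i * B => sum is (lam . p) B
Qr = np.stack([R @ q for R, q in zip(Rs, Q)])
Kr = np.stack([R @ k for R, k in zip(Rs, K)])
scores = Qr @ Kr.T / np.sqrt(d)          # offset-robust attention logits
```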
ComRoPE offers a scalable, flexible, and rigorously justified method for positional encoding in Transformers, particularly beneficial for contexts requiring high-dimensional, trainable, and offset-robust representations.