Learnable Angle Matrices (ComRoPE)
- Learnable Angle Matrices (ComRoPE) are trainable skew-symmetric matrices that generalize rotary positional encodings to ensure robust relative positioning.
- They employ axial-partition and linearly dependent parameterizations to guarantee commutativity, supporting both one- and multi-dimensional inputs.
- ComRoPE leverages efficient block-structured Givens rotations and round-robin scheduling for scalable transformer implementations on high-resolution tasks.
Learnable angle matrices, as realized in ComRoPE (rotary position embedding parameterized by trainable commuting angle matrices), generalize rotary positional encodings (RoPE) by parameterizing position-dependent rotations using trainable angle (skew-symmetric) matrices subject to a strict commutativity requirement. ComRoPE is designed to address the limitations of fixed rotary mechanisms in transformer architectures, guaranteeing robustness to position offsets, scalability to high-resolution domains, and adaptability across one- and multi-dimensional structured inputs. The theoretical foundation relies on the algebra of commuting skew-symmetric generators that define families of orthogonal transformations satisfying essential properties of relative positional encoding, and supports efficient implementation via block-structured Givens rotations or round-robin decompositions. Empirical results demonstrate ComRoPE's superiority on tasks including large-scale image classification and object detection, with consistent gains in both in-distribution and out-of-distribution settings (Yu et al., 4 Jun 2025).
1. Theoretical Foundations: Rotary Positional Embedding and the RoPE Equation
Rotary Positional Embedding (RoPE) integrates positional information into the attention mechanism through block-diagonal rotation matrices acting on embedding vectors. Standard RoPE operates by partitioning the embedding dimension $d$ into $d/2$ independent $2 \times 2$ blocks, each corresponding to a planar rotation with a manually designed angle schedule:

$$R(x) = \mathrm{diag}\big(R_1(x), \ldots, R_{d/2}(x)\big), \qquad R_j(x) = \begin{pmatrix} \cos(\theta_j x) & -\sin(\theta_j x) \\ \sin(\theta_j x) & \cos(\theta_j x) \end{pmatrix},$$

with fixed frequencies $\theta_j = 10000^{-2(j-1)/d}$. The attention calculation involves rotated queries and keys $\tilde{q}_m = R(m)\,q_m$ and $\tilde{k}_n = R(n)\,k_n$, yielding scores $\tilde{q}_m^{\top}\tilde{k}_n = q_m^{\top} R(m)^{\top} R(n)\,k_n$.
A central constraint for robust, relative positional encoding is the RoPE equation
$$R(x_1)^{\top} R(x_2) = R(x_2 - x_1) \qquad \text{for all positions } x_1, x_2,$$
which guarantees that attention depends only on the relative offset, thereby ensuring shift-invariance and scalability to arbitrary input lengths or resolutions.
ComRoPE generalizes this to
$$R(\mathbf{x}) = \exp\!\Big(\sum_{i=1}^{N} x_i A_i\Big),$$
where the $A_i$ are real skew-symmetric matrices in $\mathbb{R}^{d \times d}$ and $x_i$ encodes the position along axis $i$. The necessity-and-sufficiency theorem (Theorem 3.1, (Yu et al., 4 Jun 2025)) asserts that $R(\mathbf{x})$ satisfies the RoPE equation for arbitrary offsets if and only if all $A_i$ commute: $A_i A_j = A_j A_i$ for all $i, j$. This property enables consistent relative position encoding for both 1D and multidimensional data.
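The commutativity criterion is easy to check numerically. Below is a minimal NumPy/SciPy sketch (illustrative only, not the paper's code) that builds two commuting skew-symmetric generators as scalar multiples of a shared base block and verifies both the RoPE equation and orthogonality:

```python
# Numerical check (sketch): with commuting skew-symmetric generators A1, A2,
# the map R(x) = exp(x1*A1 + x2*A2) satisfies R(x)^T R(y) = R(y - x).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
k = 4                                  # block size (illustrative)

W = rng.standard_normal((k, k))
S = W - W.T                            # skew-symmetric base block
A1, A2 = 0.7 * S, -1.3 * S             # scalar multiples => A1 @ A2 == A2 @ A1

def R(x):
    """Orthogonal rotation for a 2-D position x = (x1, x2)."""
    return expm(x[0] * A1 + x[1] * A2)

x, y = rng.standard_normal(2), rng.standard_normal(2)
print(np.allclose(R(x).T @ R(y), R(y - x)))   # True: only the offset matters
print(np.allclose(R(x).T @ R(x), np.eye(k)))  # True: R(x) is orthogonal
```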
2. Constructing and Parameterizing Learnable Commuting Angle Matrices
ComRoPE provides explicit parameterizations that guarantee commutativity, leading to two principal variants:
Axial-Partition (AP) Parameterization
Partition $\mathbb{R}^d$ into $N$ diagonal blocks of size $k = d/N$, one per axis. For each axis $i$, only the $i$-th block of $A_i$ is non-zero:
$$A_i = \mathrm{diag}\big(0, \ldots, 0, S_i, 0, \ldots, 0\big), \qquad S_i = W_i - W_i^{\top},$$
where $W_i \in \mathbb{R}^{k \times k}$ is an unconstrained trainable matrix. Since each axis contributes a non-zero skew block in a distinct partition, $A_i A_j = A_j A_i = 0$ for $i \neq j$.
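A minimal sketch of the AP construction (names and shapes are illustrative assumptions, not the paper's implementation):

```python
# Axial-partition (AP) sketch: each axis owns one diagonal block, so the
# products of distinct angle matrices vanish and they commute.
import numpy as np

rng = np.random.default_rng(0)
N, k = 2, 4                               # axes and block size; d = N * k

def ap_angle_matrices(N, k):
    """One trainable skew block per axis, embedded on the diagonal."""
    mats = []
    for i in range(N):
        W = rng.standard_normal((k, k))   # unconstrained trainable block W_i
        S = W - W.T                       # skew-symmetrize: S_i = W_i - W_i^T
        A = np.zeros((N * k, N * k))
        A[i*k:(i+1)*k, i*k:(i+1)*k] = S   # only block i is non-zero
        mats.append(A)
    return mats

A = ap_angle_matrices(N, k)
# Disjoint diagonal supports imply A_i A_j = A_j A_i = 0 for i != j.
print(np.allclose(A[0] @ A[1], A[1] @ A[0]))  # True
```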
Linearly Dependent (LD) Parameterization
Learn a single base skew-symmetric matrix $S = W - W^{\top} \in \mathbb{R}^{k \times k}$ and per-axis scalars $\lambda_1, \ldots, \lambda_N$. Form
$$A_i = \lambda_i \,\mathrm{diag}\big(S, \ldots, S\big),$$
with the block $S$ repeated $d/k$ times along the diagonal. Since all $A_i$ are scalar multiples of a common matrix, they trivially commute.
Both parameterizations scale efficiently, use learnable blocks of small size $k$, and result in $N \cdot k(k-1)/2$ (AP) or $k(k-1)/2 + N$ (LD) free parameters.
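The LD variant is equally compact; a sketch under the same illustrative conventions, including the resulting parameter count:

```python
# Linearly dependent (LD) sketch: one shared skew block S tiled along the
# diagonal and scaled per axis, so all A_i are multiples of a common matrix.
import numpy as np

rng = np.random.default_rng(0)
N, k, reps = 3, 4, 2                  # axes, block size, repeats; d = k * reps
W = rng.standard_normal((k, k))
S = W - W.T                           # shared skew-symmetric base block
lam = rng.standard_normal(N)          # one trainable scalar per axis

def ld_angle_matrix(i):
    """A_i = lam_i * blockdiag(S, ..., S)."""
    return lam[i] * np.kron(np.eye(reps), S)

A = [ld_angle_matrix(i) for i in range(N)]
print(np.allclose(A[0] @ A[1], A[1] @ A[0]))  # True: scalar multiples commute
# Free parameters: k*(k-1)/2 for S plus N axis scalars.
print(k * (k - 1) // 2 + N)                   # 9 for this configuration
```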
3. Efficient Implementation: Givens Rotations, Round-Robin Scheduling, and GPU Utilization
ComRoPE extends the classical FFT-like or round-robin decompositions of orthogonal matrices using angle-parameterized Givens rotations (Mathieu et al., 2014, Hamze, 2021). For head dimension $d$, a rotation matrix is represented as a product of sparse layers
$$R = G_L \cdots G_2 G_1.$$
Here, each $G_\ell$ is a block-sparse matrix applying $d/2$ independent planar (Givens) rotations on disjoint coordinate pairs. The index-pair schedule is engineered in a "butterfly" or "round-robin" pattern for maximal parallelism; with $d/2$ rotations per layer and $L$ layers, the approach yields $O(dL)$ total operations for forward and backward passes (Mathieu et al., 2014).
In the round-robin method (Hamze, 2021), all $d(d-1)/2$ Givens rotations are organized into $d-1$ blocks of $d/2$ non-overlapping pairs, admitting $O(d)$ sequential depth in both the forward computation and backpropagation, ideally suited to GPU architectures.
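The schedule itself is the classic circle method from tournament scheduling. A small illustrative sketch (not tied to any particular codebase):

```python
# Round-robin (circle-method) schedule: d*(d-1)/2 index pairs arranged into
# d-1 rounds of d/2 disjoint pairs, so rotations within a round parallelize.
def round_robin_rounds(d):
    """Yield d-1 rounds; each round is a list of d/2 disjoint (i, j) pairs."""
    assert d % 2 == 0
    players = list(range(d))
    for _ in range(d - 1):
        yield [(players[i], players[d - 1 - i]) for i in range(d // 2)]
        # Rotate all entries except the first ("fixed") one.
        players = [players[0]] + [players[-1]] + players[1:-1]

for r, pairs in enumerate(round_robin_rounds(6)):
    print(r, pairs)   # every unordered pair appears exactly once overall
```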
Common implementation steps:
- Store and update only the angle parameters and small per-block intermediates.
- Forward and backward passes update activations layer/block by layer/block without forming full dense matrices.
- After each update, re-project rotation blocks to the orthogonal group (e.g., $\mathrm{SO}(2)$ for $2 \times 2$ blocks) to guard against numerical drift in orthogonality.
- Runtime overhead is negligible ($O(dL)$ FLOPs per vector for $L$ rotation layers); a minimal sketch of these steps follows.
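The sketch below illustrates the blockwise application step; the helper `apply_givens_layer` and all shapes are assumptions for illustration, not the authors' implementation:

```python
# Apply one layer of angle-parameterized Givens rotations to a batch of
# activations over disjoint coordinate pairs, never forming a dense d x d matrix.
import numpy as np

def apply_givens_layer(x, pairs, thetas):
    """x: (batch, d); pairs: disjoint (i, j) indices; thetas: one angle per pair."""
    x = x.copy()
    for (i, j), t in zip(pairs, thetas):
        c, s = np.cos(t), np.sin(t)
        xi, xj = x[:, i].copy(), x[:, j].copy()
        x[:, i] = c * xi - s * xj
        x[:, j] = s * xi + c * xj
    return x

rng = np.random.default_rng(0)
batch, d = 8, 6
x = rng.standard_normal((batch, d))
pairs = [(0, 5), (1, 4), (2, 3)]             # one round-robin round
thetas = rng.standard_normal(len(pairs))     # trainable angle parameters
y = apply_givens_layer(x, pairs, thetas)
# Givens layers are orthogonal, so vector norms are preserved exactly.
print(np.allclose(np.linalg.norm(x, axis=1), np.linalg.norm(y, axis=1)))
```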
4. Integration into Transformers and Attention Mechanisms
Within transformer-based models, ComRoPE replaces fixed RoPE with dynamically learned, commuting angle-based rotations. For each token (or patch) and each axis, the positional coordinates modulate the associated angle matrices:
$$R(\mathbf{x}) = \exp\!\Big(\sum_{i=1}^{N} x_i A_i\Big).$$
The query/key embeddings at position $\mathbf{x}$ are updated to $\tilde{q} = R(\mathbf{x})\,q$ and $\tilde{k} = R(\mathbf{x})\,k$. Standard attention proceeds using $\tilde{Q}$, $\tilde{K}$:
$$\mathrm{Attn}(\tilde{Q}, \tilde{K}, V) = \mathrm{softmax}\!\big(\tilde{Q}\tilde{K}^{\top}/\sqrt{d}\big)\,V.$$
All rotation parameters are shared across the batch and sequence dimensions, with gradients accumulated during backpropagation. Practical instantiations initialize using FFT "twiddle" factors, RoPE sinusoids, or uniform randomization.
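A simplified end-to-end sketch (single head, one positional axis, LD-style shared block; all names are illustrative, and the dense `expm` would be replaced by the blockwise routines above in practice):

```python
# ComRoPE-style attention sketch: rotate queries/keys by position-dependent
# R(x) before the dot product; scores then depend only on relative offsets.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
seq, d, k = 5, 8, 4
W = rng.standard_normal((k, k))
S = np.kron(np.eye(d // k), W - W.T)   # shared skew block, tiled (LD form)

def rotate(v, pos):
    return expm(pos * S) @ v           # R(pos) v

Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
positions = np.arange(seq, dtype=float)

Qr = np.stack([rotate(q, p) for q, p in zip(Q, positions)])
Kr = np.stack([rotate(kv, p) for kv, p in zip(K, positions)])

scores = Qr @ Kr.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V                         # standard attention on rotated q/k

# Shift-invariance: offsetting every position by a constant leaves scores unchanged.
Qs = np.stack([rotate(q, p + 10.0) for q, p in zip(Q, positions)])
Ks = np.stack([rotate(kv, p + 10.0) for kv, p in zip(K, positions)])
print(np.allclose(Qr @ Kr.T, Qs @ Ks.T))   # True
```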
5. Empirical Performance and Robustness
ComRoPE's effectiveness is most evident in settings where positional robustness and extrapolation to out-of-distribution resolutions or input lengths are critical.
Classification and Detection
On ImageNet-1K using ViT-B/16, ComRoPE-LD yields 65.49% top-1 accuracy at the training resolution (+2.4% absolute over LieRE) and 55.29% at an extrapolated, higher test resolution (+2.9% absolute) (Yu et al., 4 Jun 2025). Object detection experiments (MS COCO, ViT-S backbone) yield 44.7 AP for ComRoPE-LD, slightly outperforming LieRE at 44.5 AP while using approximately half the additional parameters.
Ablation and Stress Testing
- ComRoPE variants display invariance to coordinate shifts, while non-commuting formulations (e.g., LieRE) degrade under uniform coordinate perturbations.
- Block size trades performance against computational cost; accuracy peaks at an intermediate block size, beyond which larger blocks add overhead without commensurate gains.
- Robustness to training-time position perturbation is intrinsic in ComRoPE; gains from artificial perturbation are minimal compared to vulnerable baselines (APE +19.5% vs. ComRoPE-LD +2.9%).
6. Generalization Properties and Relation to Prior Work
ComRoPE formally subsumes both absolute and rotary positional encoding as special cases:
- If the rotations are allowed to depend on absolute coordinates without imposing the commuting (relative) constraint, the mechanism reduces to a form of absolute position encoding.
- With block size $k = 2$ and fixed (non-trainable) skew-symmetric blocks carrying the sinusoidal frequency schedule, standard RoPE is recovered (verified numerically in the sketch below). Thus, ComRoPE constitutes a strict superset of previous rotary-based encodings (Yu et al., 4 Jun 2025).
Expressivity is governed by parameter count and block size; larger blocks allow richer, higher-dimensional transformations at increased computational overhead.
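The RoPE reduction can be verified directly; the following sketch freezes $2 \times 2$ blocks at the standard sinusoidal frequencies and checks that the matrix exponential reproduces the classic cos/sin rotation:

```python
# Reduction check: with 2x2 blocks and fixed frequencies theta_j, exp(x * A)
# equals the standard RoPE rotation matrix at position x.
import numpy as np
from scipy.linalg import expm

d = 8
j = np.arange(d // 2)
theta = 10000.0 ** (-2 * j / d)           # standard RoPE frequency schedule
x = 3.0                                   # an arbitrary position

# ComRoPE view: block-diagonal skew generator with frozen angles.
A = np.zeros((d, d))
for jj, t in enumerate(theta):
    A[2*jj, 2*jj + 1], A[2*jj + 1, 2*jj] = -t, t

# Classic RoPE view: explicit cos/sin blocks.
R_rope = np.zeros((d, d))
for jj, t in enumerate(theta):
    c, s = np.cos(t * x), np.sin(t * x)
    R_rope[2*jj:2*jj+2, 2*jj:2*jj+2] = [[c, -s], [s, c]]

print(np.allclose(expm(x * A), R_rope))   # True
```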
7. Future Directions and Open Problems
Key avenues for future research highlighted include:
- Developing optimized routines for small-matrix exponentiation (exact closed forms, lookup tables) to minimize computational cost for block sizes $k > 2$; the $k = 2$ case already admits the closed form shown in the sketch after this list.
- Investigating weaker commutativity constraints that may permit a wider space of trainable angle matrices.
- Extending the approach to more general structured data (multi-dimensional grids, point clouds, videos) and scaling to LLMs where efficient exponentiation is paramount.
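For reference, the $k = 2$ case has an exact closed form (a planar rotation), which the following quick check compares against a general-purpose `expm`:

```python
# Closed-form exponential of a 2x2 skew-symmetric block: a plane rotation.
import numpy as np
from scipy.linalg import expm

def exp_skew2(theta):
    """exp([[0, -theta], [theta, 0]]) in closed form."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = 0.37
A = np.array([[0.0, -theta], [theta, 0.0]])
print(np.allclose(exp_skew2(theta), expm(A)))  # True; no iterative expm needed
```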
A plausible implication is that such directions may enhance generalization, facilitate efficient fine-tuning for new modalities, and support architectures requiring relative encoding over long spatial-temporal contexts.
Summary Table: ComRoPE Parameterizations
| Variant | Structure | Parameters (per block) |
|---|---|---|
| AP | One non-zero skew block per axis (disjoint diagonal supports) | $k(k-1)/2$ |
| LD | Shared skew block $S$, scaled per axis | $k(k-1)/2$ shared, plus $N$ axis scalars |
ComRoPE establishes a mathematically principled, scalable, and empirically robust foundation for learnable rotary embeddings in transformer models, unifying previous positional encoding strategies and advancing state-of-the-art accuracy and generalization (Yu et al., 4 Jun 2025, Hamze, 2021, Mathieu et al., 2014).