
Learnable Angle Matrices (ComRoPE)

Updated 16 December 2025
  • Learnable Angle Matrices (ComRoPE) are trainable skew-symmetric matrices that generalize rotary positional encodings to ensure robust relative positioning.
  • They employ axial-partition and linearly dependent parameterizations to guarantee commutativity, supporting both one- and multi-dimensional inputs.
  • ComRoPE leverages efficient block-structured Givens rotations and round-robin scheduling for scalable transformer implementations on high-resolution tasks.

Learnable Angle Matrices (ComRoPE, rotary position embedding parameterized by trainable commuting angle matrices) generalize rotary positional encodings (RoPE) by parameterizing position-dependent rotations using trainable angle (skew-symmetric) matrices with strict commutativity requirements. ComRoPE is designed to address the limitations of fixed rotary mechanisms in transformer architectures, guaranteeing robustness to position offsets, scalability to high-resolution domains, and adaptability across one- and multi-dimensional structured inputs. The theoretical foundation relies on the algebra of commuting skew-symmetric generators that define families of orthogonal transformations satisfying the essential properties of relative positional encoding, and supports efficient implementation via block-structured Givens rotations or round-robin decompositions. Empirical results demonstrate ComRoPE's superiority on tasks including large-scale image classification and object detection, with consistent gains in both in-distribution and out-of-distribution settings (Yu et al., 4 Jun 2025).

1. Theoretical Foundations: Rotary Positional Embedding and the RoPE Equation

Rotary Positional Embedding (RoPE) integrates positional information into the attention mechanism through block-diagonal rotation matrices acting on embedding vectors. Standard RoPE operates by partitioning the embedding dimension $d$ into $d/2$ independent $2\times 2$ blocks, each corresponding to a planar rotation with a manually designed angle schedule:

$$R(p) = \operatorname{diag}\bigl(R_1(p), R_2(p), \dots, R_{d/2}(p)\bigr) \in SO(d)$$

with

$$R_j(p) = \begin{pmatrix} \cos\theta_j(p) & -\sin\theta_j(p) \\ \sin\theta_j(p) & \cos\theta_j(p) \end{pmatrix}, \quad \theta_j(p) = \frac{p}{10000^{2j/d}}$$

The attention calculation involves rotated queries and keys $q' = R(p_q)q$, $k' = R(p_k)k$, yielding

$$q'^\top k' = q^\top R(p_q)^\top R(p_k)\, k.$$

A central constraint for robust, relative positional encoding is the RoPE equation:

$$R(p_q)^\top R(p_k) = R(p_k - p_q) \quad \text{for all positions,}$$

which guarantees that attention depends only on the relative offset, thereby ensuring shift-invariance and scalability to arbitrary input lengths or resolutions.
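This property can be checked numerically. The following minimal NumPy sketch (illustrative, not the authors' code) builds the standard block-diagonal $R(p)$ and verifies the RoPE equation:

```python
import numpy as np

def rope_matrix(p, d):
    """Standard RoPE: block-diagonal matrix of d/2 planar rotations."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = p / 10000 ** (2 * j / d)      # classical angle schedule
        c, s = np.cos(theta), np.sin(theta)
        R[2*j:2*j+2, 2*j:2*j+2] = [[c, -s], [s, c]]
    return R

d, p_q, p_k = 8, 3.0, 7.0
lhs = rope_matrix(p_q, d).T @ rope_matrix(p_k, d)
assert np.allclose(lhs, rope_matrix(p_k - p_q, d))  # R(p_q)^T R(p_k) = R(p_k - p_q)
```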

ComRoPE generalizes $R(p)$ to

$$R(x;\mathcal{A}) = \exp\left(\sum_{i=1}^N A_i x_i\right)$$

where $\mathcal{A} = \{A_1,\dots,A_N\}$ are $N$ real skew-symmetric matrices in $\mathbb{R}^{d\times d}$, and $x\in\mathbb{R}^N$ encodes position along each axis. The necessity and sufficiency theorem (Theorem 3.1, (Yu et al., 4 Jun 2025)) asserts that $R(x;\mathcal{A})$ satisfies the RoPE equation for arbitrary offsets if and only if all $A_i$ commute: $[A_i, A_j]=0$ for all $i, j$. This property enables consistent relative position encoding for both 1D and multidimensional data.
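As a minimal numerical illustration of the theorem (assuming NumPy/SciPy; the commuting generators here are scalar multiples of one base skew-symmetric matrix, as in the LD variant of Section 2):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, N = 6, 2

P = rng.standard_normal((d, d))
S = P - P.T                        # skew-symmetric base matrix
A = [0.7 * S, 1.3 * S]             # [A_i, A_j] = 0 by construction

def R(x):
    """R(x; A) = exp(sum_i x_i A_i)."""
    return expm(sum(xi * Ai for xi, Ai in zip(x, A)))

x, y = rng.standard_normal(N), rng.standard_normal(N)
assert np.allclose(R(x).T @ R(y), R(y - x))   # RoPE equation holds
```

Replacing the generators with two generic, non-commuting skew matrices makes the assertion fail, matching the necessity direction of the theorem.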

2. Constructing and Parameterizing Learnable Commuting Angle Matrices

ComRoPE provides explicit parameterizations that guarantee commutativity, leading to two principal variants:

Axial-Partition (AP) Parameterization

Partition $d$ into $m$ blocks of size $b$, with $d = mb$. For each axis $i\in\{1,\dots,N\}$ and block $j\in\{1,\dots,m\}$,

$$B_{ij} = \begin{cases} P_j - P_j^\top, & j\equiv i\ (\mathrm{mod}\ N) \\ 0_{b\times b}, & \text{otherwise} \end{cases}$$

where $P_j$ is an unconstrained trainable $b\times b$ matrix. Then,

$$A_i = \operatorname{diag}(B_{i1}, B_{i2},\dots, B_{im}).$$

Only one axis contributes a non-zero skew block per partition, enforcing $[A_i,A_j]=0$.
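A minimal NumPy sketch of this construction (function and variable names are illustrative, not from the paper's code; indices are zero-based, so block $j$ is owned by axis $j \bmod N$):

```python
import numpy as np

def ap_generators(d, N, b, rng):
    """Axial-Partition (AP) ComRoPE generators: block j of axis i is
    P_j - P_j^T when j == i (mod N) and zero otherwise, so distinct
    A_i never act on the same block (illustrative sketch)."""
    m = d // b                                   # number of b x b blocks
    A = np.zeros((N, d, d))
    for j in range(m):
        P = rng.standard_normal((b, b))          # unconstrained trainable block
        i = j % N                                # the single axis owning block j
        A[i, j*b:(j+1)*b, j*b:(j+1)*b] = P - P.T
    return A

A = ap_generators(d=8, N=2, b=2, rng=np.random.default_rng(0))
assert np.allclose(A[0] @ A[1], A[1] @ A[0])     # commutativity by construction
```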

Linearly Dependent (LD) Parameterization

Produce a single base skew-symmetric matrix $S = P - P^\top$ and per-axis scalars $\theta_i$. Form

$$B_{i} = \theta_i S, \quad A_i = \operatorname{diag}(B_i, \ldots, B_i)$$

with $d/b$ repeats per axis. Since all $A_i$ are scalar multiples of a common block, they trivially commute.
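A corresponding sketch of the LD variant (again illustrative names, assuming NumPy):

```python
import numpy as np

def ld_generators(d, N, b, rng):
    """Linearly Dependent (LD) ComRoPE generators: per-axis scalar multiples
    of one shared b x b skew-symmetric block, tiled d/b times along the
    diagonal (illustrative sketch)."""
    P = rng.standard_normal((b, b))
    S = P - P.T                                   # shared skew-symmetric base
    theta = rng.standard_normal(N)                # per-axis trainable scalars
    block = np.kron(np.eye(d // b), S)            # diag(S, ..., S)
    return np.stack([t * block for t in theta])

A = ld_generators(d=8, N=2, b=2, rng=np.random.default_rng(1))
assert np.allclose(A[0] @ A[1], A[1] @ A[0])      # scalar multiples commute
```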

Both parameterizations scale efficiently, use learnable blocks of small size, and require only $O(db)$ (AP) or $O(b^2 + N)$ (LD) free parameters.

3. Efficient Implementation: Givens Rotations, Round-Robin Scheduling, and GPU Utilization

ComRoPE extends the classical FFT-like or round-robin decompositions of orthogonal matrices using angle-parameterized Givens rotations (Mathieu et al., 2014, Hamze, 2021). For head dimension $n=2^L$, a rotation matrix $Q$ is represented as

$$Q \approx Q^{(L)} Q^{(L-1)} \cdots Q^{(1)} = \prod_{s=1}^{L} \prod_{p=1}^{n/2} R_{i_{s,p},\,j_{s,p}}(\theta_{s,p})$$

Here, each $Q^{(s)}$ is a block-sparse matrix applying independent planar (Givens) rotations on disjoint pairs. The index-pair schedule is engineered in a "butterfly" or "round-robin" pattern for maximal parallelism; with $n/2$ rotations per layer and $L=\log_2 n$ layers, the approach yields $O(n\log n)$ total operations for forward and backward passes (Mathieu et al., 2014).

In the round-robin method (Hamze, 2021), all $n(n-1)/2$ Givens rotations are organized into $n-1$ blocks of $n/2$ non-overlapping pairs, admitting $O(n)$ sequential depth in the forward computation and $O(n\log n)$ for backpropagation, ideally suited to GPU architectures.
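The pair schedule can be generated with the classical circle method, as in this illustrative sketch (not the authors' code; a GPU kernel would vectorize each round rather than loop over pairs):

```python
import numpy as np

def round_robin_rounds(n):
    """Circle-method schedule: n-1 rounds of n/2 disjoint index pairs,
    covering each of the n(n-1)/2 pairs exactly once (n even)."""
    idx = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(idx[k], idx[n - 1 - k]) for k in range(n // 2)])
        idx = [idx[0]] + [idx[-1]] + idx[1:-1]   # fix idx[0], rotate the rest
    return rounds

def apply_givens(x, rounds, thetas):
    """Apply one planar (Givens) rotation per scheduled pair; pairs within
    a round are disjoint, so each round could run in parallel."""
    x = x.copy()
    for r, pairs in enumerate(rounds):
        for p, (i, j) in enumerate(pairs):
            c, s = np.cos(thetas[r, p]), np.sin(thetas[r, p])
            x[i], x[j] = c * x[i] - s * x[j], s * x[i] + c * x[j]
    return x

n = 8
rounds = round_robin_rounds(n)
assert len({frozenset(p) for r in rounds for p in r}) == n * (n - 1) // 2
thetas = np.random.default_rng(0).standard_normal((n - 1, n // 2))
y = apply_givens(np.ones(n), rounds, thetas)
assert np.isclose(np.linalg.norm(y), np.sqrt(n))  # orthogonal map preserves norm
```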

Common implementation steps:

  • Store and update only the angle parameters and small per-block intermediates.
  • Forward and backward passes update activations layer/block by layer/block without forming full dense matrices.
  • After each update, re-project $2\times 2$ blocks to $SO(2)$ to ensure orthogonality.
  • Runtime overhead is negligible ($\approx 2n\log n$ FLOPs per vector for $n=64$).

4. Integration into Transformers and Attention Mechanisms

Within transformer-based models, ComRoPE replaces fixed RoPE with dynamically learned, commuting angle-based rotations. For each token (or patch) and each axis, positional coordinates modulate the associated angle matrices:

$$\text{For each position } p:\quad M_p = \sum_{i=1}^N P_{p,i}\, A_i, \qquad R_p = \exp(M_p)$$

The query/key embeddings at position $p$ are updated to $Q_p' = R_p Q_p$, $K_p' = R_p K_p$. Standard attention proceeds using $Q'$, $K'$:

$$\text{Attn}(Q_p, K_{p'}, V_{p'}) = \operatorname{softmax}\bigl( Q_p' (K_{p'}')^\top \bigr) V_{p'}$$

All rotation parameters are shared across the batch and sequence dimensions, with gradients accumulated during backpropagation. Practical instantiations initialize $\theta_{s,p}$ using FFT "twiddle" factors, RoPE sinusoids, or uniform randomization.
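As a toy sketch of this integration (illustrative NumPy/SciPy code, not the paper's implementation; a production version would batch the rotations and avoid a dense `expm` per position):

```python
import numpy as np
from scipy.linalg import expm

def comrope_attention(Q, K, V, coords, A):
    """Single-head attention with ComRoPE rotations.

    Q, K, V: (T, d) token embeddings; coords: (T, N) axis coordinates;
    A: (N, d, d) commuting skew-symmetric generators."""
    T, d = Q.shape
    R = np.stack([expm(np.einsum('i,ijk->jk', c, A)) for c in coords])
    Qr = np.einsum('tij,tj->ti', R, Q)            # Q'_p = R_p Q_p
    Kr = np.einsum('tij,tj->ti', R, K)            # K'_p = R_p K_p
    logits = Qr @ Kr.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
    return w @ V

# Usage with LD-style commuting generators (hypothetical shapes):
T, d, N = 4, 6, 2
rng = np.random.default_rng(0)
P0 = rng.standard_normal((d, d)); S = P0 - P0.T
A = np.stack([0.5 * S, 1.5 * S])                  # scalar multiples commute
coords = rng.standard_normal((T, N))
out = comrope_attention(*rng.standard_normal((3, T, d)), coords, A)
```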

5. Empirical Performance and Robustness

ComRoPE's effectiveness is most evident in settings where positional robustness and extrapolation to out-of-distribution resolutions or input lengths are critical.

Classification and Detection

On ImageNet-1K using ViT-B/16, at $224 \times 224$ input, ComRoPE-LD yields 65.49% top-1 accuracy (+2.4% absolute over LieRE), and at $512 \times 512$ extrapolated resolution, 55.29% (+2.9% absolute) (Yu et al., 4 Jun 2025). Object detection experiments (MS COCO, ViT-S backbone) yield ComRoPE-LD at 44.7 AP, slightly outperforming LieRE at 44.5 AP, using approximately half the additional parameters.

Ablation and Stress Testing

  • ComRoPE variants display invariance to coordinate shifts, while non-commuting formulations (e.g., LieRE) degrade under uniform coordinate perturbations.
  • Optimal block size $b$ balances performance and computational cost, peaking near $b=8$.
  • Robustness to training-time position perturbation is intrinsic in ComRoPE; gains from artificial perturbation are minimal compared to vulnerable baselines (APE +19.5% vs. ComRoPE-LD +2.9%).

6. Generalization Properties and Relation to Prior Work

ComRoPE formally subsumes both absolute and rotary positional encoding as special cases:

  • If all $A_i=0$, the mechanism reduces to absolute position encoding.
  • With $b=2$ and fixed $2\times 2$ skew-symmetric blocks, standard RoPE is recovered (see the sketch below). Thus, ComRoPE constitutes a strict superset of previous rotary-based encodings (Yu et al., 4 Jun 2025).
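To make the second reduction concrete, a brief check (illustrative NumPy/SciPy sketch) that fixing the $2\times 2$ blocks to classical RoPE frequencies recovers the sinusoidal rotations:

```python
import numpy as np
from scipy.linalg import expm

d, p = 8, 5.0
A = np.zeros((d, d))
for j in range(d // 2):
    theta = 1.0 / 10000 ** (2 * j / d)        # fixed classical RoPE frequency
    A[2*j:2*j+2, 2*j:2*j+2] = [[0.0, -theta], [theta, 0.0]]

R = expm(p * A)                                # exp of a 2x2 skew block is a planar rotation
c, s = np.cos(p), np.sin(p)                    # first block has theta_0 = 1
assert np.allclose(R[:2, :2], [[c, -s], [s, c]])
```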

Expressivity is governed by parameter count and block size; larger blocks allow richer, higher-dimensional transformations at increased computational overhead.

7. Future Directions and Open Problems

Key avenues for future research highlighted include:

  • Developing optimized routines for small-matrix exponentiation (exact closed form, lookup tables) to minimize computational cost for block sizes $b \leq 8$ (for $b=2$, the exact closed form is shown after this list).
  • Investigating weaker commutativity constraints that may permit a wider space of trainable angle matrices.
  • Extending the approach to more general structured data (multi-dimensional grids, point clouds, videos) and scaling to LLMs where efficient exponentiation is paramount.
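As a standard matrix identity (not specific to the paper), the $b=2$ case admits an exact closed form that avoids iterative exponentiation entirely:

$$\exp\!\begin{pmatrix} 0 & -\theta \\ \theta & 0 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},$$

with analogous Rodrigues-type formulas available for $b=3$.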

A plausible implication is that such directions may enhance generalization, facilitate efficient fine-tuning for new modalities, and support architectures requiring relative encoding over long spatial-temporal contexts.


Summary Table: ComRoPE Parameterizations

| Variant | Structure | Parameters (total) |
|---------|-----------|--------------------|
| AP | One active axis per block | $d \cdot b$ |
| LD | Shared skew matrix with per-axis scalars | $b^2 + N$ |

ComRoPE establishes a mathematically principled, scalable, and empirically robust foundation for learnable rotary embeddings in transformer models, unifying previous positional encoding strategies and advancing state-of-the-art accuracy and generalization (Yu et al., 4 Jun 2025, Hamze, 2021, Mathieu et al., 2014).
