
Axial Rotary Positional Embeddings

Updated 21 January 2026
  • Axial RoPE is a multi-dimensional extension of rotary positional embeddings that rotates queries and keys, ensuring attention depends solely on relative offsets.
  • It employs block-diagonal orthogonal rotations and independent per-axis transformations to boost expressivity in tasks like NLP, vision, and time-series analysis.
  • The method improves computational efficiency and scalability while offering robustness through learnable variants and group-theoretic generalizations.

Rotary Positional Embeddings (RoPE) constitute a class of norm-preserving multiplicative positional encodings for Transformer architectures. RoPE operates by rotating queries and keys in each attention head within fixed or learned two-dimensional planes, with rotation angle proportional to absolute position and frequency. The rotary mechanism ensures that attention scores depend only on the relative offsets of positions, and enables efficient compatibility with optimized GPU attention kernels. Axial RoPE extends these rotations to multi-dimensional (e.g., 2D for vision, D-dimensional for time-series) or learned subspace settings, increasing positional expressivity while preserving the crucial relative-invariance property. This article details the mathematical formalism of RoPE and axial variants, theoretical properties, implementation strategies, practical impacts in NLP, speech, and vision, recent generalizations, and known limitations.

1. Mathematical Formulation of RoPE and Axial RoPE

RoPE applies a position-dependent block-diagonal orthogonal rotation $R_t$ to a $d$-dimensional query or key $x_t$ at position $t$:

$$R_t = \mathrm{blockdiag}\left( R_{t,1}, R_{t,2}, \ldots, R_{t,d/2} \right)$$

where for each $i$-th block,

$$R_{t,i} = \begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix}$$

and $\theta_i = 10000^{-2(i-1)/d}$ (Su et al., 2021). The application is performed in $O(d)$ time via elementwise products and pairwise swaps.

For self-attention, the query $q_t$ and key $k_u$ are rotated:

$$\widetilde{q}_t = R_t\, q_t, \qquad \widetilde{k}_u = R_u\, k_u$$

yielding an attention score:

$$\mathrm{score}_{t,u} = \widetilde{q}_t^T \widetilde{k}_u = q_t^T R_t^T R_u k_u = q_t^T R_{u-t} k_u$$

which depends only on the relative offset $(u-t)$.
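
The relative-offset identity can be checked numerically; here is a minimal numpy sketch (the helper name `rope_rotate` is illustrative, mirroring the block-diagonal rotation defined above):

```python
import numpy as np

def rope_rotate(x, t, theta):
    # Apply the block-diagonal 2x2 rotations with angles t * theta_i.
    x1, x2 = x[::2], x[1::2]
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    out = np.empty_like(x)
    out[::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

d = 8
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)   # theta_i, zero-indexed
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Position pairs (3, 7) and (8, 12) share the offset u - t = 4,
# so the rotated dot products agree.
s1 = rope_rotate(q, 3, theta) @ rope_rotate(k, 7, theta)
s2 = rope_rotate(q, 8, theta) @ rope_rotate(k, 12, theta)
assert np.isclose(s1, s2)
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the property $R_t^T R_u = R_{u-t}$ expresses.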

Axial RoPE generalizes this to multiple axes (e.g., coordinates $(s_i^{(1)}, \ldots, s_i^{(D)})$ for $D$ axes) by splitting the embedding into $D$ equal slices and applying independent 1D RoPE per axis and slice (Zivanovic et al., 26 May 2025, Heo et al., 2024):

$$\widetilde{q}_i = \left[ R^{\,s_i^{(1)}} q_i^{(1)};\; R^{\,s_i^{(2)}} q_i^{(2)};\; \ldots;\; R^{\,s_i^{(D)}} q_i^{(D)} \right]$$

and dot-products encode multidimensional relative position, e.g., $(s_i - s_j)$ in each axis.
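
The split-and-rotate construction can be sketched in numpy for a single vector (the helper names `rope_rotate` and `axial_rope` are illustrative, not from the cited papers):

```python
import numpy as np

def rope_rotate(x, t, theta):
    # 1D RoPE on the last axis of x at scalar position t.
    x1, x2 = x[..., ::2], x[..., 1::2]
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    out = np.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope(x, coords):
    # Split the embedding into D equal slices, one per axis, and apply
    # an independent 1D RoPE to each slice with that axis's coordinate.
    D = len(coords)
    slices = np.split(x, D)
    d_slice = slices[0].shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d_slice // 2) / d_slice)
    return np.concatenate([rope_rotate(s, c, theta)
                           for s, c in zip(slices, coords)])

# 2D example: a 16-dim vector at image coordinate (row=2, col=5).
x = np.arange(16, dtype=float)
y = axial_rope(x, coords=(2, 5))
assert y.shape == x.shape
```

Because each slice is rotated only by its own axis's coordinate, the dot product of two such vectors depends on the per-axis offsets independently.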

2. Theoretical Properties and Spectral Interpretation

RoPE induces positional encoding as phase shifts in the embedding space; in complex notation, each pair is transformed as $z_k \mapsto e^{i\theta_k t} z_k$ (Ruscio et al., 2024), so attention scores decompose into a Fourier-series expansion:

$$\mathrm{score}_{t,u} \propto \sum_{k=1}^{d/2} \left[ q_t^{(2k-1)} k_u^{(2k-1)} + q_t^{(2k)} k_u^{(2k)} \right] \cos \left( \theta_k (u - t) \right)$$

The frequencies $\theta_k$ control the decay of token interactions over distance and encode various memory scales, analogous to a bank of fixed sinusoidal filters. Nonlinearities in softmax and feed-forward layers generate higher-order harmonics and interference, but no true wavelet basis is induced: RoPE remains a Fourier mechanism (Ruscio et al., 2024).
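
To make the range of memory scales concrete, the wavelengths $2\pi/\theta_k$ of this filter bank can be tabulated for a typical head dimension (a small illustrative computation):

```python
import numpy as np

d = 64
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
wavelengths = 2 * np.pi / theta
# Fastest pair completes a cycle roughly every 6.3 tokens;
# the slowest cycles over tens of thousands of tokens.
print(wavelengths.min(), wavelengths.max())
```

The geometric spacing of $\theta_k$ is what lets a single head mix very short-range and very long-range positional sensitivity.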

RoPE's design ensures that dot-products are exclusively sensitive to relative positional offsets; this eliminates the need to store $O(N^2)$ position-bias matrices and preserves compatibility with flash-attention and kernel-fusion methods on GPU (Zhang et al., 10 Jan 2025).

3. Axial, Learned, and Generalized Rotary Embeddings

Axial RoPE applies independent rotations along each spatial or semantic axis, e.g., time and frequency in audio, height and width in images, or multiple coordinates in time-series (Heo et al., 2024, Zivanovic et al., 26 May 2025). This is achieved by splitting embedding dimensions and applying rotations with axis-specific frequencies.

Group-theoretic generalizations (Multiplicative GRAPE (Zhang et al., 8 Dec 2025), ComRoPE (Yu et al., 4 Jun 2025)) formalize rotary embedding as a one-parameter subgroup action in $\mathrm{SO}(d)$, generated by block-diagonal or learned skew matrices. The “RoPE Equation” $R(x)^\top R(y) = R(y-x)$ is satisfied if and only if the rotation matrices commute pairwise: a necessary and sufficient condition for scalable, offset-consistent rotary parameterizations. ComRoPE introduces trainable commuting angle matrices, allowing the rotary mechanism to adapt its rotational subspaces and frequencies, resulting in improved accuracy and robust coordinate extrapolation in ViTs (e.g., +2.9% on ImageNet-1K at 512² resolution; Yu et al., 4 Jun 2025). Multiplicative GRAPE covers both canonical RoPE and axial (learned-subspace) RoPE by varying the underlying subspace basis $B$ and frequency spectrum per attention head (Zhang et al., 8 Dec 2025).

4. Implementation and Computational Considerations

RoPE is integrated by replacing the positional-bias step in the attention mechanism with the rotary transform (Su et al., 2021, Zhang et al., 10 Jan 2025). For each batch, time, head, and dimension, the transformation is:

```python
def apply_rotary_pos_emb(x, sin, cos):
    # x: [B, T, H, D]; sin, cos broadcastable to x[..., ::2] (one angle per pair)
    x1, x2 = x[..., ::2], x[..., 1::2]            # even/odd components of each pair
    x_rot = torch.stack([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return x_rot.flatten(-2)                      # re-interleave rotated pairs
```
where sin, cos are precomputed tables for all positions and frequencies.
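
The tables are cheap to build once per maximum sequence length; a numpy sketch of the precomputation (the function name `build_rope_cache` is an illustrative choice, not from the cited papers):

```python
import numpy as np

def build_rope_cache(max_pos, d, base=10000.0):
    # Precompute sin/cos for every position t and pair frequency theta_i.
    theta = base ** (-2 * np.arange(d // 2) / d)   # [d/2]
    angles = np.outer(np.arange(max_pos), theta)   # [max_pos, d/2]
    return np.sin(angles), np.cos(angles)

sin, cos = build_rope_cache(max_pos=2048, d=64)
assert sin.shape == cos.shape == (2048, 32)
```

At inference time the row for position t is gathered and broadcast against the query/key pairs, so no per-step trigonometry is needed.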

Axial RoPE requires splitting the model dimension and precomputing cos/sin tables for all positions in each axis; computational overhead is negligible with a vectorized implementation ($\ll 1\%$ of backbone FLOPs in ViT-B; Heo et al., 2024). In speech and time-series, RoPE is applied to frame- or patch-level embeddings with the same block structure.

ComRoPE and GRAPE variants require additional parameters for commuting skew matrices and matrix exponentials, but retain $O(nd)$ cost via block-wise optimization strategies (Yu et al., 4 Jun 2025, Zhang et al., 8 Dec 2025).

5. Empirical Performance and Applications

RoPE demonstrates consistent improvements or parity with existing position embedding schemes in diverse modalities:

  • Automatic Speech Recognition (ASR): Conformer encoder-decoder models with RoPE match or outperform Relative Position Embedding (RelPOS) across LibriSpeech, Libriheavy, and CommonVoice (WER improvements of 0.02–0.25 absolute), with reduced training time (up to 21% faster in GPU-hours) (Zhang et al., 10 Jan 2025).
  • Vision Transformers: Axial RoPE enables precise extrapolation from trained to unseen image resolutions, surpassing absolute and relative position-bias schemes by +0.3–2 pp in classification, detection, and segmentation metrics (Heo et al., 2024). ComRoPE further improves robustness and coordinate-shift invariance (Yu et al., 4 Jun 2025).
  • Irregular Time-Series: Rotary Masked Autoencoders with axial RoPE outperform specialist architectures (e.g., TST, mTAN, S5) in classification and regression on DESC ELAsTiCC, Pendulum, ICU, and synthetic tasks, while maintaining performance on images and audio (Zivanovic et al., 26 May 2025). Learned embeddings (e.g., [CLS]) break strict relative-position invariance.
  • LLMs and Retrieval: Analyses demonstrate that at very long context, high-frequency rotary dimensions are systematically under-utilized, limiting retrieval capacity and suggesting frequency capping or adaptive rotation schemes (Chiang et al., 16 Feb 2025).

6. Limitations, Extensions, and Known Issues

RoPE is subject to several theoretical and empirical constraints:

  • Dimension inefficiency: In long-context LLMs, high-frequency rotary dimensions undergo excessive rotation, leading to “dead” dimensions and wasted head capacity (Chiang et al., 16 Feb 2025).
  • Causal mask distortion: Interaction with the causal mask in decoder architectures induces position-dependent patterns that favor nearby keys and distort RoPE's relative scores into non-relative ones (Kim et al., 25 Sep 2025).
  • Entanglement of content and position: Standard RoPE encodes content (“what”) and position (“where”) jointly; tasks requiring independent matching benefit from decoupled schemes such as PoPE (Gopalakrishnan et al., 5 Sep 2025). PoPE and TAPA replace fixed rotations with content-aware phases or softplus-magnitude embeddings, eliminating distance bias, improving extrapolation, and outperforming RoPE in symbolic-music, genomics, and large-scale language modeling (stable perplexity up to 64K tokens) (Yu et al., 16 Sep 2025).
  • Expressive limitations: Standard RoPE is limited to commuting block-diagonal ($d/2$ planar) subgroups. Extensions via ComRoPE and GRAPE introduce more expressive learned or coupled subspaces at modest extra computational cost.
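
As a toy illustration of the frequency-capping idea mentioned above (an assumed, simplified form; not the exact scheme proposed by Chiang et al.), the fastest-rotating pair frequencies can simply be clamped:

```python
import numpy as np

def capped_theta(d, base=10000.0, cap=0.25):
    # Clamp the per-pair frequencies theta_i at `cap`, slowing the
    # fastest-rotating dimensions so they remain usable at long context.
    theta = base ** (-2 * np.arange(d // 2) / d)
    return np.minimum(theta, cap)

theta = capped_theta(64)
assert theta.max() <= 0.25
```

Only the highest frequencies are affected; the long-wavelength dimensions that carry long-range positional information are left untouched.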

7. Future Directions and Open Problems

Research directions include adaptive frequency allocation for long-context retrieval, decoupling of content and positional phases, and richer learned rotation subspaces beyond commuting block-diagonal families.

In summary, axial rotary positional embeddings represent a mature, theoretically principled, and empirically validated solution for efficient, robust, and scalable position encoding in both language and vision Transformer models. Recent advances in trainable subspaces, content-phase decoupling, and robust extrapolation substantially expand their applicability and resolve known limitations.
