Rotary Time Embeddings (RoTE) Overview

Updated 10 March 2026
  • Rotary Time Embeddings (RoTE) are a family of parameter-free positional encoding schemes that integrate absolute and relative time signals into attention models via geometric rotations.
  • RoTE achieves translation invariance by mapping sequence indices to block-diagonal rotation matrices, ensuring robust handling of shifted, truncated, and irregular sequences.
  • RoTE is efficiently implemented with O(Td) complexity and supports multi-axis generalizations, benefiting applications from ASR to time-series forecasting.

Rotary Time Embeddings (RoTE) are a family of positional encoding schemes that inject time, order, or general continuous position information into attention-based models through parameter-free geometric rotations. By mapping temporal data or sequence indices into block-diagonal rotation matrices, RoTE provides both absolute and relative position signals directly within self-attention dot products, enabling robust, translation-invariant representations for a wide array of sequence modeling tasks.

1. Mathematical Formulation of Rotary Time Embeddings

The core construction of RoTE is a block-diagonal rotation applied to embeddings or projected queries/keys. For a sequence element indexed by time or position $t_i$, the rotary embedding is defined by:

  • Base frequencies:

$$\theta_j = 10000^{-2(j-1)/d}, \quad j = 1, \dots, \tfrac{d}{2}$$

where $d$ is the (even) hidden dimension.

  • Block-diagonal rotation:

$$R(t_i) = \mathrm{blockdiag}\left\{ \begin{bmatrix} \cos(t_i \theta_j) & \sin(t_i \theta_j) \\ -\sin(t_i \theta_j) & \cos(t_i \theta_j) \end{bmatrix} : j = 1, \ldots, \tfrac{d}{2} \right\}$$

  • Application to queries/keys:

$$q_i' = R(t_i)\, q_i, \qquad k_j' = R(t_j)\, k_j$$

with standard attention computed as:

$$\mathrm{score}(i,j) = {q_i'}^\top k_j' = q_i^\top R(t_i)^\top R(t_j)\, k_j = q_i^\top R(t_j - t_i)\, k_j$$

Thus, self-attention is a function purely of the content and the relative displacement $t_j - t_i$ (Gao et al., 2024, Zhang et al., 10 Jan 2025).
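
The block-diagonal matrix is never materialized in practice: pairing adjacent embedding dimensions and applying elementwise cos/sin multiplies is equivalent. Below is a minimal NumPy sketch of this (the pairing convention and function names are illustrative, not taken from any cited implementation):

```python
import numpy as np

def rote_theta(d, base=10000.0):
    """Base frequencies theta_j = base**(-2(j-1)/d) for j = 1..d/2 (zero-based here)."""
    return base ** (-2.0 * np.arange(d // 2) / d)

def rote_rotate(x, t, base=10000.0):
    """Apply R(t) to a vector x of even dimension d via elementwise cos/sin (O(d))."""
    theta = rote_theta(x.shape[-1], base)
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # dimension pairs (2j-1, 2j)
    out = np.empty_like(x)
    out[..., 0::2] = cos * x1 + sin * x2             # rows of [cos  sin; -sin  cos]
    out[..., 1::2] = -sin * x1 + cos * x2
    return out

def rote_matrix(d, t, base=10000.0):
    """Explicit block-diagonal R(t); only used to check the elementwise form."""
    theta = rote_theta(d, base)
    R = np.zeros((d, d))
    for j, th in enumerate(theta):
        c, s = np.cos(t * th), np.sin(t * th)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return R

x = np.random.default_rng(0).normal(size=16)
assert np.allclose(rote_rotate(x, 2.5), rote_matrix(16, 2.5) @ x)
```

The elementwise form is what gives the $O(Td)$ cost discussed in Section 4.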

2. Key Theoretical Properties

Translation Invariance:

The attention kernel depends only on time gaps: $R(t_i + \sigma)^\top R(t_j + \sigma) = R(t_j - t_i)$. Any global shift $\sigma$ applied to all timestamps leaves model outputs strictly unchanged. This property is essential for modeling Hawkes processes, temporal point processes, and any application where only relative timing is semantically meaningful (Gao et al., 2024).
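
Continuing the `rote_rotate` sketch from Section 1, the shift-invariance identity can be checked numerically (the timestamps and shift below are arbitrary):

```python
# A global shift sigma of all timestamps leaves the rotated dot product,
# and hence the attention scores, unchanged: they depend only on t_j - t_i.
rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)
t_i, t_j, sigma = 3.2, 7.9, 100.0

score = rote_rotate(q, t_i) @ rote_rotate(k, t_j)
score_shifted = rote_rotate(q, t_i + sigma) @ rote_rotate(k, t_j + sigma)
assert np.allclose(score, score_shifted)
```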

Relative Position Encoding:

Unlike absolute sinusoidal embeddings, which encode position as a fixed phase and thus entangle the origin, RoTE representations are robust under sequence truncation, extension, or re-anchoring. The attention mechanism's dependence on $t_j - t_i$ ensures consistent behavior on shifted or extrapolated sequences and generalizes to continuous or irregular time (Zivanovic et al., 26 May 2025).

Multi-Dimensional and Fusion Generalizations:

Extensions include multi-axis factorizations that rotate separate embedding subspaces along different axes (e.g., temporal and spatial), fused spatial–temporal rotations over the full embedding, and fusions of index and wall-clock time; these variants are detailed in Section 3.

3. Integration into Model Architectures

RoTE is incorporated directly into the attention mechanism by applying the block-diagonal rotations to the Q/K projections before the dot-product affinity computation. Notable variants and application patterns (a minimal attention sketch follows this list):

  • Temporal Point Processes:

RoTHP uses rotary embedding to enforce translation-invariant self-attention for asynchronous event modeling; the rotation is parameterized by actual event timestamps, yielding attention kernels that match the structure of classical Hawkes model log-likelihoods (Gao et al., 2024).

  • Automatic Speech Recognition (ASR):

In Conformer architectures for ASR, RoTE is used in both streaming and non-streaming scenarios. For streaming, positional offsets are maintained per chunk; for full-context, absolute positions are extended over the sequence (Zhang et al., 10 Jan 2025, Li et al., 2021).

  • Masked and Self-Supervised Learning:

RoMAE applies continuous-time RoPE for arbitrary real-valued and multidimensional positional inputs, supporting multivariate, irregularly sampled time-series, and images/audio via shared architecture (Zivanovic et al., 26 May 2025).

  • Multi-Axis and Joint Rotations:
    • CyRoPE factorizes the embedding space for separate, simultaneous rotations along temporal (linear) and spatial (cylindrical/annular) axes, critical for tasks like sEMG decoding where sensor topology matters (Weng et al., 27 Dec 2025).
    • Spatial–Temporal RoPE fuses spatial and temporal rotations across the full embedding, allowing cross-axis (x, y, t) motion modeling in egocentric vision tasks (Wang et al., 17 Jun 2025).
    • TO-RoPE in generative recommenders combines index and wall-clock time, with instantiations including early fusion, split-by-dimension, and split-by-head (Wei et al., 23 Oct 2025).
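
A minimal single-head self-attention sketch with the rotations applied to Q/K before the dot product (NumPy, reusing `rote_rotate` from the Section 1 sketch; the shapes, names, and per-row loop are illustrative, not taken from any cited model):

```python
import numpy as np

def rote_attention(X, t, Wq, Wk, Wv):
    """Single-head self-attention with RoTE applied to queries and keys.

    X          : (T, d_model) token embeddings
    t          : (T,) positions, either integer indices or real-valued timestamps
    Wq, Wk, Wv : (d_model, d) projections, d even
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Rotate each row by its own position; values are left unrotated.
    Qr = np.stack([rote_rotate(Q[i], t[i]) for i in range(len(t))])
    Kr = np.stack([rote_rotate(K[i], t[i]) for i in range(len(t))])
    scores = Qr @ Kr.T / np.sqrt(Q.shape[-1])            # depends only on t_j - t_i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

# Irregular, real-valued timestamps are handled exactly like integer indices.
rng = np.random.default_rng(2)
T, d_model, d = 5, 32, 16
X = rng.normal(size=(T, d_model))
t = np.array([0.0, 0.7, 1.9, 4.2, 4.3])
Wq, Wk, Wv = (rng.normal(size=(d_model, d)) for _ in range(3))
out = rote_attention(X, t, Wq, Wk, Wv)                   # (T, d)
```

In production code the per-row rotation is vectorized and fused with the Q/K projection kernels, as discussed in Section 4.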

4. Computational Efficiency and Implementation

  • Complexity:

Applying RoTE requires $O(Td)$ operations (two fused elementwise cos/sin multiplies per token and dimension), with no extra parameterization or $O(T^2)$ relative-bias tables as in learned relative embeddings (Zhang et al., 10 Jan 2025). Standard dot-product attention and its usual $O(T^2 d)$ cost for full attention are preserved.

  • Parallelization:

The rotational operation is trivially parallelizable and highly GPU/TPU friendly; it can be fused with the Q/K projection kernels for optimal data locality. RoTE is compatible with optimized attention implementations, facilitating adoption in large-scale models (Zhang et al., 10 Jan 2025, Zivanovic et al., 26 May 2025).

  • Streaming and Truncation:

Offset-based chunking and on-the-fly composition (for streaming ASR and video) are supported naturally, since rotations are parameterized by absolute or relative positions and computation can proceed incrementally without additional state (Zhang et al., 10 Jan 2025); a chunked-streaming sketch appears after this list.

  • Toolkits:

Public implementations include the SpeechBrain toolkit for ASR (linear drop-in for RelPos), and MAE-style pipelines for time-series, vision, and audio with RoPE-based encodings (Zhang et al., 10 Jan 2025, Zivanovic et al., 26 May 2025).
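
As noted under streaming above, only a running frame offset needs to be carried across chunks. The sketch below is a hypothetical illustration of that pattern (NumPy, reusing `rote_rotate`; it is not taken from SpeechBrain or any other cited toolkit):

```python
import numpy as np

def rotate_chunk(Q_chunk, K_chunk, offset):
    """Rotate one streaming chunk of queries/keys using its global frame offset.

    Because attention scores depend only on position differences, rotating by
    (offset + i) within each chunk is consistent with full-sequence processing;
    no cos/sin state from earlier chunks has to be retained.
    """
    T = Q_chunk.shape[0]
    Qr = np.stack([rote_rotate(Q_chunk[i], offset + i) for i in range(T)])
    Kr = np.stack([rote_rotate(K_chunk[i], offset + i) for i in range(T)])
    return Qr, Kr

# Stream a 12-frame sequence in chunks of 4, carrying only the running offset.
rng = np.random.default_rng(3)
Q, K = rng.normal(size=(12, 16)), rng.normal(size=(12, 16))
rotated, offset, chunk = [], 0, 4
for start in range(0, 12, chunk):
    Qc, Kc = rotate_chunk(Q[start:start + chunk], K[start:start + chunk], offset)
    rotated.append(Qc)
    offset += chunk

# The chunked result matches rotating the full sequence in one pass.
Q_full = np.stack([rote_rotate(Q[i], i) for i in range(12)])
assert np.allclose(np.concatenate(rotated), Q_full)
```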

5. Empirical Performance Across Domains

Temporal Point Processes:

RoTHP with RoTE achieves strict translation invariance in generative modeling, outperforming THP and other neural TPP variants in test log-likelihood, RMSE, and convergence speed, with resilience to timestamp shifts and Gaussian noise. Typical improvements include +0.22 in test LL (Synthetic), +2.19 (Retweet), and lower prediction RMSE (Financial: 0.60 vs. 0.93 for THP) (Gao et al., 2024).

Speech Recognition (ASR):

On diverse benchmarks (LibriSpeech, LibriHeavy, CommonVoice, Voxpopuli), RoTE matches or exceeds RelPos in WER (e.g. LibriSpeech test-clean: 2.00 → 1.96), with training speedups of 13–21% and seamless operation in both streaming/non-streaming modes (Zhang et al., 10 Jan 2025, Li et al., 2021).

Time-Series and Multimodal Data:

RoMAE equipped with continuous-time RoPE surpasses or matches specialized methods (e.g. ATAT, mTAN, S5, ContiFormer) on irregular multivariate time-series and handles other modalities, preserving or increasing performance on standard image/audio tasks. On classification and interpolation of sparse or irregular series, RoMAE achieves significant improvements in F1 and RMSE (Zivanovic et al., 26 May 2025).

sEMG and Spatiotemporal Applications:

In SPECTRE, CyRoPE's temporal + cylindrical structure yields $R^2 = 0.7547$ on fine finger decoding (vs. 0.7429 for raw MAE), showing clear superiority over absolute embeddings or 1D RoPE for signal decoding in the presence of sensor topology and movement (Weng et al., 27 Dec 2025). Spatial–Temporal RoPE in EVA02-AT outperforms isolated position encoding schemes by 1.8–3.4% mAP in egocentric video-language tasks, demonstrating robust cross-axis feature modeling (Wang et al., 17 Jun 2025).

Generative Recommendation:

TO-RoPE brings absolute improvements of +0.3–0.6% (HitRate@10) and +0.2–0.4% (NDCG@10) over index-only or time-only RoPE and relative-bias baselines, with architectural flexibility and deployment efficiency in both public (MovieLens-20M) and proprietary datasets (Wei et al., 23 Oct 2025).

6. Limitations, Variants, and Practical Considerations

Limitations:

  • Quadratic scaling in sequence length (for attention) remains for long input series.
  • Each sequence with novel or irregular positions requires recomputation of cos/sin tables.
  • Pure RoTE is strictly relative unless an absolute anchor (e.g. [CLS] token or masked embedding) is provided; introducing such anchors permits recovery of absolute position but breaks strict invariance (Zivanovic et al., 26 May 2025).
  • Extrapolation outside of trained position ranges lacks explicit causal or extrapolative inductive bias.

Variants and Generalizations:

  • Fusion with learned angle gates or frequency scales (TO-RoPE, early-fusion, split-dim/head variants) allows specialization to order, time, or axis-specific semantics (Wei et al., 23 Oct 2025).
  • Axial and joint multi-axis RoPE (EVA02-AT, SPECTRE) enable direct physical modeling (e.g., spatial/temporal, channel/topology) (Wang et al., 17 Jun 2025, Weng et al., 27 Dec 2025).

Implementation Guidelines:

  • Choose even hidden dimensions; divide subspaces for multi-axis applications.
  • Frequencies typically follow the $10000^{-2(j-1)/d}$ schedule for each axis.
  • For practical deployment, precompute cos/sin tables and exploit fused rotation/projection kernels (see the sketch after this list).
  • Toolkit and codebase support are mature for several application domains (SpeechBrain, MAE, Conformer).
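
The sketch below combines the last two guidelines, assuming precomputed cos/sin tables and a two-axis (time, space) split of the hidden dimension; the table layout, axis choice, and function names are illustrative assumptions, not from any cited codebase:

```python
import numpy as np

def make_cos_sin_table(positions, d_axis, base=10000.0):
    """Precompute cos/sin tables for one axis over a grid of positions."""
    theta = base ** (-2.0 * np.arange(d_axis // 2) / d_axis)   # per-axis schedule
    ang = np.asarray(positions, dtype=float)[:, None] * theta[None, :]
    return np.cos(ang), np.sin(ang)                            # each (P, d_axis/2)

def rotate_with_table(x, cos, sin):
    """Apply a precomputed rotation to the paired dimensions of x (..., d_axis)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = cos * x1 + sin * x2
    out[..., 1::2] = -sin * x1 + cos * x2
    return out

# Split an even hidden dimension across two axes: the first half is rotated by
# the time index, the second half by a spatial coordinate (e.g. a channel index).
d_time, d_space = 16, 16
cos_t, sin_t = make_cos_sin_table(np.arange(100), d_time)      # time positions 0..99
cos_s, sin_s = make_cos_sin_table(np.arange(8), d_space)       # spatial positions 0..7

x = np.random.default_rng(4).normal(size=d_time + d_space)
t_idx, s_idx = 42, 3
x_rot = np.concatenate([
    rotate_with_table(x[:d_time], cos_t[t_idx], sin_t[t_idx]),
    rotate_with_table(x[d_time:], cos_s[s_idx], sin_s[s_idx]),
])
```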

7. Impact and Research Significance

RoTE and its multidimensional or fused variants represent a principled, parameter-free approach to encoding time and position in attention-based models. They achieve:

  • Strict translation/shift invariance and direct modeling of relative displacement.
  • Robustness to sequence extension, truncation, and continuous/irregular sampling.
  • Cross-domain generality, supporting audio, vision, time-series, event streams, knowledge graphs, recommendation, and biomedical data.
  • Competitive or state-of-the-art empirical results with minimal architectural/no parameter overhead.

By abstracting time/position information into geometric rotations of embedding spaces, RoTE has become a foundational component for temporal modeling in modern sequence architectures across multiple research communities (Gao et al., 2024, Zhang et al., 10 Jan 2025, Zivanovic et al., 26 May 2025, Weng et al., 27 Dec 2025, Wang et al., 17 Jun 2025, Wei et al., 23 Oct 2025).
