Temporally Shifted RoPE in Transformers

Updated 1 February 2026

Temporally Shifted RoPE (TS-RoPE) is a method that augments traditional rotary position embeddings by encoding both order and wall-clock time as geometric rotations.
It offers instantiations such as early fusion, split-by-dim, and split-by-head, each designed to balance the encoding of sequential and temporal signals.
Empirical results indicate that TS-RoPE improves recommendation accuracy over index-only RoPE, learned absolute embeddings, and relative-bias baselines by encoding temporal dynamics directly in the attention rotations of transformer-based models.

Temporally Shifted RoPE (TS-RoPE), more precisely Time-and-Order RoPE (TO-RoPE), refers to a class of rotary position embedding strategies for generative recommendation models that encode both the discrete sequence index and the continuous event time as geometric rotations. This approach extends vanilla RoPE, which models only token order, by integrating wall-clock time directly into the self-attention mechanism of transformer architectures. The principal goals are to enhance the representation of temporal and sequential information in item interaction sequences and to improve prediction accuracy in generative recommendation tasks (Wei et al., 23 Oct 2025).

1. Formal Definitions and Mathematical Foundation

Let $X \in \mathbb{R}^{T \times d_\text{model}}$ be the sequence of input embeddings for a user with history length $T$. For each attention head $h \in \{1, \ldots, H\}$, standard projections yield

$Q_h = X W^Q_h, \quad K_h = X W^K_h, \quad V_h = X W^V_h,$

where $W^Q_h, W^K_h, W^V_h \in \mathbb{R}^{d_\text{model} \times d}$ and $d$ is the (even) head dimension.

The discrete sequence index is denoted $i \in \{1, \ldots, T\}$ and event time is represented as a normalized timestamp

$\tau_i = \frac{u_i - u_\text{ref}}{s},$

where $u_i$ is the Unix timestamp for event $i$, $u_\text{ref}$ is an arbitrary origin, and $s$ is a scaling factor chosen so that $\tau_i$ and the index $i$ are of commensurate magnitudes.
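The normalization above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the function name, the default origin (the first event), and the default scale (chosen so the last event lands at $\tau = T - 1$, matching a 0-based index range) are assumptions made for the example, as are the timestamps.

```python
def normalize_timestamps(unix_ts, u_ref=None, scale=None):
    """Map Unix timestamps u_i to tau_i = (u_i - u_ref) / s.

    Assumptions (not from the source): u_ref defaults to the first
    timestamp, and s defaults to the value that places the last event
    at tau = T - 1, so tau spans the same range as a 0-based index.
    """
    if not unix_ts:
        return []
    if u_ref is None:
        u_ref = unix_ts[0]
    if scale is None:
        span = unix_ts[-1] - u_ref
        steps = len(unix_ts) - 1
        # Fall back to s = 1 when T == 1 or all events share a timestamp.
        scale = span / steps if span > 0 and steps > 0 else 1.0
    return [(u - u_ref) / scale for u in unix_ts]


# Made-up history: three events one hour apart, then a one-day gap.
history = [1_700_000_000, 1_700_003_600, 1_700_007_200, 1_700_093_600]
tau = normalize_timestamps(history)                  # data-driven scale
hours = normalize_timestamps(history, scale=3600.0)  # fixed unit: hours
```

With the data-driven scale the irregular gaps are preserved but compressed into the index range; with a fixed unit such as hours, $\tau$ keeps a physical meaning across users, at the cost of a range that varies with how long each history is.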

Vanilla RoPE rotates each even/odd channel pair $k \in \{0, \ldots, d/2 - 1\}$ of $Q$ and $K$ using an angular term $\theta^{\text{pos}}_{i,k} = i \omega^p_k$, where the frequencies follow a geometric progression $\omega^p_k = \text{base}^{-2k/d}$ with $\text{base} = 10{,}000$. TO-RoPE introduces an additional time-derived angle, $\theta^{\text{time}}_{i,k} = \tau_i \omega^t_k$, where $\omega^t_k$ follows a separate geometric frequency ladder.

Depending on the variant, each $[q_{i,2k}, q_{i,2k+1}]^\top$ and $[k_{j,2k}, k_{j,2k+1}]^\top$ is rotated by a $2 \times 2$ matrix $R(\theta)$ with $\theta$ dependent on both index and time. This enables the direct geometric encoding of temporal and sequential cues within the self-attention architecture (Wei et al., 23 Oct 2025).
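As a rough NumPy sketch of the building blocks just described, the snippet below constructs a geometric frequency ladder and applies the $2 \times 2$ rotation to each even/odd channel pair. The function names, the random inputs, and the time base of 1,000 are illustrative assumptions; only the index base of 10,000 and the ladder formula come from the text.

```python
import numpy as np


def frequency_ladder(d, base=10_000.0):
    """Return omega_k = base**(-2k/d) for the d/2 planes k = 0..d/2-1."""
    k = np.arange(d // 2)
    return base ** (-2.0 * k / d)


def rotate_pairs(x, theta):
    """Rotate each (even, odd) channel pair of x by the angle theta.

    x:     (T, d) queries or keys for one head
    theta: (T, d/2) rotation angle per position and plane
    """
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


T, d = 6, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((T, d))                 # stand-in for one head's Q

omega_p = frequency_ladder(d)                   # index ladder, base 10,000
omega_t = frequency_ladder(d, base=1_000.0)     # time ladder, hypothetical base
index = np.arange(1, T + 1, dtype=float)        # i = 1..T
theta_pos = index[:, None] * omega_p[None, :]   # theta^pos_{i,k} = i * omega^p_k
q_rot = rotate_pairs(q, theta_pos)              # vanilla (index-only) RoPE
```

Because each pair is multiplied by a rotation matrix, the norm of every row of `q_rot` equals that of `q`; only the phase carries positional information.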

2. Instantiations: Early Fusion, Split-by-Dim, and Split-by-Head

TO-RoPE comprises three primary instantiations:

2.1 Early Fusion

For each rotation plane $k$,

$\theta_{i,k} = i \omega^p_k + \tau_i \omega^t_k,$

so a single fused angle carries both sources. The rotation is applied to each even/odd channel pair of $Q$ and $K$, so the attention dot product for plane $k$ contains the term $\cos\left((i-j)\omega^p_k + (\tau_i - \tau_j) \omega^t_k\right)$. Because the index offset and the time offset are summed inside one phase, their contributions can cancel, and early fusion can therefore experience destructive interference between the index and time signals.
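The relative-offset property of the fused angle can be checked numerically. The sketch below is an illustration under assumed inputs (random $Q$ and $K$, made-up normalized times, a hypothetical time base of 1,000): it shifts every index and every timestamp by a constant and confirms that the rotated attention scores do not change, since only $i - j$ and $\tau_i - \tau_j$ enter the dot product.

```python
import numpy as np


def frequency_ladder(d, base):
    """omega_k = base**(-2k/d) for planes k = 0..d/2-1."""
    return base ** (-2.0 * np.arange(d // 2) / d)


def rotate_pairs(x, theta):
    """Rotate each (even, odd) channel pair of x (T, d) by theta (T, d/2)."""
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def early_fusion_angles(index, tau, omega_p, omega_t):
    """theta_{i,k} = i * omega^p_k + tau_i * omega^t_k, shape (T, d/2)."""
    return index[:, None] * omega_p + tau[:, None] * omega_t


def fused_scores(q, k, index, tau, omega_p, omega_t):
    """Unscaled attention logits after rotating Q and K by the fused angle."""
    theta = early_fusion_angles(index, tau, omega_p, omega_t)
    return rotate_pairs(q, theta) @ rotate_pairs(k, theta).T


T, d = 5, 8
rng = np.random.default_rng(0)
q, k = rng.standard_normal((T, d)), rng.standard_normal((T, d))
omega_p = frequency_ladder(d, base=10_000.0)
omega_t = frequency_ladder(d, base=1_000.0)    # hypothetical time base
index = np.arange(1, T + 1, dtype=float)
tau = np.array([0.0, 0.2, 0.5, 3.0, 3.1])      # made-up normalized times

s_ref = fused_scores(q, k, index, tau, omega_p, omega_t)
s_shift = fused_scores(q, k, index + 100.0, tau + 50.0, omega_p, omega_t)
shift_invariant = np.allclose(s_ref, s_shift)  # scores depend on offsets only
```

The same check would fail if the index or the timestamps were rescaled rather than shifted, which is one way to see why the choice of time scale matters.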

2.2 Split-by-Dimension

Each plane $k$ is gated by $\lambda_k \in \{0,1\}$, yielding:

$\theta_{i,k} = (1 - \lambda_k) i \omega^p_k + \lambda_k \tau_i \omega^t_k.$

Some planes encode only order, others only time. The split ratio $\rho = \frac{\text{\# time planes}}{d/2}$ controls how much of each head's capacity is given to time. All other operations mirror early fusion, but because no plane mixes the two sources, cross-term interference is avoided.
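A minimal sketch of the gating follows. The text does not say which planes are assigned to time, so assigning the last $\rho \cdot d/2$ planes (the lowest-frequency ones) is purely an assumption for illustration, as are the function names and the example values.

```python
import numpy as np


def frequency_ladder(d, base):
    """omega_k = base**(-2k/d) for planes k = 0..d/2-1."""
    return base ** (-2.0 * np.arange(d // 2) / d)


def split_by_dim_angles(index, tau, omega_p, omega_t, rho):
    """theta_{i,k} = (1 - lam_k) * i * omega^p_k + lam_k * tau_i * omega^t_k.

    Assumption: the last round(rho * d/2) planes are the time planes.
    Returns the (T, d/2) angles and the 0/1 gate vector lam.
    """
    n_planes = omega_p.shape[0]
    n_time = int(round(rho * n_planes))
    lam = np.zeros(n_planes)
    if n_time > 0:
        lam[-n_time:] = 1.0
    theta = (1.0 - lam) * index[:, None] * omega_p + lam * tau[:, None] * omega_t
    return theta, lam


T, d = 5, 8
omega_p = frequency_ladder(d, base=10_000.0)
omega_t = frequency_ladder(d, base=1_000.0)    # hypothetical time base
index = np.arange(1, T + 1, dtype=float)
tau = np.array([0.0, 0.2, 0.5, 3.0, 3.1])      # made-up normalized times

theta, lam = split_by_dim_angles(index, tau, omega_p, omega_t, rho=0.5)
```

With `rho=0.5` and four planes, the first two columns of `theta` depend only on the index and the last two only on $\tau$, so the resulting angles can be passed to the same pair rotation used for vanilla RoPE.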

2.3 Split-by-Head

The $H$ attention heads are partitioned so that $H_p$ heads use index angles only, and $H_t$ heads use time angles:

For $h \in H_p$: $\theta^{(h)}_{i,k} = i \omega^p_k$
For $h \in H_t$: $\theta^{(h)}_{i,k} = \tau_i \omega^t_k$

No rotation plane spans both sources within a single head, which blocks interference entirely. The split ratio $s = \frac{\text{\# time heads}}{H}$ serves as the main hyperparameter; it is unrelated to the time scaling factor $s$ used to define $\tau_i$ in Section 1.
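The head partition can be sketched in the same style. Which heads become time heads is not specified in the text; the version below assumes the last $H_t$ heads, and the function name and example values are hypothetical.

```python
import numpy as np


def frequency_ladder(d, base):
    """omega_k = base**(-2k/d) for planes k = 0..d/2-1."""
    return base ** (-2.0 * np.arange(d // 2) / d)


def split_by_head_angles(index, tau, omega_p, omega_t, n_heads, time_ratio):
    """Per-head angles of shape (H, T, d/2).

    Index heads use theta = i * omega^p_k, time heads use
    theta = tau_i * omega^t_k. Assumption: the last
    round(time_ratio * H) heads are the time heads.
    """
    n_time = int(round(time_ratio * n_heads))
    theta_pos = index[:, None] * omega_p           # (T, d/2)
    theta_time = tau[:, None] * omega_t            # (T, d/2)
    is_time_head = np.arange(n_heads) >= n_heads - n_time
    theta = np.where(is_time_head[:, None, None], theta_time, theta_pos)
    return theta, is_time_head


T, d, H = 5, 8, 8
omega_p = frequency_ladder(d, base=10_000.0)
omega_t = frequency_ladder(d, base=1_000.0)        # hypothetical time base
index = np.arange(1, T + 1, dtype=float)
tau = np.array([0.0, 0.2, 0.5, 3.0, 3.1])          # made-up normalized times

theta, is_time_head = split_by_head_angles(
    index, tau, omega_p, omega_t, n_heads=H, time_ratio=0.25
)
```

Here two of the eight heads receive time angles and six receive index angles; each head's slice `theta[h]` is then applied to that head's $Q_h$ and $K_h$ with the usual pair rotation.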

3. Integration into Transformer Architectures

TO-RoPE methods fit naturally into the GPT-2 style decoder-only transformer architecture. The rotations are performed inside the multi-head self-attention module, after projection to $Q$ and $K$ and before computing scaled dot-products. The rest of the architecture is unaffected:

No changes are made to the feed-forward networks or layer normalization.
$V_h$ remains unchanged.
Compatibility with flash-attention and optimized kernels is preserved since only $Q$ and $K$ undergo in-place geometric rotation.
No additional parameter overhead is introduced in the split-by-dim and split-by-head variants beyond the frequency banks for index and time.
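The placement of the rotation can be illustrated with a single-head causal attention function in NumPy. This is a simplified sketch, not the paper's implementation: it omits multi-head batching, dropout, and fused kernels, uses random stand-ins for $Q$, $K$, $V$, and picks early-fusion angles with a hypothetical time base only to have something to rotate by.

```python
import numpy as np


def rotate_pairs(x, theta):
    """Rotate each (even, odd) channel pair of x (T, d) by theta (T, d/2)."""
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def rotary_causal_attention(q, k, v, theta):
    """Single-head causal attention; Q and K are rotated, V is untouched."""
    q_rot, k_rot = rotate_pairs(q, theta), rotate_pairs(k, theta)
    scores = q_rot @ k_rot.T / np.sqrt(q.shape[-1])
    # Mask out future positions (strict upper triangle).
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


T, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

planes = np.arange(d // 2)
omega_p = 10_000.0 ** (-2.0 * planes / d)
omega_t = 1_000.0 ** (-2.0 * planes / d)           # hypothetical time base
index = np.arange(1, T + 1, dtype=float)
tau = np.array([0.0, 0.3, 2.0, 2.1])               # made-up normalized times
theta = index[:, None] * omega_p + tau[:, None] * omega_t  # early fusion

out, weights = rotary_causal_attention(q, k, v, theta)
```

Since the rotation only replaces $Q$ and $K$ with rotated copies of the same shape, the downstream score, softmax, and value-mixing steps are unchanged, which is what keeps the scheme compatible with optimized attention kernels.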

4. Hyperparameters and Implementation Details

Key architectural and training details are as follows:

| Parameter | Typical values / ranges | Comment |
|---|---|---|
| $d_\text{model}$ | $\approx 512$ | Model embedding width |
| $H$ (heads) | $8$ | Each with head dimension $d = 64$ ($d$ even) |
| Layers | $12$ | Decoder-only, as per GPT-2 |
| FFN hidden size | $\approx 2048$ | Per transformer layer |
| Sequence length | $50$ (MovieLens-20M), $1024$ (proprietary) | Adjusted for dataset |
| $\omega^p_k$ | $10{,}000^{-2k/d}$ | Index frequency bank |
| $\omega^t_k$ | $\text{base}_t^{-2k/d}$, $\text{base}_t \in [10^3, 10^5]$ | Time frequency bank; base is tuned |
| Time scale $s$ | Chosen so that $\tau_i$ spans roughly the range of $i$ | E.g., hours or days |
| Split ratio | $\rho$ (by dim), $s$ (by head); sweep $\{0.1, 0.3, 0.5, 0.9\}$ | Best at $0.3$ to $0.5$ |
| Optimization | AdamW, lr $= 1\times10^{-3}$, batch $= 128$ | With dropout $0.2$ |
| Training regime | Multi-epoch (MovieLens), single-pass (proprietary) | Leave-one-out / daily holdout splits |

Appropriate normalization of timestamps and careful selection of the time frequency base $\text{base}_t$ are required to robustly capture the necessary temporal granularity.
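For concreteness, the settings listed in this section could be collected in a small configuration object. The class and field names below are hypothetical, and the defaults for the time base, time unit, variant, and split ratio are picked from within the reported ranges purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TORoPEConfig:
    """Hypothetical container for the hyperparameters in this section."""
    d_model: int = 512
    n_heads: int = 8
    n_layers: int = 12
    ffn_hidden: int = 2048
    max_seq_len: int = 50               # 50 for MovieLens-20M, 1024 proprietary
    index_base: float = 10_000.0        # base of the index frequency ladder
    time_base: float = 10_000.0         # assumed; tuned within [1e3, 1e5]
    time_unit_seconds: float = 3600.0   # scale s; hours assumed, days also cited
    variant: str = "split_by_dim"       # or "early_fusion", "split_by_head"
    time_ratio: float = 0.3             # split ratio; 0.3 to 0.5 reported best
    learning_rate: float = 1e-3
    batch_size: int = 128
    dropout: float = 0.2

    def __post_init__(self):
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if self.head_dim % 2 != 0:
            raise ValueError("head dimension must be even for pair rotation")
        if self.variant not in ("early_fusion", "split_by_dim", "split_by_head"):
            raise ValueError(f"unknown variant: {self.variant}")
        if not 0.0 <= self.time_ratio <= 1.0:
            raise ValueError("time_ratio must lie in [0, 1]")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads


cfg = TORoPEConfig()
long_history_cfg = TORoPEConfig(max_seq_len=1024, variant="split_by_head")
```

The validation mirrors the two structural constraints stated in the text: the head dimension must be even so channels can be paired, and the split ratio is a fraction of planes or heads.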

5. Empirical Results and Comparisons

TO-RoPE methods outperform absolute approaches (learned index/time embeddings) and relative-bias approaches (such as HSTU-style attention biases) on both a public benchmark (MovieLens-20M) and a proprietary dataset. The gains are consistent but modest: relative to the learned absolute baseline, the best TO-RoPE variant raises HR@10 by roughly 1.3% on the proprietary data and 2.1% on MovieLens-20M. Top-10 metrics for each variant are:

| Model variant | HR@10 (proprietary) | NDCG@10 (proprietary) | HR@10 (ML-20M) | NDCG@10 (ML-20M) |
|---|---|---|---|---|
| Learned (APE) | 0.5510 | 0.3818 | 0.3335 | 0.2023 |
| HSTU-style Rel. Bias | 0.5513 | 0.3820 | 0.3341 | 0.2023 |
| Index-only RoPE | 0.5537 | 0.3841 | 0.3347 | 0.2026 |
| Time-only RoPE | 0.5568 | 0.3865 | 0.3341 | 0.2027 |
| Early Fusion TO-RoPE | 0.5562 | 0.3855 | 0.3362 | 0.2037 |
| Split-by-Head TO-RoPE | 0.5582 | 0.3875 | 0.3388 | 0.2048 |
| Split-by-Dim TO-RoPE | 0.5582 | 0.3874 | 0.3406 | 0.2059 |

Performance plateaus for split ratios allocating 30–50% of capacity to time, suggesting robust and stable gains across TO-RoPE instantiations (Wei et al., 23 Oct 2025).
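To put the absolute differences in the table into perspective, the short script below computes relative HR@10 gains of the best variant over two baselines. It uses only the HR@10 values reported above; the helper names are illustrative.

```python
# HR@10 values copied from the results table.
hr_at_10 = {
    "Learned (APE)":         {"proprietary": 0.5510, "ml20m": 0.3335},
    "HSTU-style Rel. Bias":  {"proprietary": 0.5513, "ml20m": 0.3341},
    "Index-only RoPE":       {"proprietary": 0.5537, "ml20m": 0.3347},
    "Time-only RoPE":        {"proprietary": 0.5568, "ml20m": 0.3341},
    "Early Fusion TO-RoPE":  {"proprietary": 0.5562, "ml20m": 0.3362},
    "Split-by-Head TO-RoPE": {"proprietary": 0.5582, "ml20m": 0.3388},
    "Split-by-Dim TO-RoPE":  {"proprietary": 0.5582, "ml20m": 0.3406},
}


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


def best_gain_over(baseline, dataset):
    """Gain of the highest-HR@10 variant over a named baseline."""
    best = max(hr_at_10, key=lambda name: hr_at_10[name][dataset])
    gain = relative_gain(hr_at_10[best][dataset], hr_at_10[baseline][dataset])
    return best, gain


for dataset in ("proprietary", "ml20m"):
    for baseline in ("Learned (APE)", "Index-only RoPE"):
        best, gain = best_gain_over(baseline, dataset)
        print(f"{dataset:12s} {best} vs {baseline}: +{gain:.2f}%")
```

The relative gains over index-only RoPE are smaller than those over learned absolute embeddings, which indicates that part of the improvement comes from the rotary formulation itself and part from adding time.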

6. Practical Considerations, Limitations, and Implications

TO-RoPE maintains the geometric foundation of rotary position embeddings without resorting to additive biases or explicit feature concatenation. Insertion into existing transformer architectures requires only minimal code changes, and compatibility with efficient attention kernels is preserved.

Notable advantages:

Simultaneous encoding of burstiness (via high-frequency planes), long-range temporal recency (low-frequency planes), and periodicity (via time frequency ladders).
Explicit interpretability of capacity allocation through split ratios on planes or heads.
Robustness to hyperparameter choices around the recommended capacity allocation range.
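The division of labour between high- and low-frequency planes can be made concrete by computing each plane's wavelength $2\pi / \omega^t_k$, the span of $\tau$ over which that plane completes one full rotation. The sketch below assumes $d = 64$, a time base of $10^4$ (one value inside the tuned range), and $\tau$ measured in hours; all three are illustrative choices.

```python
import math


def plane_wavelengths(d, base):
    """Wavelength 2*pi / omega_k = 2*pi * base**(2k/d) for each plane k."""
    return [2.0 * math.pi * base ** (2.0 * k / d) for k in range(d // 2)]


d, base_t = 64, 10_000.0                 # assumed head dim and time base
wavelengths_hours = plane_wavelengths(d, base_t)

shortest_hours = wavelengths_hours[0]    # fastest plane, k = 0
longest_hours = wavelengths_hours[-1]    # slowest plane, k = d/2 - 1
longest_days = longest_hours / 24.0
```

Under these assumptions the fastest plane completes a rotation in about 6.3 hours, fine enough to separate events within a burst, while the slowest takes roughly 47,000 hours (a little over five years), so it varies almost monotonically over a typical history and can act as a recency signal.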

Key limitations and caveats:

Early fusion may induce interference between index and time encoding in shared planes; split variants are preferred for stability.
Time normalization and appropriate selection of frequency ladders for time are critical for performance.
Choice of dimension/head split ratio is an additional hyperparameter, although default values perform robustly.

A plausible implication is that rotary embedding schemes generalize effectively to domains where both order and timestamp information are central, provided careful partitioning is applied to isolate signal sources.

7. Context and Significance Within Generative Recommendation

TO-RoPE positions rotary position embeddings as a principled and deployment-ready approach for generative recommendation systems, especially in tasks that require modeling both temporal and sequential signals within user behavior data. The absence of additional learned parameters (in the split variants), combined with the observed empirical gains, distinguishes TO-RoPE from prior work based on absolute or relative position/time embeddings. The methodology detailed by Wei et al. (23 Oct 2025) provides a concrete template for future advances in temporal encoding for transformer-based recommendation and potentially other sequence modeling domains.


References (1)

Wei et al. (2025). Rotate Both Ways: Time-and-Order RoPE for Generative Recommendation.
