
Temporally Shifted RoPE in Transformers

Updated 1 February 2026
  • Temporally Shifted RoPE (TS-RoPE) is a method that augments traditional rotary position embeddings by encoding both order and wall-clock time as geometric rotations.
  • It has three instantiations (early fusion, split-by-dim, and split-by-head), each balancing the encoding of sequential and temporal signals differently.
  • In the reported experiments, TS-RoPE improves top-10 recommendation accuracy (HR@10, NDCG@10) over learned absolute embeddings and relative-bias baselines.

Temporally Shifted RoPE (TS-RoPE), more precisely Time-and-Order RoPE (TO-RoPE), refers to a class of rotary position embedding strategies for generative recommendation models that encode both the discrete sequence index and the continuous event time as geometric rotations. This approach extends vanilla RoPE, which encodes only token order, by integrating wall-clock time directly into the self-attention mechanism of transformer architectures. The principal goals are to enhance the representation of temporal and sequential information in item interaction sequences and to improve prediction accuracy in generative recommendation tasks (Wei et al., 23 Oct 2025).

1. Formal Definitions and Mathematical Foundation

Let $X \in \mathbb{R}^{T \times d_\text{model}}$ be the sequence of input embeddings for a user with history length $T$. For each attention head $h \in \{1, \ldots, H\}$, standard projections yield

$$Q_h = X W^Q_h, \quad K_h = X W^K_h, \quad V_h = X W^V_h,$$

where $W^Q_h, W^K_h \in \mathbb{R}^{d_\text{model} \times d}$ with $d$ (even) being the head dimension.

The discrete sequence index is denoted $i \in \{1, \ldots, T\}$ and event time is represented as a normalized timestamp

$$\tau_i = \frac{u_i - u_\text{ref}}{s},$$

where $u_i$ is the Unix timestamp for event $i$, $u_\text{ref}$ is an arbitrary origin, and $s$ is a scaling factor such that $\tau_i$ and the index $i$ are of commensurate magnitudes.
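As a minimal sketch of the normalization above: the helper name `normalize_timestamps`, the choice of the first event as the origin, and the one-hour scale are illustrative assumptions, not details taken from the source.

```python
def normalize_timestamps(unix_ts, u_ref=None, scale_seconds=3600.0):
    """Return tau_i = (u_i - u_ref) / s for each Unix timestamp u_i."""
    if u_ref is None:
        u_ref = unix_ts[0]  # assumption: use the first event as the origin
    return [(u - u_ref) / scale_seconds for u in unix_ts]

# Four events: 0 h, 1 h, 1.5 h and 26 h after the first one.
events = [1_700_000_000, 1_700_003_600, 1_700_005_400, 1_700_093_600]
tau = normalize_timestamps(events)  # hours since the first event
print(tau)  # [0.0, 1.0, 1.5, 26.0]
```

With an hourly scale, densely spaced events receive nearby values of $\tau_i$ while a one-day gap appears as a jump of roughly 24, which is the information the index $i$ alone cannot carry.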

Vanilla RoPE rotates each even/odd channel pair $k$ of $Q$ and $K$ using an angular term $\theta^{\text{pos}}_{i,k} = i \omega^p_k$, where frequencies follow a geometric progression $\omega^p_k = \text{base}^{-2k/d},\; \text{base}=10{,}000$. TO-RoPE introduces an additional time-derived angle, $\theta^{\text{time}}_{i,k} = \tau_i \omega^t_k$, where $\omega^t_k$ comprises a separate geometric frequency ladder.

Depending on the variant, each $[q_{i,2k}, q_{i,2k+1}]^\top$ and $[k_{j,2k}, k_{j,2k+1}]^\top$ is rotated by a $2 \times 2$ matrix $R(\theta)$ with $\theta$ dependent on index, time, or both. This enables the direct geometric encoding of temporal and sequential cues within the self-attention architecture (Wei et al., 23 Oct 2025).
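The two frequency ladders and the per-pair rotation can be sketched in a few lines of plain Python. The function names, the toy dimension $d = 8$, and the time base of 1,000 are assumptions made for illustration; the plane index $k$ is taken to start at 0.

```python
import math

def frequency_ladder(d, base=10_000.0):
    """Return omega_k = base ** (-2k / d) for k = 0 .. d/2 - 1."""
    return [base ** (-2.0 * k / d) for k in range(d // 2)]

def rotate_pairs(vec, angles):
    """Apply the 2x2 rotation R(theta_k) to each (even, odd) channel pair."""
    out = []
    for k, theta in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * x - s * y, s * x + c * y])
    return out

d = 8                                        # toy head dimension (must be even)
omega_p = frequency_ladder(d)                # index ladder, base 10,000
omega_t = frequency_ladder(d, base=1_000.0)  # time ladder; this base is an assumed value

i, tau_i = 3, 1.5
theta_pos = [i * w for w in omega_p]         # theta^pos_{i,k} = i * omega^p_k
theta_time = [tau_i * w for w in omega_t]    # theta^time_{i,k} = tau_i * omega^t_k

q = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 1.0]
q_rot = rotate_pairs(q, theta_pos)

# Rotations change direction only, so the vector norm is preserved.
norm = lambda v: math.sqrt(sum(x * x for x in v))
print(norm(q), norm(q_rot))
```

Because each plane is only rotated, position and time information enters through the relative phase between queries and keys rather than through the magnitude of the vectors.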

2. Instantiations: Early Fusion, Split-by-Dim, and Split-by-Head

TO-RoPE comprises three primary instantiations:

2.1 Early Fusion

For each rotation plane $k$,

$$\theta_{i,k} = i \omega^p_k + \tau_i \omega^t_k,$$

so both sources contribute to a single angle in every plane. The rotation is applied to each even/odd channel pair, so the attention dot product contains the term $\cos\left((i-j)\omega^p_k + (\tau_i - \tau_j) \omega^t_k\right)$. Early fusion can experience destructive interference between the $\sin$/$\cos$ terms from index and time signals.
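The cosine term quoted above follows from the rotation identity $R(a)^\top R(b) = R(b - a)$. A short derivation for a single plane, writing $(a_0, a_1)$ for the query pair, $(b_0, b_1)$ for the key pair, and $\Delta_k = \theta_{i,k} - \theta_{j,k}$ (shorthand introduced here, not taken from the source):

```latex
\begin{aligned}
\big(R(\theta_{i,k})\,[a_0, a_1]^\top\big)^\top \big(R(\theta_{j,k})\,[b_0, b_1]^\top\big)
  &= [a_0, a_1]\; R(\theta_{j,k} - \theta_{i,k})\; [b_0, b_1]^\top \\
  &= (a_0 b_0 + a_1 b_1)\cos\Delta_k + (a_0 b_1 - a_1 b_0)\sin\Delta_k, \\
\Delta_k &= (i - j)\,\omega^p_k + (\tau_i - \tau_j)\,\omega^t_k .
\end{aligned}
```

Since the index offset and the time offset are summed inside one phase, different combinations of $(i - j)$ and $(\tau_i - \tau_j)$ can produce the same $\Delta_k$ modulo $2\pi$. This is one way to read the interference noted for early fusion: a single plane cannot tell the two sources apart.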

2.2 Split-by-Dimension

Each plane $k$ is gated by $\lambda_k \in \{0,1\}$, yielding:

$$\theta_{i,k} = (1 - \lambda_k)\, i \omega^p_k + \lambda_k \tau_i \omega^t_k.$$

Some planes solely encode order, others only time. The split ratio $\rho = \frac{\text{\# time planes}}{d/2}$ adjusts how much capacity is given to time. All other operations mirror early fusion, but cross-term interference is avoided.
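A minimal sketch of the gated angle computation follows. The source does not say which planes are assigned to time, so placing the time planes at the end of the ladder is an assumption, as are the function names and the toy values.

```python
def frequency_ladder(d, base):
    """Return omega_k = base ** (-2k / d) for k = 0 .. d/2 - 1."""
    return [base ** (-2.0 * k / d) for k in range(d // 2)]

def split_by_dim_angles(i, tau, omega_p, omega_t, gates):
    """theta_k = (1 - lambda_k) * i * omega_p[k] + lambda_k * tau * omega_t[k]."""
    return [(1 - lam) * i * wp + lam * tau * wt
            for lam, wp, wt in zip(gates, omega_p, omega_t)]

d = 8                                   # toy head dimension
n_planes = d // 2
rho = 0.5                               # fraction of planes given to time
n_time = round(rho * n_planes)
# Assumption: the last n_time planes carry time, the rest carry order.
gates = [0] * (n_planes - n_time) + [1] * n_time

omega_p = frequency_ladder(d, 10_000.0)
omega_t = frequency_ladder(d, 1_000.0)  # assumed time base
angles = split_by_dim_angles(3, 1.5, omega_p, omega_t, gates)
print(gates, angles)
```

Planes with $\lambda_k = 0$ ignore $\tau_i$ entirely and planes with $\lambda_k = 1$ ignore $i$, so no plane mixes the two phases.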

2.3 Split-by-Head

The $H$ attention heads are partitioned so that the heads in $H_p$ use index angles only, and the heads in $H_t$ use time angles only:

  • For $h \in H_p$: $\theta^{(h)}_{i,k} = i \omega^p_k$
  • For $h \in H_t$: $\theta^{(h)}_{i,k} = \tau_i \omega^t_k$

No rotation plane spans both sources within a single head, entirely blocking interference. The split ratio $s = \frac{\text{\# time heads}}{H}$ serves as the main hyperparameter (this $s$ is distinct from the time scale $s$ of Section 1).
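The head partition can be sketched as follows. Heads are 0-indexed in the code, and assigning the last heads to time is an assumption, since the source does not specify which heads form $H_t$; `ratio` stands in for the head split ratio.

```python
def split_by_head_angles(h, i, tau, omega_p, omega_t, time_heads):
    """Index angles for heads in H_p, time angles for heads in H_t."""
    ladder, coord = (omega_t, tau) if h in time_heads else (omega_p, i)
    return [coord * w for w in ladder]

H, d = 8, 8                                   # toy sizes
ratio = 0.25                                  # fraction of heads given to time
n_time_heads = round(ratio * H)
# Assumption: the last heads are the time heads (H_t); the rest form H_p.
time_heads = set(range(H - n_time_heads, H))

omega_p = [10_000.0 ** (-2.0 * k / d) for k in range(d // 2)]
omega_t = [1_000.0 ** (-2.0 * k / d) for k in range(d // 2)]  # assumed time base

per_head = [split_by_head_angles(h, 3, 1.5, omega_p, omega_t, time_heads)
            for h in range(H)]
print(sorted(time_heads), per_head[0], per_head[-1])
```

Each head then applies the usual pairwise rotation with its own angle list, so the attention kernel itself is unchanged.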

3. Integration into Transformer Architectures

TO-RoPE methods fit naturally into the GPT-2 style decoder-only transformer architecture. The embedding rotations are performed inside the multi-head self-attention module after projection to $Q$ and $K$, and prior to computing scaled dot-products:

  • No changes to the feed-forward networks or layer normalization.
  • $V_h$ remains unchanged.
  • Compatibility with flash-attention and optimized kernels is preserved since only $Q$ and $K$ undergo in-place geometric rotation.
  • No additional parameter overhead is introduced in the split-by-dim and split-by-head variants beyond the frequency banks for index and time.
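The points above can be illustrated with a toy single-head causal attention in plain Python, where the rotation is applied to queries and keys just before the scaled dot product and the values pass through untouched. This is a sketch under assumed toy inputs and frequencies, using the early-fusion angle; it is not the authors' implementation and it does not rotate in place.

```python
import math

def rotate_pairs(vec, angles):
    """Apply R(theta_k) to each (even, odd) channel pair."""
    out = []
    for k, theta in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * x - s * y, s * x + c * y])
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V, angles):
    """Single-head causal attention; angles[i][k] is the angle of plane k at step i."""
    d = len(Q[0])
    Qr = [rotate_pairs(q, a) for q, a in zip(Q, angles)]   # rotate queries
    Kr = [rotate_pairs(k, a) for k, a in zip(K, angles)]   # rotate keys
    out = []
    for i, q in enumerate(Qr):
        scores = [sum(x * y for x, y in zip(q, Kr[j])) / math.sqrt(d)
                  for j in range(i + 1)]                    # causal: keys j <= i only
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1))  # V is used unrotated
                    for c in range(len(V[0]))])
    return out

# Toy inputs: T = 3 steps, d = 4 (two rotation planes).
Q = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.0, 1.0]]
K = [[0.5, 1.0, 0.0, 1.0], [1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0]]
V = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
omega_p, omega_t = [1.0, 0.01], [1.0, 0.1]     # assumed toy ladders
index, tau = [1, 2, 3], [0.0, 1.0, 1.5]
angles = [[i * wp + t * wt for wp, wt in zip(omega_p, omega_t)]  # early-fusion angle
          for i, t in zip(index, tau)]
out = causal_attention(Q, K, V, angles)
print(out)
```

Adding the same constant to every step's angle in a given plane leaves `out` unchanged up to rounding, which reflects that only angle differences between query and key positions reach the attention scores.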

4. Hyperparameters and Implementation Details

Key architectural and training details are as follows:

| Parameter | Typical values / ranges | Comment |
|---|---|---|
| $d_\text{model}$ | $\approx 512$ | Model embedding width |
| $H$ (heads) | $8$ | Each with $d = 64$, $d$ even |
| Layers | $12$ | Decoder-only, as per GPT-2 |
| FFN hidden size | $\approx 2048$ | Per transformer layer |
| Sequence length | $50$ (MovieLens-20M), $1024$ (proprietary) | Adjusted for dataset |
| $\omega^p_k$ | $10{,}000^{-2k/d}$ | Index frequency banks |
| $\omega^t_k$ | $\text{base}_t^{-2k/d}$, $\text{base}_t \in [10^3, 10^5]$ | Time frequency banks; tuned base |
| Time scale $s$ | Chosen so that $\tau_i$ and $i$ have commensurate ranges | E.g., hours or days |
| Split ratio | $\rho$ (by dim), $s$ (by head); sweep $\{0.1, 0.3, 0.5, 0.9\}$ | Best at $0.3$–$0.5$ |
| Optimization | AdamW, lr $= 1\times10^{-3}$, batch $= 128$ | With dropout $0.2$ |
| Training regime | Multi-epoch (MovieLens), single-pass (proprietary) | Leave-one-out/daily holdout splits |

Appropriate normalization of timestamps and careful selection of the time frequency base $\text{base}_t$ are required to robustly capture the necessary temporal granularity.
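One way to reason about temporal granularity is to look at the period of each time plane, $2\pi / \omega^t_k$, converted to real time units. The sketch below does this for the head dimension from the table and three candidate bases; the helper name `wavelengths` and the assumption that $\tau$ is measured in hours are illustrative choices.

```python
import math

def wavelengths(d, base_t, unit_per_tau=1.0):
    """Period of each time plane, 2*pi / omega_k, expressed in real time units."""
    omegas = [base_t ** (-2.0 * k / d) for k in range(d // 2)]
    return [2.0 * math.pi / w * unit_per_tau for w in omegas]

d = 64                                   # head dimension from the table above
for base_t in (1e3, 1e4, 1e5):
    wl = wavelengths(d, base_t)          # assumes tau is measured in hours
    print(f"base_t={base_t:.0e}: shortest {wl[0]:.2f} h, "
          f"longest {wl[-1] / 24:.0f} days")
```

The shortest period is always $2\pi$ units of $\tau$, so the time scale $s$ fixes the finest resolvable gap, while $\text{base}_t$ controls how far the slowest plane reaches before its phase wraps around.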

5. Empirical Results and Comparisons

TO-RoPE variants outperform both absolute encodings (learned index/time embeddings) and relative-bias approaches (such as HSTU-style attention biases) on a public benchmark (MovieLens-20M) and on a proprietary dataset ("Prop." in the table). Representative top-10 metrics:

| Model Variant | HR@10 (Prop.) | NDCG@10 (Prop.) | HR@10 (ML-20M) | NDCG@10 (ML-20M) |
|---|---|---|---|---|
| Learned (APE) | 0.5510 | 0.3818 | 0.3335 | 0.2023 |
| HSTU-style Rel. Bias | 0.5513 | 0.3820 | 0.3341 | 0.2023 |
| Index-only RoPE | 0.5537 | 0.3841 | 0.3347 | 0.2026 |
| Time-only RoPE | 0.5568 | 0.3865 | 0.3341 | 0.2027 |
| Early Fusion TO-RoPE | 0.5562 | 0.3855 | 0.3362 | 0.2037 |
| Split-by-Head TO-RoPE | 0.5582 | 0.3875 | 0.3388 | 0.2048 |
| Split-by-Dim TO-RoPE | 0.5582 | 0.3874 | 0.3406 | 0.2059 |

Performance plateaus for split ratios allocating 30–50% of capacity to time, suggesting robust and stable gains across TO-RoPE instantiations (Wei et al., 23 Oct 2025).
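To put the size of these gains in perspective, the relative lift over index-only RoPE can be computed directly from the table. The snippet below only re-expresses the reported numbers; the dictionary layout and the helper `relative_lift` are illustrative.

```python
# HR@10 / NDCG@10 values copied from the results table above.
results = {
    "Index-only RoPE":       {"hr_prop": 0.5537, "ndcg_prop": 0.3841,
                              "hr_ml": 0.3347, "ndcg_ml": 0.2026},
    "Split-by-Head TO-RoPE": {"hr_prop": 0.5582, "ndcg_prop": 0.3875,
                              "hr_ml": 0.3388, "ndcg_ml": 0.2048},
    "Split-by-Dim TO-RoPE":  {"hr_prop": 0.5582, "ndcg_prop": 0.3874,
                              "hr_ml": 0.3406, "ndcg_ml": 0.2059},
}

def relative_lift(model, baseline, metric):
    """Percentage improvement of `model` over `baseline` on `metric`."""
    new, old = results[model][metric], results[baseline][metric]
    return 100.0 * (new - old) / old

for model in ("Split-by-Head TO-RoPE", "Split-by-Dim TO-RoPE"):
    for metric in ("hr_prop", "hr_ml"):
        lift = relative_lift(model, "Index-only RoPE", metric)
        print(f"{model:24s} {metric:8s} {lift:+.2f}%")
```

On MovieLens-20M, for example, split-by-dim improves HR@10 over index-only RoPE by roughly 1.8% in relative terms, so the gains are consistent but modest in absolute size.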

6. Practical Considerations, Limitations, and Implications

TO-RoPE maintains the geometric foundation of rotary position embeddings without resorting to additive biases or explicit feature concatenation. Insertion into existing transformer architectures requires only minimal code changes, and compatibility with efficient attention kernels is preserved.

Notable advantages:

  • Simultaneous encoding of burstiness (via high-frequency planes), long-range temporal recency (low-frequency planes), and periodicity (via time frequency ladders).
  • Explicit interpretability of capacity allocation through split ratios on planes or heads.
  • Robustness to hyperparameter choices around the recommended capacity allocation range.

Key limitations and caveats:

  • Early fusion may induce interference between index and time encoding in shared planes; split variants are preferred for stability.
  • Time normalization and appropriate selection of frequency ladders for time are critical for performance.
  • Choice of dimension/head split ratio is an additional hyperparameter, although default values perform robustly.

A plausible implication is that rotary embedding schemes generalize effectively to domains where both order and timestamp information are central, provided careful partitioning is applied to isolate signal sources.

7. Context and Significance Within Generative Recommendation

TO-RoPE positions rotary position embeddings as a principled and deployment-ready approach for generative recommendation systems, especially in tasks that require modeling both temporal and sequential signals in user behavior data. The absence of additional learned parameters (in split variants), combined with the observed empirical gains, distinguishes TO-RoPE from prior work based on absolute or relative position/time embeddings. The methodology detailed in Wei et al. (23 Oct 2025) provides a concrete template for future advances in temporal encoding for transformer-based recommendation and potentially other sequence modeling domains.
