TM-RoPE: Time-Aligned Rotary Position Embedding
- The paper’s main contribution is introducing a continuous-time rotary encoding scheme that synchronizes heterogeneous audio-visual streams by adjusting angular frequencies.
- It employs modality-specific angular frequencies to implicitly align tokens from different frame rates, and it outperforms sinusoidal and standard RoPE position embeddings in the reported ablation study.
- The approach eliminates the need for explicit interpolation or temporal resampling, offering a computationally efficient solution for multimodal sequence modeling.
Time-aligned Multimodal Rotary Position Embedding (TM-RoPE), also termed Temporally-aligned RoPE (TaRoPE), is a position encoding scheme developed to address the frame-rate mismatch problem in multimodal self-attention models, particularly in audio-visual sequence learning. TM-RoPE generalizes Rotary Position Embedding (RoPE) by injecting continuous-time positional information through modality-specific angular frequencies, ensuring implicit synchronization of token representations drawn from streams with heterogeneous sampling rates. This approach enables direct cross-modal interaction based on real-time alignment, without requiring explicit interpolation or temporal realignment operations (Koo et al., 11 Mar 2026).
1. Motivation and High-Level Objective
Conventional audio-visual models often process utterance-level features or, at best, use position-encoded frame-level features without accounting for the fact that audio and video streams typically operate at different frame rates (e.g., 50 Hz for audio, 30 Hz for video). Standard self-attention schemes with position embeddings—such as RoPE—infer relative temporal relationships based on discrete token indices. When modalities are sampled at mismatched rates, tokens that correspond to the same real-world timestamp have different indices, leading to misaligned relative positions across modalities.
TaRoPE introduces a continuous-time formulation for rotary position embeddings. Each modality's rotation frequency is scaled according to its sampling rate, so that the rotation angle of a token depends on its timestamp rather than its index. Tokens that share the same physical timestamp in their original streams are therefore mapped to identical phases in the rotary embedding space. This synchronization allows the self-attention module to focus on actual temporal alignment, improving the modeling of intra- and inter-modal dependencies in the presence of heterogeneous rates (Koo et al., 11 Mar 2026).
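To make the index mismatch concrete, the following sketch lists the index pairs that fall on the same timestamp within the first second, assuming the 50 Hz audio and 30 Hz video rates quoted above. The helper name `coincident_pairs` is illustrative and not taken from the paper.

```python
from fractions import Fraction

def coincident_pairs(rate_a, rate_v, duration_s=1):
    """Return (audio_index, video_index, time) triples whose timestamps coincide.

    Timestamps are index / rate; exact fractions avoid float comparison issues.
    """
    audio_times = {Fraction(n, rate_a): n for n in range(rate_a * duration_s)}
    pairs = []
    for m in range(rate_v * duration_s):
        t = Fraction(m, rate_v)
        if t in audio_times:
            pairs.append((audio_times[t], m, t))
    return pairs

pairs = coincident_pairs(50, 30)
for n, m, t in pairs[:4]:
    print(f"t = {float(t):.1f} s  audio index {n}  video index {m}")
```

Apart from the first token, coincident audio and video tokens never share an index, so an index-based relative position treats simultaneous tokens as if they were offset.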
2. Mathematical Formulation
TM-RoPE adapts the rotary position encoding paradigm to continuous time by assigning modality-specific angular velocities:
Let $d$ denote the model embedding dimension, $H$ the number of attention heads, and $d_h = d/H$ the dimension per head. Each attention head vector is rearranged into 2D pairs (indices $2i, 2i+1$).
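Standard RoPE for a unified stream with base frequency $\omega$ rotates the $i$-th 2D pair of the token at index $n$ by the angle $n\,\omega_i$:
$$R(n\,\omega_i) = \begin{pmatrix} \cos(n\,\omega_i) & -\sin(n\,\omega_i) \\ \sin(n\,\omega_i) & \cos(n\,\omega_i) \end{pmatrix},$$
applied to query and key sub-vectors as: $\tilde{q}_n^{(i)} = R(n\,\omega_i)\,q_n^{(i)}$ and $\tilde{k}_n^{(i)} = R(n\,\omega_i)\,k_n^{(i)}$. For a single base frequency, $\omega_i = \omega$ for all $i$.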
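The relative-position property of standard RoPE can be checked numerically. The sketch below is a minimal pure-Python illustration, not the paper's implementation: it treats each 2D pair as a complex number, and the function names, toy vectors, and the value of `omega` are all assumptions chosen for the example.

```python
import cmath

def rope_rotate(vec, position, omega):
    """Rotate each 2D pair (vec[2i], vec[2i+1]) by the angle position * omega.

    A pair (x, y) is treated as the complex number x + iy, so the rotation
    is a multiplication by exp(i * position * omega).
    """
    phase = cmath.exp(1j * position * omega)
    out = []
    for i in range(0, len(vec), 2):
        z = complex(vec[i], vec[i + 1]) * phase
        out.extend([z.real, z.imag])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

omega = 0.1                      # single frequency shared by all pairs (illustrative value)
q = [1.0, 0.5, -0.3, 2.0]        # toy query with two 2D pairs
k = [0.2, -1.0, 0.7, 0.4]        # toy key with two 2D pairs

# Same index offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5, omega), rope_rotate(k, 2, omega))
score_far = dot(rope_rotate(q, 105, omega), rope_rotate(k, 102, omega))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree because the rotated dot product depends only on the index offset, which is exactly what becomes a problem when the offset no longer reflects elapsed time.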
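TM-RoPE for multimodal streams with rates $r_a$ (audio) and $r_v$ (video):
Set the per-token angular frequencies $\omega_a$ and $\omega_v$ so that
$$r_a\,\omega_a = r_v\,\omega_v,$$
with audio indices $n$ and video indices $m$. The rotary embeddings are:
$$\tilde{q}^{\,a}_n = R(n\,\omega_a)\,q^{a}_n, \qquad \tilde{q}^{\,v}_m = R(m\,\omega_v)\,q^{v}_m,$$
where $R(\theta)$ is the 2D rotation by angle $\theta$ applied to each pair.
Analogously for keys.
When $n/r_a = m/r_v$ (tokens represent the same real time), then $n\,\omega_a = m\,\omega_v$: real-time points in both modalities align in the rotary embedding space.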
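A minimal numeric check of this alignment property follows. It assumes the 50 Hz and 30 Hz rates quoted in the text and one possible parametrization, in which each stream's per-token step is a shared angular velocity divided by its frame rate. The names `per_token_step` and `OMEGA_TIME`, and the value 2.0 rad/s, are illustrative assumptions rather than the paper's notation.

```python
import math

def per_token_step(rate_hz, omega_time):
    """Angular step per token so that every stream rotates at omega_time rad/s."""
    return omega_time / rate_hz

R_A, R_V = 50, 30           # frame rates in Hz, as quoted in the text
OMEGA_TIME = 2.0            # shared angular velocity in rad/s (arbitrary illustrative value)

omega_a = per_token_step(R_A, OMEGA_TIME)
omega_v = per_token_step(R_V, OMEGA_TIME)

def phase_audio(n):
    return n * omega_a

def phase_video(m):
    return m * omega_v

# Audio index 25 and video index 15 both sit at t = 0.5 s.
print(math.isclose(phase_audio(25), phase_video(15)))
# The rate-scaled steps agree: r_a * omega_a == r_v * omega_v.
print(math.isclose(R_A * omega_a, R_V * omega_v))
```

Any other choice of steps with the same product of rate and step would give the same alignment; only the shared angular velocity in real time matters.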
Self-Attention Dot Product Dependence
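The cross-modal attention reduces to a function of time difference. For each 2D pair, with $R(\theta)$ the rotation by angle $\theta$, audio index $n$, and video index $m$: $$\langle \tilde{q}^{a}_n, \tilde{k}^{v}_m \rangle = (q^{a}_n)^\top R(m\,\omega_v - n\,\omega_a)\,k^{v}_m = (q^{a}_n)^\top R\big(\Omega\,(m/r_v - n/r_a)\big)\,k^{v}_m, \qquad \Omega = r_a\,\omega_a = r_v\,\omega_v,$$ ensuring that attention is based on true temporal difference rather than index offset.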
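The dependence on time difference alone can be verified on a single 2D pair. This is a hedged sketch under the same assumptions as the text (50 Hz audio, 30 Hz video); `OMEGA_TIME`, `cross_modal_score`, and the toy vectors are hypothetical names and values introduced for illustration.

```python
import cmath

def rotate_pair(pair, angle):
    """Rotate one 2D pair (x, y) by `angle` radians."""
    z = complex(*pair) * cmath.exp(1j * angle)
    return (z.real, z.imag)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

R_A, R_V = 50, 30            # frame rates in Hz, from the text
OMEGA_TIME = 3.0             # shared angular velocity in rad/s (illustrative assumption)

def cross_modal_score(q_audio, n, k_video, m):
    """Score between an audio query at index n and a video key at index m."""
    q_rot = rotate_pair(q_audio, n * OMEGA_TIME / R_A)
    k_rot = rotate_pair(k_video, m * OMEGA_TIME / R_V)
    return dot(q_rot, k_rot)

q, k = (1.0, 0.5), (0.2, -1.0)

# Both pairs are separated by 0.1 s (video later), at different absolute times.
early = cross_modal_score(q, 10, k, 9)     # t_a = 0.2 s, t_v = 0.3 s
late = cross_modal_score(q, 60, k, 39)     # t_a = 1.2 s, t_v = 1.3 s
print(abs(early - late) < 1e-9)
```

The index offsets differ between the two cases (1 versus 21), yet the scores match because the time offset is 0.1 s in both.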
3. Integration into Multimodal Transformers
TM-RoPE is implemented via the following sequence:
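- Inputs: $X_a$ (audio, $T_a$ frames), $X_v$ (video, $T_v$ frames).
- Sampling rates: $r_a = 50$ Hz (audio after Wav2Vec2.0), $r_v = 30$ Hz (video after OpenFace).
- Projection: $Q_\mu$, $K_\mu$, $V_\mu$ for each modality $\mu \in \{a, v\}$, reshaped into $H$ heads of dimension $d_h$.
- Apply TM-RoPE: For audio, compute phase $\theta_a(n) = n\,\omega_a$; for video, $\theta_v(m) = m\,\omega_v$. Rotate each 2D pair in the Q and K vectors accordingly.
- Concatenate streams along time: $Q = [Q_a; Q_v]$, $K = [K_a; K_v]$, with $V$ concatenated the same way.
- Standard multi-head attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_h}\right) V$.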
No additional learnable position embeddings are used; the angular frequencies $\omega_a$ and $\omega_v$ are fixed by design. Model hyperparameters specified for evaluation cover the embedding dimension, the number of heads, the CTM projection size, the AdamW learning rate, and a 50-epoch training schedule (Koo et al., 11 Mar 2026).
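The sequence above can be sketched end to end for a single head. This is a simplified illustration rather than the authors' code: it skips the learned projections and takes already-projected vectors as input, uses one shared angular velocity for all pairs, and the function name `tm_rope_attention` and all toy values are assumptions.

```python
import cmath
import math

def rotate(vec, angle):
    """Rotate every 2D pair (vec[2i], vec[2i+1]) of `vec` by `angle` radians."""
    phase = cmath.exp(1j * angle)
    out = []
    for i in range(0, len(vec), 2):
        z = complex(vec[i], vec[i + 1]) * phase
        out.extend([z.real, z.imag])
    return out

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def tm_rope_attention(q_a, k_a, v_a, q_v, k_v, v_v, rate_a, rate_v, omega_time):
    """Single-head attention over concatenated audio and video tokens.

    q_*, k_*, v_* are lists of per-token vectors that are assumed to be
    already projected; omega_time is a shared angular velocity in rad/s.
    """
    # Rotate queries and keys by a phase that depends on timestamp = index / rate.
    q = [rotate(x, n * omega_time / rate_a) for n, x in enumerate(q_a)]
    q += [rotate(x, m * omega_time / rate_v) for m, x in enumerate(q_v)]
    k = [rotate(x, n * omega_time / rate_a) for n, x in enumerate(k_a)]
    k += [rotate(x, m * omega_time / rate_v) for m, x in enumerate(k_v)]
    v = list(v_a) + list(v_v)          # values are concatenated without rotation

    scale = math.sqrt(len(q[0]))
    outputs = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        outputs.append([sum(w * vj[c] for w, vj in zip(weights, v))
                        for c in range(len(v[0]))])
    return outputs

# Toy example: 5 audio tokens and 3 video tokens with 4-dimensional features.
audio = [[0.1 * (n + 1), 0.2, -0.1, 0.05 * n] for n in range(5)]
video = [[0.3, -0.2 * (m + 1), 0.1, 0.4] for m in range(3)]
out = tm_rope_attention(audio, audio, audio, video, video, video,
                        rate_a=50, rate_v=30, omega_time=2.0)
print(len(out), len(out[0]))
```

Because the rotation angle is computed from `index / rate`, the same attention code serves both intra-modal and cross-modal token pairs without any resampling step.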
4. Implicit Temporal Synchronization Principle
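TM-RoPE's design ensures that tokens encoded at the same wall-clock time across modalities are mapped to identical positions in the rotary space, by matching the rate-scaled rotation frequency, where $\omega_a$, $\omega_v$ are the per-token rotation frequencies and $r_a$, $r_v$ the frame rates:
$$r_a\,\omega_a = r_v\,\omega_v = \Omega.$$
The attention mechanism consequently becomes sensitive only to the real-time difference between tokens: for audio index $n$ and video index $m$, the relative rotation angle is $m\,\omega_v - n\,\omega_a = \Omega\,(m/r_v - n/r_a)$. This ensures implicit cross-modal temporal alignment without explicit up/downsampling, differentiable interpolation, or synchronization heuristics.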
A plausible implication is a reduction in spurious attention between temporally distant audio and video frames, thus focusing cross-modal fusion on temporally relevant interactions.
5. Empirical Validation and Comparative Analysis
The empirical performance of TM-RoPE was evaluated on the CREMA-D and RAVDESS audio-visual emotion recognition datasets using a unified multimodal self-attention (MSA) architecture. Within the MSA block, TaRoPE enabled attention heads to operate over a shared temporal reference grid, enhancing short-term cross-modal interactions and attenuating noise from temporal misalignment.
The auxiliary Cross-Temporal Matching (CTM) loss, defined with hyperparameters s, , and , encourages feature similarity between embeddings close in real time, further reinforcing temporal coherence.
Ablation results highlight the contributions of TaRoPE and CTM:
| Position Embedding | Accuracy w/o CTM (%) | Accuracy w/ CTM (%) |
|---|---|---|
| Sinusoidal | 88.09 | 88.79 |
| RoPE | 87.76 | 89.00 |
| TaRoPE (TM-RoPE) | 88.95 | 89.49 |
These results indicate that TaRoPE achieves the highest accuracy both without CTM (88.95% versus 88.09% for sinusoidal and 87.76% for standard RoPE) and with CTM (89.49% versus 88.79% and 89.00%), and that the CTM loss improves every position embedding variant, by 0.54 to 1.24 percentage points (Koo et al., 11 Mar 2026).
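The per-variant effect of the CTM loss can be read off the table with a few lines of arithmetic. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# Accuracy (%) from the ablation table: (without CTM, with CTM).
ablation = {
    "Sinusoidal": (88.09, 88.79),
    "RoPE": (87.76, 89.00),
    "TaRoPE (TM-RoPE)": (88.95, 89.49),
}

# Gain in percentage points from adding the CTM loss, rounded to two decimals.
ctm_gain = {name: round(with_ctm - without, 2)
            for name, (without, with_ctm) in ablation.items()}
best_without = max(ablation, key=lambda name: ablation[name][0])
best_with = max(ablation, key=lambda name: ablation[name][1])

for name, gain in ctm_gain.items():
    print(f"{name}: +{gain:.2f} points from CTM")
print(best_without, "|", best_with)
```

The gain is largest for standard RoPE and smallest for TaRoPE, which already starts from the highest accuracy without CTM.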
6. Component and Hyperparameter Specification
Key implementation details include:
- Embedding dimension: $d$ (total model), $d_h$ per head, $H$ heads, with 32 rotary-encoded 2D pairs per head.
- Sinusoidal/parametric rotation: Full RoPE (Su et al., RoFormer) uses a frequency vector $\omega_i$ (one entry per 2D pair) for multiple time scales; TaRoPE can simplify to a scalar $\omega$ for all pairs.
- CTM loss projection: .
- Audio/Video preprocessing: Audio downsampled to 50 Hz via Wav2Vec2.0, video Action Units at 30 Hz via OpenFace.
- Optimization: AdamW optimizer, starting learning rate , 50 epochs.
All rotary parameters are fixed by design and sampling rates, with no learnable positional embeddings in TM-RoPE.
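The difference between a multi-scale frequency vector and a single scalar frequency can be sketched as follows. The 32 pairs per head come from the text; the base of 10000 is the value used in the RoFormer formulation, and the assumption that all head dimensions are rotated, the function names, and the scalar value 0.1 are illustrative choices, not details confirmed by the source.

```python
NUM_PAIRS = 32                 # rotary-encoded 2D pairs per head, as stated in the text
ROPE_BASE = 10000.0            # base used in the RoFormer formulation

def multi_scale_frequencies(num_pairs=NUM_PAIRS, base=ROPE_BASE):
    """RoFormer-style frequencies: pair i gets base ** (-2i / head_dim)."""
    head_dim = 2 * num_pairs   # assumes every head dimension is rotated
    return [base ** (-2.0 * i / head_dim) for i in range(num_pairs)]

def scalar_frequencies(omega, num_pairs=NUM_PAIRS):
    """Simplified variant: the same frequency for every pair."""
    return [omega] * num_pairs

full = multi_scale_frequencies()
flat = scalar_frequencies(0.1)   # 0.1 is an arbitrary illustrative value
print(len(full), full[0], full[-1] < full[0], len(set(flat)))
```

In either variant the frequencies are constants computed once, which is consistent with the statement that no positional parameters are learned.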
7. Distinction from Standard RoPE and Significance
Standard RoPE applies a rotation based on token index, which can lead to temporal misalignment between modalities sampled at different rates. TM-RoPE uniquely rescales one modality's angular velocity to align the position encoding with real time, not with token index. As a result, relative positions are expressed as temporal offsets, not index offsets. This reparameterization is purely multiplicative and does not add trainable parameters or significant computational overhead, but is critical for preserving temporal cues and enabling precise multimodal attention.
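The contrast between index-based and time-based phases can be illustrated by measuring the phase gap between audio and video tokens that occur at the same instant. The sketch assumes the 50 Hz and 30 Hz rates from the text; `OMEGA_INDEX`, `OMEGA_TIME`, and both function names are hypothetical values and names chosen for the example.

```python
R_A, R_V = 50, 30      # frame rates in Hz, from the text
OMEGA_INDEX = 0.01     # per-index step for standard RoPE (illustrative value)
OMEGA_TIME = 0.5       # rad/s for the time-aligned variant (illustrative value)

def phase_gap_index_based(t_seconds):
    """Phase gap between coincident audio/video tokens under index-based RoPE."""
    n, m = round(t_seconds * R_A), round(t_seconds * R_V)
    return (n - m) * OMEGA_INDEX

def phase_gap_time_based(t_seconds):
    """Same gap when each stream's per-token step is OMEGA_TIME / rate."""
    n, m = round(t_seconds * R_A), round(t_seconds * R_V)
    return n * OMEGA_TIME / R_A - m * OMEGA_TIME / R_V

for t in (1, 2, 4):
    print(t, phase_gap_index_based(t), round(phase_gap_time_based(t), 12))
```

Under index-based rotation the gap between simultaneous tokens grows linearly with elapsed time, whereas the rate-scaled variant keeps it at zero.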
Empirical evidence suggests TM-RoPE is effective in sequence tasks where temporal correspondence across modalities is essential, such as audio-visual emotion recognition (Koo et al., 11 Mar 2026). A plausible implication is its applicability to other multimodal domains featuring asynchronous streams, where explicit time alignment is challenging or computationally expensive.