
Interleaved MRoPE in Multimodal Transformers

Updated 13 January 2026
  • Interleaved MRoPE is a multimodal positional encoding scheme that interleaves axis-specific frequencies to enhance cross-modal integration without extra parameters.
  • It replaces standard rotary embeddings in Transformers, facilitating improved spatial and temporal modeling along multiple axes for better extrapolation.
  • In Reed–Muller decoding, the interleaved codex construction reduces query complexity and enhances local decoding efficiency with robust probabilistic error bounds.

Interleaved MRoPE (MRoPE-I) refers to a multimodal positional encoding scheme for deep learning models—particularly vision-language Transformers—and also appears in error correction as an interleaved codex construction for multi-point local decoding in Reed–Muller codes. In both contexts, the “interleaving” mechanism enables simultaneous and independent incorporation of positional information across multiple axes (e.g., text, image height, image width; or multiple codeword coordinates), thereby enhancing representational fidelity, computational efficiency, and extrapolation performance without requiring architectural or parameter changes.

1. Interleaved MRoPE in Vision–Language Transformers

MRoPE-Interleave is designed as a plug-and-play positional encoding mechanism for multimodal models. Traditional Transformers equipped with 1D Rotary Positional Embedding (RoPE) assign a single position index to each token, which is suboptimal for flattened image or video data: rasterization collapses 2D/3D spatial geometry into one dimension, impairing modeling of long-range and spatial relationships. Prior approaches such as Multi-Head RoPE (MHRoPE) split the axes (temporal/text, height, width) head-wise, so that each head encodes only one axis, inhibiting cross-modal channel interaction. MRoPE-I circumvents these limitations by interleaving axis assignments within each head, thereby maintaining full frequency-spectrum coverage for every axis and facilitating multimodal positional entanglement without architectural modifications (Huang et al., 27 Oct 2025).

2. Mathematical Definition and Mechanism

Let $d$ be the per-head dimension, with $d$ even; each attention head thus encodes $d/2$ rotary frequency pairs. An interleaving pattern is selected over the three relevant axes:

  • $I_{\mathrm{t}}$: frequency-pair indices assigned to the temporal/text axis
  • $I_{\mathrm{h}}$: indices assigned to the image-height axis
  • $I_{\mathrm{w}}$: indices assigned to the image-width axis

with $|I_{\mathrm{t}}| + |I_{\mathrm{h}}| + |I_{\mathrm{w}}| = d/2$ (e.g., a $24:20:20$ split over the $d/2 = 64$ frequency pairs of a $128$-dimensional head). For each token $\ell$, either a text token or an image/video patch indexed by $(t, h, w)$, three position-IDs are constructed:

  • $p_\mathrm{t}(\ell)$: text sequence index or time index
  • $p_\mathrm{h}(\ell)$: row index
  • $p_\mathrm{w}(\ell)$: column index

A base frequency vector $\{\theta_i\}$ of length $d/2$ is defined, commonly using $\theta_i = 1/10000^{2i/d}$. For each frequency pair indexed by $i \in \{0,\ldots,d/2-1\}$, the axis assignment is

$$\mathrm{axis}(i) = \begin{cases} \mathrm{t} & \text{if } i\in I_\mathrm{t} \\ \mathrm{h} & \text{if } i\in I_\mathrm{h} \\ \mathrm{w} & \text{if } i\in I_\mathrm{w} \end{cases}$$

and the rotation angle for token $\ell$ and frequency $i$ is

$$\varphi_i(\ell) = \theta_i \cdot p_{\mathrm{axis}(i)}(\ell)$$

The standard $2\times2$ rotation, applied to $(q_{2i}, q_{2i+1})$ and analogously to the key vector $k$, computes

$$\begin{pmatrix} \tilde q_{2i} \\ \tilde q_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos \varphi_i(\ell) & -\sin \varphi_i(\ell) \\ \sin \varphi_i(\ell) & \cos \varphi_i(\ell) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$

Attention scores $\langle \tilde q, \tilde k \rangle$ are computed exactly as in standard RoPE (Huang et al., 27 Oct 2025).
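A minimal NumPy sketch of this mechanism is given below. The round-robin interleaving order, the function names (`build_axis_assignment`, `apply_mrope_interleave`), and the batching conventions are illustrative assumptions rather than the reference implementation of Huang et al.

```python
import numpy as np

def build_axis_assignment(half_dim, counts):
    """Assign each of the d/2 frequency pairs to an axis (0=t, 1=h, 2=w) in a
    round-robin interleaved order, honoring a |I_t|:|I_h|:|I_w| budget."""
    assert sum(counts) == half_dim, "axis budget must cover all frequency pairs"
    remaining = list(counts)
    axis, a = [], 0
    while len(axis) < half_dim:
        if remaining[a] > 0:
            axis.append(a)
            remaining[a] -= 1
        a = (a + 1) % 3
    return np.asarray(axis)                       # shape (d/2,)

def apply_mrope_interleave(x, pos_ids, axis, base=10000.0):
    """Rotate query/key vectors x of shape (n_tokens, d) using the per-axis
    position IDs pos_ids of shape (n_tokens, 3) = (p_t, p_h, p_w)."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)                # theta_i = 1 / base^(2i/d)
    p = pos_ids[:, axis]                          # p_{axis(i)}(l), shape (n_tokens, d/2)
    phi = p * theta                               # rotation angles phi_i(l)
    cos, sin = np.cos(phi), np.sin(phi)
    out = np.empty(x.shape, dtype=float)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin   # 2x2 rotation per pair
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out
```

Because the per-pair rotation is identical to standard RoPE, the only new ingredient is the lookup `pos_ids[:, axis]` that selects which of the three position IDs drives each frequency pair.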

3. Guiding Principles and Theoretical Properties

MRoPE-I is constructed to meet three key requirements:

  • Positional coherence: Each token’s geometric attributes are mapped directly to rotary angles. All axes (temporal/text, height, width) are represented within every head, facilitating rich joint modeling of multimodal positional relationships.
  • Full-frequency utilization: Every axis draws on the complete base frequency spectrum {θi}\{\theta_i\} within each attention head; there is no axis-wise frequency starvation.
  • Preservation of textual priors: For pure text tokens, only the text-axis channels are active ($p_\mathrm{t}=t$, $p_\mathrm{h}=p_\mathrm{w}=0$). This ensures full compatibility with language-only models and direct transfer of pre-trained weights (Huang et al., 27 Oct 2025); a position-ID construction following this convention is sketched below.
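The following sketch builds position IDs for a text prefix followed by one image, directly following the convention stated above; assigning all patches of the image the same time index is an illustrative assumption, as the paper's exact offsetting rule for mixed sequences is not reproduced here.

```python
import numpy as np

def build_position_ids(n_text, grid_h, grid_w):
    """Position IDs (p_t, p_h, p_w) for n_text text tokens followed by one
    image of grid_h x grid_w patches.  Text tokens activate only the text
    axis (p_h = p_w = 0); patches keep their row/column indices.  Giving
    every patch the time index n_text is an assumption for illustration."""
    ids = [(t, 0, 0) for t in range(n_text)]                  # pure text prefix
    ids += [(n_text, h, w)                                     # image patches
            for h in range(grid_h) for w in range(grid_w)]
    return np.asarray(ids)                                     # (n_tokens, 3)
```

For a text-only input, only the text-axis channels carry positional rotation, matching the preservation-of-textual-priors property listed above.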

4. Integration and Computational Overhead

MRoPE-Interleave introduces no additional tensor operations beyond the usual $2\times2$ cosine/sine rotations, no extra learnable parameters, and requires only the storage of three position-ID arrays (one per axis). The computational overhead is less than $1\%$ of the cost of a Transformer block.

Integration is minimal: replace the traditional RoPE function in each self-attention head with the interleaved variant. All input and output tensor shapes, softmax attention computation, and parameterization are unchanged relative to RoPE-equipped Transformers. Compatibility with existing rotary-embedding libraries is preserved, requiring only axis-based indexing of positional encodings (Huang et al., 27 Oct 2025).
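As a sketch of how minimal the integration is, the single-head attention function below (reusing the hypothetical helpers from the Section 2 sketch, and assuming a $128$-dimensional head so the default $24:20:20$ budget covers all $64$ frequency pairs) differs from a RoPE baseline only in the two rotation calls:

```python
import numpy as np
from scipy.special import softmax

def attention_with_mrope_i(q, k, v, pos_ids, counts=(24, 20, 20)):
    """Single-head scaled dot-product attention; the only change relative to a
    standard-RoPE head is that q and k are rotated with the interleaved
    variant.  q, k, v: (n_tokens, d); pos_ids: (n_tokens, 3)."""
    axis = build_axis_assignment(q.shape[-1] // 2, counts)
    q = apply_mrope_interleave(q, pos_ids, axis)
    k = apply_mrope_interleave(k, pos_ids, axis)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # unchanged attention scores
    return softmax(scores, axis=-1) @ v          # unchanged softmax attention
```

Tensor shapes, the softmax, and all weight matrices are exactly those of the RoPE-equipped baseline, which is what makes the substitution a drop-in change.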

5. Empirical Results and Comparative Gains

Evaluations across multimodal benchmarks (MVBench, STAR, VideoMME, LVBench, MLVU, Charades) demonstrate that MRoPE-I achieves an overall score of $64.95$, outperforming alternative interleaving ratios (e.g., $63.29$ for $32:16:16$ and $63.03$ for $48:8:8$ allocations). In head-to-head comparison with MHRoPE, MRoPE-I consistently achieves $0.5$–$1.0$ points higher accuracy due to within-head fusion of axis positional information (Huang et al., 27 Oct 2025).

On long-sequence extrapolation tasks (video sequences up to $256$k frames), MRoPE-I circumvents the sharp degradation seen with standard RoPE and requires a smaller ($\tfrac{3}{4}\times$) YaRN scaling factor to maintain stable performance. Ablation studies of “spatial-reset” (re-zeroing the $h, w$ positions) reveal that MRoPE-I focuses $\sim 20\%$ more attention mass on visual tokens in deeper layers, substantiating its enhanced cross-modal integration capabilities.

6. Interleaved Codex in Multi-Point Reed–Muller Decoding

The concept of interleaving is independently pivotal in efficient local decoding of Reed–Muller codes. An interleaved codex (Cramer et al., 2016) denotes a code construction where a standard codex over $\mathbb{F}_{q^2}$, with privacy and product-reconstruction properties, is concatenated with a multiplication-friendly pair, yielding a code of length $nq$ over $\mathbb{F}_q$. This facilitates simultaneous recovery of $k$ coordinates of a codeword with query complexity $O(q^2 k)$, compared to the $O(q k^2)$ required by naive $k$-fold repetition.

The multi-point decoding algorithm builds random codewords for each input dimension, queries at $nq$ interleaved positions, and decodes via half-distance rules. Lemma 2.12 establishes strong tail bounds on the error probability via $t$-wise independence:

$$\Pr\left(\left|\sum_{i}X_i-\mu\right| > A\right) \leq 8\left(\frac{t\mu + t^2/2}{A^2}\right)^{t/2}$$

Parameter regimes and decoding theorems ensure correct reconstruction with high probability, and the interleaved scheme is demonstrably more efficient for large $k$ (Cramer et al., 2016).
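As a purely numerical illustration of the quantities stated above, a short script can evaluate the Lemma 2.12 tail bound and compare the leading-order query counts; the constants are suppressed and the parameter values are arbitrary.

```python
def tail_bound(t, mu, A):
    """Right-hand side of the Lemma 2.12 bound: 8 * ((t*mu + t^2/2) / A^2)^(t/2)."""
    return 8.0 * ((t * mu + t * t / 2.0) / A ** 2) ** (t / 2.0)

def query_counts(q, k):
    """Leading-order query counts: interleaved codex ~ q^2 * k versus
    naive k-fold repetition ~ q * k^2 (constants suppressed)."""
    return q * q * k, q * k * k

print(tail_bound(t=4, mu=10.0, A=100.0))          # ~1.8e-4: tiny once A >> mu
for k in (4, 16, 64, 256):
    print(k, query_counts(q=16, k=k))             # interleaving wins once k > q
```

The comparison makes the asymptotic claim concrete: the interleaved scheme's query count grows linearly in $k$, so it overtakes naive repetition as soon as $k$ exceeds $q$.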

7. Significance and Outlook

Interleaved MRoPE unifies direct, high-fidelity representation of multimodal position information with computational efficiency and transferability of pre-trained weights. In Transformer-based vision–language architectures, it obviates the need for architectural surgery and avoids low-frequency underutilization or axis isolation. In coding theory, interleaved codex enables efficient, simultaneous decoding of multiple coordinates at scale, backed by rigorous probabilistic bounds.

A plausible implication is that interleaving, as an architectural and algorithmic motif, generalizes to settings requiring multi-axis or multi-target encoding and decoding while improving both theoretical and practical efficiency. Future exploration may focus on further generalization to $n$-dimensional modalities, automated frequency-allocation strategies, and library-level integration for specialized domains.

