
MSRoPE-BiL: 3-Axis Positional Embedding

Updated 2 December 2025
  • The paper introduces a 3-axis rotary positional embedding that extends standard 2D RoPE with a bi-directional layer index, unambiguously identifying the provenance of multi-layer tokens.
  • It employs composite rotations with frequency partitioning across the x, y, and z axes, yielding a robust mechanism for capturing spatial and inter-layer relationships in unified attention sequences.
  • Empirical results demonstrate significant improvements on unified RGBA tasks, including lower SAD in mask-free matting and lower (better) LPIPS in object removal compared to 2D RoPE methods.

MSRoPE-BiL (Multi-Scale Rotary Positional Embedding with Bi-directionally Extendable Layer Axis) is a three-axis rotary positional embedding scheme that extends standard 2D RoPE with a bi-directional “layer” index. Developed as a core architectural component of the OmniAlpha framework, MSRoPE-BiL enables a single transformer backbone to process multiple input and output RGBA images simultaneously, along with text-derived VLM tokens. The technique is fundamental to unified, multi-task RGBA manipulation: each patch’s provenance (spatial location and layer index) is unambiguously encoded within a single attention sequence (Yu et al., 25 Nov 2025).

1. Motivation and Conceptual Foundations

Standard rotary positional embeddings (RoPE) encode only the 2D spatial grid—coordinates (x, y)—of an image token within transformer-based vision models. This suffices for single-image tasks but is insufficient for RGBA pipelines requiring multi-layer inputs and outputs, such as foreground/background separation, multi-modal compositing, and mask-free matting. In these settings, the transformer must process and discriminate multiple RGBA images (inputs and denoised targets) along with text tokens that may serve as task or content conditioning.

MSRoPE-BiL introduces a third positional axis, z (“layer index”), that assigns non-negative z to inputs and VLM tokens and negative z to outputs. The extension ensures that attention can distinguish not just “where” a patch is located but also “which layer” it belongs to. The model can thereby reason jointly about spatial and inter-layer relationships among all elements of the sequence, avoiding the collapse of patches from distinct images into a single, spatially indexed grid.

2. Mathematical Formulation

Let each token’s position be $p = (x, y, z)$, where $x \in \{0, \dots, H-1\}$ and $y \in \{0, \dots, W-1\}$ are patch coordinates, and $z \in \mathbb{Z}$ indicates the layer (with $z \geq 0$ for inputs/VLM tokens and $z < 0$ for outputs).

The 3-axis rotary embedding $R(\cdot\,; p)$ maps a $d$-dimensional vector $h \in \mathbb{R}^d$ to a rotated vector that embeds the $(x, y, z)$ information:

  1. Frequency Partitioning: Split the embedding into three frequency tables $\omega^x, \omega^y, \omega^z \in \mathbb{R}^{d/6}$, allocating $d/6$ frequencies (each rotating one pair of dimensions, i.e., $d/3$ dimensions) per axis.
  2. Single-axis Rotation: For an axis $a \in \{x, y, z\}$ with coordinate value $a$, define

$$\left[\mathrm{Rot}_a(h, a)\right]_{2i:2i+2} = \begin{pmatrix} \cos(\omega^a_i a) & -\sin(\omega^a_i a) \\ \sin(\omega^a_i a) & \cos(\omega^a_i a) \end{pmatrix} \begin{pmatrix} h_{2i} \\ h_{2i+1} \end{pmatrix}$$

  3. Composite Rotation: Apply the three rotations sequentially:

$$R(h; x, y, z) = \mathrm{Rot}_z\!\left(\mathrm{Rot}_y\!\left(\mathrm{Rot}_x(h, x),\, y\right),\, z\right)$$

For a query $q$ and key $k$ at positions $p_i = (x_i, y_i, z_i)$ and $p_j = (x_j, y_j, z_j)$, respectively:

$$q' = R(q; x_i, y_i, z_i), \qquad k' = R(k; x_j, y_j, z_j)$$

The attention score is then:

$$\mathrm{score}(i, j) = (q')^\top k'$$

By the translation-invariance property of RoPE, this score depends only on the coordinate differences $(x_i - x_j,\, y_i - y_j,\, z_i - z_j)$, encoding relative spatial and layer relationships.

  4. Implementation Detail: Since most RoPE routines assume non-negative indices, all raw $z$ values are shifted by an offset $S_\text{offset} \geq m$ (the number of output layers), e.g.

$$z_\text{impl} = z_\text{raw} + m$$

This preserves relative order while mapping the negative output indices into a non-negative, non-overlapping range. A minimal code sketch of the full scheme follows.
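
The sketch below is a minimal NumPy rendering of steps 1–4, not the OmniAlpha implementation: the function names, head dimension, and frequency base of 10000 are assumptions. It also checks the relative-position property numerically.

```python
import numpy as np

D = 12  # head dimension; must be divisible by 6 (3 axes x 2 dims per frequency)

def make_freqs(n_pairs, base=10000.0):
    # Standard RoPE-style inverse-frequency table for one axis (base assumed).
    return base ** (-np.arange(n_pairs) / n_pairs)

def rotate_pairs(chunk, freqs, coord):
    # Rotate consecutive (even, odd) dimension pairs of `chunk` by freqs * coord.
    ang = freqs * coord
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(chunk)
    out[0::2] = c * chunk[0::2] - s * chunk[1::2]
    out[1::2] = s * chunk[0::2] + c * chunk[1::2]
    return out

def msrope_bil(h, x, y, z, m):
    # Composite rotation R(h; x, y, z): each axis rotates its own d/3-dim chunk.
    # With m output layers, raw z is shifted by m so all indices are >= 0.
    d = h.shape[-1]
    freqs = make_freqs(d // 6)            # d/6 frequencies per axis
    hx, hy, hz = h[: d // 3], h[d // 3 : 2 * d // 3], h[2 * d // 3 :]
    return np.concatenate([
        rotate_pairs(hx, freqs, x),
        rotate_pairs(hy, freqs, y),
        rotate_pairs(hz, freqs, z + m),   # z_impl = z_raw + m
    ])

# Relative-position check: the score depends only on coordinate differences.
rng = np.random.default_rng(0)
q, k = rng.normal(size=D), rng.normal(size=D)
s1 = msrope_bil(q, 3, 5, -2, m=3) @ msrope_bil(k, 1, 4, 0, m=3)
s2 = msrope_bil(q, 13, 15, 8, m=3) @ msrope_bil(k, 11, 14, 10, m=3)
assert np.isclose(s1, s2)  # same offsets (2, 1, -2) in both cases
```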

3. Integration into the DiT Backbone

Within OmniAlpha’s Diffusion Transformer (DiT):

  • Every token’s position tuple $(x, y, z)$ is assigned prior to MHA computation.
  • The rotary embedding $R(\cdot\,; x, y, z)$ is applied independently to $Q = W_q h$ and $K = W_k h$, with $V$ unchanged.
  • All modalities (inputs, outputs, VLM) share a single joint attention space, distinguished by $z$.
  • At the first transformer layer, content is ordered as:
    • VLM tokens ($z \in \{n, \dots, n+T-1\}$ for $T$ tokens)
    • Input RGBA patches ($z = 0, 1, \dots, n-1$)
    • Output RGBA patches ($z = -1, -2, \dots, -m$)

This schema makes each token's provenance explicit, enabling unrestricted cross-modal attention while maintaining layer separation; a sketch of one attention head with the rotation applied is given below.
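
The following is a minimal sketch of how the rotation could sit inside a single attention head, reusing numpy and msrope_bil from the snippet above; the projection matrices and softmax details are generic assumptions, not OmniAlpha specifics.

```python
def attention_with_msrope(H, positions, W_q, W_k, W_v, m):
    # H: (seq, d) token states; positions: one (x, y, z) tuple per token.
    Q, K, V = H @ W_q.T, H @ W_k.T, H @ W_v.T
    # Rotate queries and keys only; values stay un-rotated, as in standard RoPE.
    Qr = np.stack([msrope_bil(qi, *p, m=m) for qi, p in zip(Q, positions)])
    Kr = np.stack([msrope_bil(ki, *p, m=m) for ki, p in zip(K, positions)])
    scores = Qr @ Kr.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)    # row-wise softmax
    return w @ V
```

Because the rotation touches only $Q$ and $K$, tokens from every modality can attend to one another in a single joint sequence while their layer identity remains encoded in the attention scores.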

4. Illustration of Multi-Layer Sequence Construction

A text-schematic illustration for $n = 2$ inputs (e.g., foreground/background) and $m = 3$ outputs (e.g., denoised foreground, denoised background, composite):

[ VLM1, VLM2, ..., VLM_T         ]   z ∈ {2, ..., 2+T−1}
[ FG_patch(0,0), ..., FG_patch(H−1,W−1) ]   z=0
[ BG_patch(0,0), ..., BG_patch(H−1,W−1) ]   z=1
[ Ŷ1_patch(0,0), ..., Ŷ1_patch(H−1,W−1) ]   z=−1
[ Ŷ2_patch(0,0), ..., Ŷ2_patch(H−1,W−1) ]   z=−2
[ Ŷ3_patch(0,0), ..., Ŷ3_patch(H−1,W−1) ]   z=−3

Attention scores for query positions such as $(x, y, z = -2)$ are thus aware of their precise output layer, and the model can compute pairwise dot products across any combination of input, output, and text tokens while maintaining exact spatial and inter-layer distinction. The position tuples for this example could be constructed as in the sketch below.
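
A brief sketch of building the position tuples for this $n = 2$, $m = 3$ example; the function name and the choice of $(0, 0)$ as the spatial coordinate for text tokens are illustrative assumptions.

```python
def build_positions(H, W, n=2, m=3, T=4):
    # VLM tokens occupy z = n .. n+T-1 (spatial coordinate (0, 0) assumed here).
    pos = [(0, 0, n + t) for t in range(T)]
    # Input layers z = 0 .. n-1, then output layers z = -1 .. -m.
    for z in list(range(n)) + [-(k + 1) for k in range(m)]:
        pos += [(x, y, z) for x in range(H) for y in range(W)]
    return pos

positions = build_positions(H=2, W=2)
# VLM tokens at z = 2..5, FG patches at z = 0, BG at z = 1, outputs at z = -1..-3.
```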

5. Empirical Evaluation and Observed Benefits

No direct ablation isolates the effect of 2D RoPE versus 3-axis MSRoPE-BiL, but overall system results indicate its contribution to unified multi-RGBA processing:

  • Layer-conditioned completion: FG2FULL/BG2FULL win rates vs. LayerDiffuse (which uses 2D RoPE) reach 85–95% under both LLM and human evaluation.
  • Mask-free matting (AIM-500): SAD reduced from 48.09 (SmartMatting, standard pipeline) to 7.80 with OmniAlpha plus MSRoPE-BiL, an 83.8% relative reduction.
  • Referring matting (RefMatte-RW100): SAD improved from 7.37 to 6.75.
  • RORD object removal (“decompose”): LPIPS improved from 0.1320 to 0.1268.

These consistent cross-task improvements are plausibly attributable to MSRoPE-BiL’s ability to structure attention across multiple layers, a capacity not offered by standard 2D RoPE.

| Task | Baseline (2D RoPE) | MSRoPE-BiL + OmniAlpha |
|---|---|---|
| FG2FULL/BG2FULL win rate | vs. LayerDiffuse* | 85–95% |
| Mask-free matting (AIM-500, SAD ↓) | 48.09 | 7.80 |
| Referring matting (RefMatte-RW100, SAD ↓) | 7.37 | 6.75 |
| RORD object removal (LPIPS ↓) | 0.1320 | 0.1268 |

*Win rates are for OmniAlpha outperforming the LayerDiffuse baseline; the attribution of 2D RoPE to LayerDiffuse is as described in the source.

6. Distinction from Previous Methods and Broader Implications

MSRoPE-BiL’s primary distinction is its explicit bi-directional layer axis, which enables unified generative models to handle multiple inputs and multiple outputs in parallel. This contrasts with sequential or channel-stacking approaches, which conflate spatial and layer information or forgo strict separation among modalities. The resulting representational capacity supports large-scale, generalist RGBA models and suggests applicability to other unified, multi-modal attention scenarios where inputs, outputs, and conditioning information must be distinguished.

A plausible implication is that MSRoPE-BiL-style axis augmentation could generalize to settings with additional structure, such as video (frame index) or annotation (instance index). This suggests a direction for constructing sequence-to-sequence frameworks that handle complex, multi-layered, multi-modal generative tasks within a single attention space (Yu et al., 25 Nov 2025).
