Interleaved MRoPE for GUI Grounding

Updated 6 October 2025

The paper introduces I-MRoPE as a balanced positional encoding that interleaves frequency components across dimensions.
It mitigates standard MRoPE’s limitations by uniformly distributing low and high frequencies, ensuring consistent spatial detail.
I-MRoPE combined with RULER tokens improves GUI grounding precision, achieving robust mapping on high-resolution interfaces.

Interleaved MRoPE (I-MRoPE) is a positional encoding technique introduced to address the limitations of standard multidimensional rotary positional embeddings (MRoPE) in tasks that require fine-grained spatial localization across multiple axes, such as GUI grounding. It balances frequency allocation across spatial dimensions, improving the model’s ability to encode, compare, and generalize precise positions in layouts of two or more dimensions. This design is particularly effective where accurate mapping of semantic instructions to pixel-level coordinates is critical, notably in GUI automation and instruction following by vision-language models (VLMs) (Wang et al., 3 Oct 2025).

1. Motivation and Limitations of Standard MRoPE

Standard MRoPE extends one-dimensional rotary positional encodings (RoPE) by dividing the frequency spectrum into contiguous blocks for each dimension (e.g., width and height). In mathematical terms, positional rotations are applied via block-diagonal matrices:

$R^\mathrm{MRoPE}_{\Theta, t, h, w} = \operatorname{diag}(R_{\Theta_t, t}, R_{\Theta_h, h}, R_{\Theta_w, w})$

where each $R_{\Theta_*, *}$ receives a distinct, contiguous segment of the frequency spectrum (here for temporal, height, and width positions $t$, $h$, $w$). For a two-dimensional position $(h, w)$, half the available frequencies encode $h$ and the other half encode $w$. This block-wise assignment induces an imbalance: one axis receives only the lower frequencies (long wavelengths, coarse encoding) and the other only the higher frequencies (short wavelengths, fine granularity), or vice versa depending on the assignment.
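The block-wise allocation can be illustrated with a minimal Python sketch. The function names, the 16-dimensional head, and the base of 10000 are illustrative assumptions rather than details taken from the paper, and only two axes are modeled.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Return theta_j = base^(-2j/d), one frequency per 2x2 rotation block."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def contiguous_axis_assignment(num_freqs, axes=("height", "width")):
    """Assign each axis one contiguous chunk of frequency indices (block-wise)."""
    chunk = num_freqs // len(axes)
    # Any remainder indices fall to the last axis.
    return [axes[min(j // chunk, len(axes) - 1)] for j in range(num_freqs)]


thetas = rope_frequencies(head_dim=16)          # assumed head size
assignment = contiguous_axis_assignment(len(thetas))

height_thetas = [t for t, a in zip(thetas, assignment) if a == "height"]
width_thetas = [t for t, a in zip(thetas, assignment) if a == "width"]

# The two axes occupy disjoint frequency ranges.
print(min(height_thetas), max(width_thetas))
```

Under these assumptions, height is assigned only the four highest frequencies and width only the four lowest, so the two axes resolve position at very different scales.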

This imbalance is detrimental in applications such as GUI grounding, where equivalent spatial precision is required on both axes. The model’s ability to resolve positional nuance along the underrepresented axis is compromised, especially on high-resolution or extreme-aspect-ratio layouts, or when generalizing to combinations of coordinates not well-sampled during training (Wang et al., 3 Oct 2025).

2. Interleaving Frequency Assignment: Core Algorithmic Principle

I-MRoPE overcomes this imbalance by interleaving the assignment of frequency components across spatial axes:

For each positional encoding channel with index $j$, the responsible dimension is set as

$p_j = \begin{cases} \text{height}, & j \bmod 2 = 0 \\ \text{width}, & j \bmod 2 = 1 \end{cases}$

for the 2D case (generalized to three or more dimensions by increasing the modulus).

The result is a positional embedding where both axes are represented by an alternating sequence of low and high frequencies.
The rotation applied to channel $j$ for a token at coordinate $(h, w)$ becomes (for 2D)

$R_j = \begin{cases} R_{\theta_j, h}, & \text{if } j \bmod 2 = 0 \\ R_{\theta_j, w}, & \text{if } j \bmod 2 = 1 \end{cases}$

where $R_{\theta_j, m}$ denotes the $2\times2$ rotation by angle $m\theta_j$, and the $\theta_j$ form a geometric progression of rotary frequencies, as in standard RoPE: $\theta_j = b^{-2j/d}$ for $j = 0, \dots, d/2 - 1$ ($b$ is the RoPE base, $d$ is the head size).

This gives each spatial axis frequencies drawn from across the full spectrum, so that both coarse and fine spatial information are available along both dimensions.
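The interleaved assignment can be sketched in a few lines of Python. As before, the helper names, the 16-dimensional head, and the base of 10000 are assumptions chosen for illustration, not values reported in the paper.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Return theta_j = base^(-2j/d), one frequency per 2x2 rotation block."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def interleaved_axis_assignment(num_freqs, axes=("height", "width")):
    """Assign frequency index j to axis j mod len(axes)."""
    return [axes[j % len(axes)] for j in range(num_freqs)]


def axis_span(thetas, assignment, axis):
    """Ratio of the highest to the lowest frequency owned by one axis."""
    own = [t for t, a in zip(thetas, assignment) if a == axis]
    return max(own) / min(own)


thetas = rope_frequencies(head_dim=16)          # assumed head size
assignment = interleaved_axis_assignment(len(thetas))

span_h = axis_span(thetas, assignment, "height")
span_w = axis_span(thetas, assignment, "width")
print(assignment)
print(span_h, span_w)
```

With this configuration each axis covers a frequency ratio of about 1000, whereas a contiguous half of the same spectrum would cover a ratio of only about 32. Passing three axis names generalizes the modulus to the three-dimensional case.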

3. Mathematical Formulation and Properties

For a patch (or token) at position $m$, the rotary positional transformation in each $2\times2$ channel block is

$R_{\theta_j, m} = \begin{bmatrix} \cos(m \theta_j) & -\sin(m \theta_j) \\ \sin(m \theta_j) & \cos(m \theta_j) \end{bmatrix}$

with the angle’s scaling distributed according to the interleaved index. In I-MRoPE, $m$ is specialized to the appropriate spatial coordinate as above.

Key properties:

Both width and height receive frequencies from across the entire available range, promoting uniform sensitivity along both axes.
Local differences (high-frequency response) and global layout (low-frequency response) are made equally expressive in both spatial axes.
Interleaving is directly compatible with the original RoPE implementation, requiring only reordering in the frequency-to-dimension mapping at initialization or during embedding table construction.
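A small pure-Python sketch can make the rotation concrete and check one property inherited from RoPE: the query-key score depends only on the relative offset along each axis. The function `imrope_rotate`, the adjacent-channel pairing, the base, and the sample vectors are all illustrative assumptions; real implementations may pair channels differently.

```python
import math


def imrope_rotate(vec, h, w, base=10000.0):
    """Rotate each 2x2 block j of vec by angle m * theta_j,
    where m = h for even j and m = w for odd j (2D interleaving)."""
    d = len(vec)
    out = []
    for j in range(d // 2):
        theta = base ** (-2.0 * j / d)
        m = h if j % 2 == 0 else w
        c, s = math.cos(m * theta), math.sin(m * theta)
        x, y = vec[2 * j], vec[2 * j + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary example vectors (head size 8).
q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]
k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.5, 0.7]

# Same relative offset (3 rows, 5 columns) at two different absolute locations.
score_a = dot(imrope_rotate(q, 5, 9), imrope_rotate(k, 2, 4))
score_b = dot(imrope_rotate(q, 105, 59), imrope_rotate(k, 102, 54))
print(abs(score_a - score_b))
```

The only change relative to a one-dimensional RoPE loop is the line that selects `m`, which reflects the point above that interleaving amounts to a reordering of the frequency-to-dimension mapping.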

4. Practical Benefits in GUI Grounding

GUI grounding tasks require mapping natural-language references to precise $(x, y)$ pixel coordinates in potentially high-resolution, variable-size layouts. Standard MRoPE’s imbalance results in accuracy degradation for coordinates aligned to the underrepresented dimension, impacting robustness when scaling to larger or previously unseen aspect ratios.

Empirical evidence demonstrates that I-MRoPE improves grounding accuracy, especially in the following scenarios (Wang et al., 3 Oct 2025):

High-resolution screens, where coordinate space is denser and minor errors are exacerbated.
Transfer to unseen interface sizes, where previously unrepresented dimension/frequency combinations become common.

I-MRoPE’s uniform frequency spectrum for each axis mitigates spatial bias, resulting in more faithful mapping of model outputs across a variety of resolutions and layouts.

5. Synergy with RULER Tokens

While I-MRoPE ensures balanced, precise spatial representations at the embedding level, further improvements are achieved by explicitly mapping embeddings to absolute pixel coordinates via RULER tokens. These tokens are:

Injected at fixed intervals, each marking a canonical position (e.g., every $s$ pixels along each axis).
For each visual patch, the model references the nearest RULER token and predicts an offset:

$\text{coordinate}_\text{final} = \text{coordinate}_\text{RULER} + \Delta$

$\Delta$ is constrained by $b = s \times p$ (with $p$ the patch size; this bound $b$ is distinct from the RoPE base $b$ in Section 2).

By shifting the regression from absolute position to a "reference-and-adjust" paradigm, the network’s burden of learning complex position-to-pixel mappings is greatly reduced. The combination of I-MRoPE’s uniform high/low-frequency coverage and RULER’s explicit referential mechanism substantially reduces out-of-distribution error on novel screen sizes.
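The reference-and-adjust idea can be sketched for a single axis as follows. This is a hypothetical illustration only: the function names, the spacing of 100 pixels, and the choice of the nearest RULER position at or below the target (so that the offset is non-negative) are assumptions, not the paper's exact procedure.

```python
def ruler_positions(extent, spacing):
    """Canonical RULER coordinates 0, spacing, 2*spacing, ... below extent."""
    return list(range(0, extent, spacing))


def encode(coordinate, spacing):
    """Split an absolute coordinate into (reference, offset).

    Assumption: the reference is the closest RULER position that does not
    exceed the coordinate, so 0 <= offset < spacing.
    """
    ref = (coordinate // spacing) * spacing
    return ref, coordinate - ref


def decode(ref, delta):
    """coordinate_final = coordinate_RULER + delta."""
    return ref + delta


spacing = 100                      # assumed interval between RULER positions
ref, delta = encode(1234, spacing)
print(ref, delta, decode(ref, delta))
```

In this toy setting the model would only need to regress an offset bounded by the spacing, regardless of how large the screen is, which is the sense in which the regression burden is reduced.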

6. Empirical Results and Application Scope

ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro evaluations report that models incorporating I-MRoPE (with RULER tokens) obtain significant gains in coordinate localization, with the most pronounced improvements on high-resolution interfaces. The contributions are orthogonal: I-MRoPE addresses spatial encoding fidelity, while RULER tokens address coordinate grounding instability (Wang et al., 3 Oct 2025).

The architectural enhancements generalize to any task where:

Multi-axial spatial localization at high precision is essential,
There is a need to extrapolate to unobserved position-frequency combinations,
Explicit referential mapping reduces the complexity of output regression.

A plausible implication is that analogous interleaving schemes may be valuable in domains such as map-based instruction following, spatial-language understanding, or grid-based robotic control where multidimensional absolute position must be faithfully encoded.

7. Limitations and Open Directions

I-MRoPE requires explicit design of frequency-channel allocation and careful compatibility with tokenization strategies, especially if the number of dimensions or scales is variable at inference. The practical gains depend on the surrounding architectural choices (e.g., whether spatial pyramid pooling or multi-scale patching is used) and on the availability of sufficiently dense position annotations for training.

No evidence is presented on its efficacy in self-supervised settings or in non-visual modalities, and the method assumes the underlying model architecture (e.g., vision transformer or hybrid VLM) supports custom positional encoding layers. Additional evaluation in multi-modal or non-grid-aligned two-dimensional tasks would be needed to fully characterize the scope of I-MRoPE’s benefits.

In summary, Interleaved MRoPE provides balanced positional encoding for multidimensional spatial tasks by interleaving frequency bands across spatial axes, overcoming the axis-dependent limitations of standard MRoPE. Its integration with explicit reference tokens (RULER) enables precise, robust, and generalizable GUI grounding across diverse resolutions and configurations (Wang et al., 3 Oct 2025).


References (1)

Improving GUI Grounding with Explicit Position-to-Coordinate Mapping (2025)
