Circle-RoPE: Cone-Like Positional Encoding
- Circle-RoPE is a cone-like rotary positional encoding that decouples text and image tokens by mapping image tokens onto a circular trajectory for unbiased multimodal fusion.
- It leverages group-theoretic methods (SO(2) rotations) and a geometric construction to ensure all cross-modal token distances are equal, eliminating spurious dependencies.
- Empirical evaluations demonstrate Circle-RoPE’s efficacy, with performance gains across benchmarks (average score of 68.28) while maintaining detailed spatial fidelity.
Circle-RoPE (Cone-like Decoupled Rotary Positional Embedding) is a positional encoding scheme designed to resolve cross-modal positional bias in large vision-language models (LVLMs), particularly when extending rotary positional embedding (RoPE) to joint text-image token sequences. Unlike standard RoPE, which entangles text and image token indices and induces spurious alignments, Circle-RoPE geometrically maps image tokens onto a circular trajectory orthogonal to the linear text token path. This cone-like structure explicitly equalizes cross-modal distances, eliminating unintended positional dependencies while preserving intra-image spatial information (Wang et al., 22 May 2025). The construction and theoretical foundation leverage group-theoretic principles, recasting Circle-RoPE as an SO(2) (unit circle) special case of the broader N-dimensional RoPE framework (Liu et al., 7 Apr 2025).
1. Standard Rotary Positional Embedding (RoPE)
RoPE (Su et al., 2024) encodes positional information by rotating each token's query and key vectors in the complex plane, parameterized by the token's position $m$ and a set of base frequencies $\theta_i$. For the $i$-th 2D vector pair, the rotation is given by

$$R(m\theta_i) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}.$$

These rotations produce attention scores that depend solely on relative positions $m - n$:

$$\langle R(m\theta_i)\, q,\; R(n\theta_i)\, k \rangle = \langle q,\; R((n - m)\theta_i)\, k \rangle.$$

RoPE thus encodes translation-invariant relative dependencies without learned absolute embeddings, enabling extrapolation and efficient modeling in self-attention architectures.
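The relative-position property can be checked directly. Below is a minimal NumPy sketch of a single 2D channel pair (the function `rope_rotate` and all variable names are illustrative, not drawn from the papers):

```python
import numpy as np

def rope_rotate(vec, pos, theta):
    """Rotate a 2D query/key pair by the angle pos * theta."""
    a = pos * theta
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    return rot @ vec

q = np.array([1.0, 0.5])
k = np.array([0.3, -0.8])
theta = 0.1
m, n = 7, 3

# The attention logit depends only on the offset m - n: shifting both
# positions by the same amount leaves the score unchanged.
s1 = rope_rotate(q, m, theta) @ rope_rotate(k, n, theta)
s2 = rope_rotate(q, m + 5, theta) @ rope_rotate(k, n + 5, theta)
assert np.isclose(s1, s2)
```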
2. Cross-Modal Positional Bias in LVLMs
When RoPE is applied naively to concatenated text and image token streams in LVLMs, relative position encoding links text token indices $m$ and flattened image token indices $n$. This creates artificial dependencies:
- Semantic misalignment: Text tokens attend disproportionately to image tokens with minimal index distance $|m - n|$, irrespective of their true spatial positions.
- Patch inequivalence: Multiple image patches representing identical content receive different RoPE biases due to their flattened indices, resulting in unequal cross-modal associations.
These biases disrupt the intended semantics and induce spurious cross-modal alignments, degrading multimodal reasoning (Wang et al., 22 May 2025).
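A toy index calculation makes the patch-inequivalence effect concrete (the grid size and offsets below are hypothetical):

```python
# Naive 1D RoPE indexing over a flattened image grid.
text_pos = 10        # index of a text token in the joint sequence
grid = 4             # 4x4 patch grid, flattened row-major
img_offset = 11      # image tokens follow the text token

# Two patches in the same image column...
idx_a = img_offset + 0 * grid + 2   # row 0, col 2
idx_b = img_offset + 3 * grid + 2   # row 3, col 2

# ...end up at very different relative offsets to the same text token.
print(abs(text_pos - idx_a), abs(text_pos - idx_b))  # -> 3 15
```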
3. Per-Token Distance (PTD) Metric
Circle-RoPE introduces the Per-Token Distance (PTD) to quantify the independence of positional encodings between modalities. For text tokens $i = 1, \dots, N_t$ and image tokens $j = 1, \dots, N_v$, let $d_{ij}$ denote the distance between the positional encodings of text token $i$ and image token $j$. The mean image distance per text token is

$$\bar{d}_i = \frac{1}{N_v} \sum_{j=1}^{N_v} d_{ij},$$

and

$$\mathrm{PTD} = \frac{1}{N_t N_v} \sum_{i=1}^{N_t} \sum_{j=1}^{N_v} \bigl| d_{ij} - \bar{d}_i \bigr|$$

measures the nonuniformity of distances; $\mathrm{PTD} = 0$ signifies perfect decoupling, i.e., every text token is equidistant (in the embedding sense) from all image tokens (Wang et al., 22 May 2025).
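A direct implementation of the metric as reconstructed above (a sketch; the function name and array layout are our choices, not the paper's API):

```python
import numpy as np

def ptd(text_pos, img_pos):
    """Per-Token Distance: mean absolute deviation of text-to-image distances.

    text_pos: (N_t, D) positions assigned to text tokens
    img_pos:  (N_v, D) positions assigned to image tokens
    Returns 0 exactly when every text token is equidistant from all image tokens.
    """
    d = np.linalg.norm(text_pos[:, None, :] - img_pos[None, :, :], axis=-1)
    d_bar = d.mean(axis=1, keepdims=True)   # mean image distance per text token
    return np.abs(d - d_bar).mean()
```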
4. Geometric Construction of Circle-RoPE
Circle-RoPE achieves cross-modal decoupling by geometrically constructing the sequence positions (a code sketch follows the list):
- Centralization: The image token grid is centered, replacing each patch coordinate $(x, y)$ by $(x - \bar{x},\, y - \bar{y})$.
- Mixed-Angle Circular Mapping: Each image patch receives a mixed angle $\phi = \alpha \phi_x + (1 - \alpha) \phi_y$, blending the angles induced by its row and column coordinates, and is mapped onto a circle of radius $r$ at $(r\cos\phi,\, r\sin\phi)$.
- Target Plane Rotation: The image token circle is rotated to lie orthogonal to the text token axis in 3D space; text tokens remain on the axis at $(t, 0, 0)$, while image tokens are mapped to $p_j = (c,\, r\cos\phi_j,\, r\sin\phi_j)$, forming a cone-like structure (see Fig. 3(b,c) in (Wang et al., 22 May 2025)).
- RoPE Application: Each token's projected 3D coordinate serves as its position for standard RoPE, producing identical pairwise relative distances between text and image tokens.
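The following NumPy sketch assembles the three geometric steps; the mixing rule, the radius `r`, and the plane offset `c` are illustrative assumptions rather than the paper's exact parameterization. The final assertion numerically checks the PTD = 0 property derived in Section 5.

```python
import numpy as np

def circle_rope_positions(n_text, grid_h, grid_w, r=1.0, alpha=0.5, c=None):
    # Text tokens stay on the 1D axis: (t, 0, 0).
    text_pos = np.stack([np.arange(n_text, dtype=float),
                         np.zeros(n_text), np.zeros(n_text)], axis=-1)

    # 1. Centralization: center the patch grid around the origin.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    xs = xs.ravel() - xs.mean()
    ys = ys.ravel() - ys.mean()

    # 2. Mixed-angle circular mapping: blend row/column angles with weight alpha.
    phi = alpha * (2 * np.pi * xs / grid_w) + (1 - alpha) * (2 * np.pi * ys / grid_h)

    # 3. Target-plane rotation: place the circle in the plane x = c,
    #    orthogonal to the text token axis.
    c = float(n_text) if c is None else c
    img_pos = np.stack([np.full_like(phi, c),
                        r * np.cos(phi), r * np.sin(phi)], axis=-1)
    return text_pos, img_pos

text_pos, img_pos = circle_rope_positions(n_text=8, grid_h=4, grid_w=4)
d = np.linalg.norm(text_pos[:, None, :] - img_pos[None, :, :], axis=-1)
assert np.allclose(d, d[:, :1])   # every text token equidistant from all patches
```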
5. Theoretical Justification: PTD=0
Let text positions be $p_t = (t, 0, 0)$ and image token positions $p_j = (c,\, r\cos\phi_j,\, r\sin\phi_j)$ on a circle of radius $r$ orthogonal to the text axis. Then

$$\| p_t - p_j \|^2 = (t - c)^2 + r^2$$

for any $\phi_j$. The Euclidean distance from any text token to any image token is therefore invariant across image tokens, ensuring all cross-modal RoPE biases are identical. Thus, PTD vanishes, confirming Circle-RoPE's decoupling property (Wang et al., 22 May 2025).
6. Staggered Layer Alternating Encoding
To address minor degradation in intra-image spatial detail arising from pure Circle-RoPE, an Alternating Geometry Encoding (AGE) strategy is implemented:
- Odd-numbered transformer layers use 2D grid-based M-RoPE.
- Even-numbered layers apply Circle-RoPE on cone-like indices.
This alternation enables lower layers to capture fine-grained image geometry, while higher layers benefit from cross-modal decoupling (Wang et al., 22 May 2025).
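In pseudocode, the alternation reduces to a layer-parity check (the parity convention and scheme names here are assumptions for illustration):

```python
def positional_scheme(layer_idx: int) -> str:
    """Pick the positional encoding for a given transformer layer under AGE."""
    # Odd layers keep 2D grid M-RoPE for fine-grained image geometry;
    # even layers switch to cone-like Circle-RoPE indices for decoupling.
    return "mrope_2d_grid" if layer_idx % 2 == 1 else "circle_rope_cone"

print([positional_scheme(i) for i in range(1, 5)])
# ['mrope_2d_grid', 'circle_rope_cone', 'mrope_2d_grid', 'circle_rope_cone']
```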
7. Empirical Evaluation and Results
Experimental validation uses Qwen2.5-VL-3B with a frozen vision encoder, finetuning only the LLM; the training data is a curated 1M-sample subset of MAmmoTH-VL. Evaluation spans diverse benchmarks (MMMU, MMMU-Pro, MathVista, MMStar, MMBench, AI2D, ChartQA, RealWorldQA, TextVQA):
- Circle-RoPE achieves the strongest overall results in the comparison, with an average score of 68.28 compared to 66.93 for the underlying Qwen2.5-VL-3B baseline.
- Notable improvements include MMMU (+1.89), AI2D (+3.66), MathVista (+1.0), TextVQA (+1.32).
- Ablation studies identify the best-performing angle-mixing and dual-frame fusion settings, and demonstrate that AGE outperforms static encoding choices.
| Dataset | SAIL-VL | InternVL2.5 | Circle-RoPE |
|---|---|---|---|
| MMMU_val | 41.44 | 51.56 | 52.11 |
| MMMU-Pro_all | 14.51 | 26.65 | 28.44 |
| MathVista | 60.70 | 60.60 | 63.40 |
| MMStar | 56.47 | 58.53 | 58.20 |
| AI2D | 77.72 | 81.38 | 81.80 |
| ChartQA_test | 80.20 | 78.08 | 84.12 |
| TextVQA | 77.20 | 78.71 | 80.39 |
| Avg | 62.67 | 66.93 | 68.28 |
These results demonstrate Circle-RoPE's efficacy in resolving cross-modal positional bias without forfeiting spatial fidelity (Wang et al., 22 May 2025).
8. Mathematical Foundations and Broader Framework
Circle-RoPE is formulated within the Lie group/Lie algebra blueprint for N-dimensional RoPE (Liu et al., 7 Apr 2025):
- RoPE is cast as a one-parameter subgroup $t \mapsto e^{tB}$, with skew-symmetric generators $B$ spanning a maximal abelian subalgebra (MASA) of $\mathfrak{so}(n)$; the circle (SO(2)) is the base case (see the sketch after this list).
- The resulting 2D rotations ensure relativity and reversibility, encoding each discrete token position $m$ into an angle $m\theta_i$, where $\theta_i$ is a selected frequency.
- For multi-head attention, input queries and keys are split into 2D pairs; each pair receives a rotation $R(m\theta_i)$, generalizing Circle-RoPE across channels.
- This group-theoretic formulation unifies RoPE mechanism across modalities, enabling theoretically sound extensions to higher-dimensional positional embeddings.
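The exponential-map relationship underpinning this view can be checked numerically; in the sketch below, $B$ is the standard skew-symmetric basis element of $\mathfrak{so}(2)$:

```python
import numpy as np
from scipy.linalg import expm

B = np.array([[0.0, -1.0],
              [1.0,  0.0]])       # skew-symmetric generator of so(2)
m, theta = 5, 0.1                 # token position and channel frequency

R_exp = expm(m * theta * B)       # one-parameter subgroup exp(m * theta * B)
R_rot = np.array([[np.cos(m * theta), -np.sin(m * theta)],
                  [np.sin(m * theta),  np.cos(m * theta)]])
assert np.allclose(R_exp, R_rot)  # exp(m * theta * B) equals R(m * theta)
```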
Circle-RoPE thus exemplifies the SO(2) instantiation of multi-dimensional RoPE with explicit geometric separation of modalities, ensuring rigorous decoupling and optimal multimodal fusion (Liu et al., 7 Apr 2025, Wang et al., 22 May 2025).