
Circle-RoPE: Cone-Like Positional Encoding

Updated 30 November 2025
  • Circle-RoPE is a cone-like rotary positional encoding that decouples text and image tokens by mapping image tokens onto a circular trajectory for unbiased multimodal fusion.
  • It leverages group-theoretic methods (SO(2) rotations) and a geometric construction to ensure all cross-modal token distances are equal, eliminating spurious dependencies.
  • Empirical evaluations demonstrate Circle-RoPE’s efficacy, with performance gains across benchmarks (average score of 68.28) while maintaining detailed spatial fidelity.

Circle-RoPE (Cone-like Decoupled Rotary Positional Embedding) is a positional encoding scheme designed to resolve cross-modal positional bias in large vision-language models (LVLMs), particularly when extending rotary positional embedding (RoPE) to joint text-image token sequences. Unlike standard RoPE, which entangles text and image token indices and induces spurious alignments, Circle-RoPE geometrically maps image tokens onto a circular trajectory orthogonal to the linear text token path. This cone-like structure explicitly equalizes cross-modal distances, thus eliminating unintended positional dependencies while preserving intra-image spatial information (Wang et al., 22 May 2025). The construction and theoretical foundation leverage group-theoretic principles, recasting Circle-RoPE as an SO(2) (unit circle) special case of the broader N-dimensional RoPE framework (Liu et al., 7 Apr 2025).

1. Standard Rotary Positional Embedding (RoPE)

RoPE (Su et al. 2024) encodes positional information by rotating each token's query and key vectors in the complex plane, parameterized by the token's position $p$ and a set of base frequencies $\omega_j$. For the $j$-th vector pair, the rotation is given by

$$R(p)_j = \begin{pmatrix} \cos(p\,\omega_j) & -\sin(p\,\omega_j) \\ \sin(p\,\omega_j) & \cos(p\,\omega_j) \end{pmatrix}$$

These rotations produce attention scores that depend solely on the relative position $(p_i - p_j)$:

$$(R(p_i)q_i)^\mathsf{T}(R(p_j)k_j) = q_i^\mathsf{T}k_j\,\cos[(p_i - p_j)\omega_j] + \cdots$$

RoPE thus encodes translation-invariant relative dependencies without learned absolute embeddings, enabling extrapolation and efficient modeling in self-attention architectures.
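The relative-position property can be checked numerically: rotating a query at position $p_i$ and a key at position $p_j$ yields the same attention score as any other pair of positions with the same offset. A minimal numpy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def rope_rotate(x, p, omega):
    """Rotate each consecutive 2D pair of x by the angle p * omega_j (standard RoPE)."""
    x = x.reshape(-1, 2)                      # split channels into pairs
    angles = p * omega                        # one rotation angle per pair
    c, s = np.cos(angles), np.sin(angles)
    out = np.stack([c * x[:, 0] - s * x[:, 1],
                    s * x[:, 0] + c * x[:, 1]], axis=1)
    return out.reshape(-1)

rng = np.random.default_rng(0)
d = 8
omega = 10000.0 ** (-np.arange(d // 2) / (d // 2))  # conventional geometric frequencies
q, k = rng.normal(size=d), rng.normal(size=d)

# The score depends only on the relative offset p_i - p_j (= 3 in both cases):
s1 = rope_rotate(q, 5, omega) @ rope_rotate(k, 2, omega)
s2 = rope_rotate(q, 103, omega) @ rope_rotate(k, 100, omega)
assert np.isclose(s1, s2)
```

Because each $2 \times 2$ rotation is orthogonal, $R(p_i)^\mathsf{T} R(p_j) = R(p_j - p_i)$, which is exactly why the two scores coincide.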

2. Cross-Modal Positional Bias in LVLMs

When RoPE is applied naively to concatenated text and image token streams in LVLMs, relative position encoding links text token indices and flattened image token indices. This creates artificial dependencies:

  • Semantic misalignment: Text tokens attend disproportionately to image tokens with minimal $|t - i|$, irrespective of their true spatial positions.
  • Patch inequivalence: Multiple image patches representing identical content receive different RoPE biases due to their indices, resulting in unequal cross-modal associations.

These biases disrupt the intended semantics and induce spurious cross-modal alignments, degrading multimodal reasoning (Wang et al., 22 May 2025).

3. Per-Token Distance (PTD) Metric

Circle-RoPE introduces the Per-Token Distance (PTD) to quantify the independence of positional encodings between modalities. For text tokens $T = \{t_1, \ldots, t_{N_t}\}$ and image tokens $I = \{i_1, \ldots, i_{N_v}\}$ indexed appropriately, with $D_{\text{abs}}(t, i) = |t - i|$, the mean image distance per text token is

$$\bar D_t = \frac{1}{N_v} \sum_{i \in I} D_{\text{abs}}(t, i)$$

and

$$\mathrm{PTD} = \frac{1}{N_t N_v} \sum_{t \in T} \sum_{i \in I} \left| D_{\text{abs}}(t, i) - \bar D_t \right|$$

PTD measures the nonuniformity of distances; $\mathrm{PTD} = 0$ signifies perfect decoupling, i.e., text tokens are equidistant (in the embedding sense) from all image tokens (Wang et al., 22 May 2025).
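The metric is straightforward to compute. The following sketch uses a 1D toy example (in one dimension the only way to make all distances equal is to give every image token the same index; Circle-RoPE achieves equidistance non-degenerately in 3D):

```python
import numpy as np

def ptd(text_pos, image_pos):
    """Per-Token Distance: mean absolute deviation of text-to-image distances."""
    D = np.abs(text_pos[:, None] - image_pos[None, :])   # D_abs(t, i) = |t - i|
    mean_per_text = D.mean(axis=1, keepdims=True)        # \bar D_t
    return np.abs(D - mean_per_text).mean()

# Vanilla RoPE indexing: image tokens get consecutive indices after the text
# tokens, so each text token sits at different distances from different patches.
text = np.arange(4, dtype=float)            # text positions 0..3
image = np.arange(4, 10, dtype=float)       # image positions 4..9
coupled = ptd(text, image)                  # > 0: modalities are coupled

# Degenerate 1D decoupling: all image tokens share one index -> PTD = 0.
image_decoupled = np.full(6, 7.0)
decoupled = ptd(text, image_decoupled)      # exactly 0.0
```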

4. Geometric Construction of Circle-RoPE

Circle-RoPE achieves cross-modal decoupling by geometrically constructing the sequence positions:

  • Centralization: The image token grid $C$ is centered,

$$C' = C - P_{\text{center}}$$

  • Mixed-Angle Circular Mapping: Each image patch receives a mix of angles,

$$\theta_{ij}^{\text{mix}} = \alpha\, \theta_{ij}^{\text{SA}} + (1 - \alpha)\, \theta_{ij}^{\text{GA}}$$

mapped onto a circle of radius $R$.

  • Target Plane Rotation: The image token circle is rotated to lie orthogonal to the text token axis in 3D space; text tokens remain on the axis, image tokens are mapped via

$$P^{\text{proj}}_{ij} = x_{ij}^{\text{circ}}\,\mathbf u + y_{ij}^{\text{circ}}\,\mathbf v$$

forming a cone-like structure (see Fig. 3(b,c) in (Wang et al., 22 May 2025)).

  • RoPE Application: Each projected 3D coordinate serves as its “position” for standard RoPE, producing identical pairwise relative distances between text and image tokens.
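The steps above can be sketched in numpy. The exact definitions of the spatial angle $\theta^{\text{SA}}$ and global angle $\theta^{\text{GA}}$ in the paper may differ from the stand-ins used here (angle from the centered 2D layout vs. angle from the flattened index); the point of the sketch is the geometry: after projection onto the plane spanned by $\mathbf u, \mathbf v$, every image token is equidistant from any text token on the orthogonal axis:

```python
import numpy as np

def circle_positions(h, w, R=1.0, alpha=0.5):
    """Map an h x w image-patch grid onto a circle of radius R lying in the
    plane orthogonal to the text-token axis (here the z-axis).

    theta_sa / theta_ga are illustrative stand-ins for the paper's
    spatial-angle and global-angle terms."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Centralization: shift grid coordinates so the image center sits at 0.
    rows_c = rows - (h - 1) / 2
    cols_c = cols - (w - 1) / 2
    # Mixed-angle circular mapping (assumed forms for the two angle terms).
    theta_sa = np.arctan2(rows_c, cols_c)                  # angle from 2D layout
    theta_ga = 2 * np.pi * (rows * w + cols) / (h * w)     # angle from flat index
    theta = alpha * theta_sa + (1 - alpha) * theta_ga
    # Target-plane rotation: the circle lives in the plane spanned by u and v.
    u, v = np.array([1., 0., 0.]), np.array([0., 1., 0.])
    pts = (R * np.cos(theta))[..., None] * u + (R * np.sin(theta))[..., None] * v
    return pts.reshape(-1, 3)

P = circle_positions(4, 4)                 # 16 image-token positions on the circle
lam = 3.0
text_pos = lam * np.array([0., 0., 1.])    # a text token on the orthogonal axis
d = np.linalg.norm(text_pos - P, axis=1)
# Every image token is the same distance sqrt(lam^2 + R^2) from the text token.
assert np.allclose(d, np.sqrt(lam**2 + 1.0))
```

Each row of `P`, together with the text token's axial coordinate, would then be fed to standard (multi-axis) RoPE as that token's position.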

5. Theoretical Justification: PTD=0

Let text positions be $\{\lambda\,\mathbf n\}$ and let the image token positions $P_k$ lie on a circle of radius $R$ orthogonal to $\mathbf n$. Then

$$\|\lambda\,\mathbf n - P_k\|_2 = \sqrt{\lambda^2 + R^2}$$

for any $k$. The Euclidean distance from any text token to any image token is invariant, ensuring all cross-modal RoPE biases are identical. Thus PTD vanishes, confirming Circle-RoPE's decoupling property (Wang et al., 22 May 2025).

6. Staggered Layer Alternating Encoding

To address minor degradation in intra-image spatial detail arising from pure Circle-RoPE, an Alternating Geometry Encoding (AGE) strategy is implemented:

  • Odd-numbered transformer layers use 2D grid-based M-RoPE.
  • Even-numbered layers apply Circle-RoPE on cone-like indices.

This alternation enables lower layers to capture fine-grained image geometry, while higher layers benefit from cross-modal decoupling (Wang et al., 22 May 2025).
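The alternation rule is a simple function of the layer index. A minimal sketch (the 1-based layer-indexing convention and the scheme names are assumptions for illustration):

```python
def positional_encoding_for_layer(layer_idx: int) -> str:
    """Alternating Geometry Encoding (AGE): interleave the two schemes.

    Odd-numbered layers keep 2D grid-based M-RoPE (fine intra-image geometry);
    even-numbered layers use Circle-RoPE's cone-like indices (cross-modal
    decoupling). Layer indices are taken as 1-based here."""
    return "m_rope_2d" if layer_idx % 2 == 1 else "circle_rope"

schedule = [positional_encoding_for_layer(i) for i in range(1, 5)]
# -> ['m_rope_2d', 'circle_rope', 'm_rope_2d', 'circle_rope']
```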

7. Empirical Evaluation and Results

Experimental validation uses Qwen2.5-VL-3B with frozen vision encoder, finetuned only on the LLM; training data is a curated 1M-sample subset of MAmmoTH-VL. Evaluation spans diverse benchmarks (MMMU, MMMU-Pro, MathVista, MMStar, MMBench, AI2D, ChartQA, RealWorldQA, TextVQA):

  • Circle-RoPE achieves the strongest results among the compared models, with an average score of 68.28 versus 66.93 for the underlying Qwen2.5-VL-3B baseline.
  • Notable improvements include MMMU (+1.89), AI2D (+3.66), MathVista (+1.0), TextVQA (+1.32).
  • Ablation studies determine the optimal angle-mixing at $\alpha = 0.5$ and dual-frame fusion at $\beta = 0.1$, and demonstrate that AGE outperforms static encoding choices.
Dataset         SAIL-VL   InternVL2.5   Circle-RoPE
MMMU_val        41.44     51.56         52.11
MMMU-Pro_all    14.51     26.65         28.44
MathVista       60.70     60.60         63.40
MMStar          56.47     58.53         58.20
AI2D            77.72     81.38         81.80
ChartQA_test    80.20     78.08         84.12
TextVQA         77.20     78.71         80.39
Avg             62.67     66.93         68.28

These results demonstrate Circle-RoPE's efficacy in resolving cross-modal positional bias without sacrificing spatial fidelity (Wang et al., 22 May 2025).

8. Mathematical Foundations and Broader Framework

Circle-RoPE is formulated within the Lie group/Lie algebra blueprint for N-dimensional RoPE (Liu et al., 7 Apr 2025):

  • RoPE is cast as a one-parameter subgroup $R(p) = \exp(p\,B)$, with $B$ spanning the maximal abelian subalgebra (MASA) of $\mathfrak{so}(2)$ for the circle (SO(2)).
  • The resulting 2D rotations ensure relativity and reversibility, encoding each discrete token position $p$ into an angle $\theta(p) = \omega\,p$, where $\omega$ is a selected frequency.
  • For multi-head attention, input queries and keys are split into pairs; each pair receives a rotation $R(\omega_i\,p)$, generalizing Circle-RoPE across channels.
  • This group-theoretic formulation unifies RoPE mechanism across modalities, enabling theoretically sound extensions to higher-dimensional positional embeddings.
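The one-parameter-subgroup view can be made concrete: exponentiating a scaled skew-symmetric generator of $\mathfrak{so}(2)$ reproduces the RoPE rotation matrix, and the group law $R(p)R(q) = R(p+q)$ is exactly the relativity property. A small self-contained check (the truncated Taylor series stands in for a library matrix exponential):

```python
import numpy as np

# so(2) generator: one skew-symmetric matrix spans the (abelian) algebra;
# exp(p * B) yields the SO(2) rotation used by RoPE at position p.
omega = 0.3
B = omega * np.array([[0., -1.],
                      [1.,  0.]])

def expm(M, terms=30):
    """Matrix exponential via a truncated Taylor series (fine for small 2x2 M)."""
    out, term = np.eye(2), np.eye(2)
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def R(p):
    return expm(p * B)

p, q = 1.7, -0.4
# One-parameter subgroup: R(p) R(q) = R(p + q) -- relativity of positions.
assert np.allclose(R(p) @ R(q), R(p + q))
# exp(p B) matches the closed-form rotation by the angle theta(p) = omega * p.
assert np.allclose(R(p), [[np.cos(omega * p), -np.sin(omega * p)],
                          [np.sin(omega * p),  np.cos(omega * p)]])
```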

Circle-RoPE thus exemplifies the SO(2) instantiation of multi-dimensional RoPE with explicit geometric separation of modalities, ensuring rigorous decoupling and optimal multimodal fusion (Liu et al., 7 Apr 2025, Wang et al., 22 May 2025).
