
Concentric Causal Attention (CCA)

Updated 20 March 2026
  • Concentric Causal Attention (CCA) is a positional reordering strategy that restructures visual tokens into concentric rings and adapts causal masks to reduce object hallucination in LVLMs.
  • It approximately halves the maximum and average token distances between visual features and language instructions, improving multimodal attention despite the inherent long-range decay of RoPE.
  • Empirical evaluations show CCA outperforms raster-scan baselines on object hallucination benchmarks without modifying core model architectures or adding parameters.

Concentric Causal Attention (CCA) is a positional alignment and attention-masking strategy designed to mitigate object hallucination in Large Vision-Language Models (LVLMs) by restructuring the order of visual tokens and adapting the causal attention mask. CCA specifically targets a limitation of Rotary Position Encoding (RoPE)-based Transformers, whose long-range positional decay impedes effective visual-instruction interaction. By reordering visual patch tokens into concentric rings rather than in raster-scan order, and redefining the causal mask accordingly, CCA roughly halves the maximum and average token distances from visual content to language instructions. This yields measurable improvements on multiple object hallucination benchmarks without introducing additional parameters or altering core model architectures (Xing et al., 2024).

1. Background: Rotary Position Encoding and Long-Term Decay

Transformers with RoPE encode positional information by rotating query and key vectors through block-diagonal matrices parameterized by token position. In a self-attention layer, position-free query and key vectors $q_i, k_j \in \mathbb{R}^d$ interact via $a_{i,j} \propto \exp\!\left( \frac{q_i^\top k_j}{\sqrt{d}} \right)$. RoPE injects position via

$$q_i \leftarrow R^d_{\theta,i}\, q_i, \qquad k_j \leftarrow R^d_{\theta,j}\, k_j,$$

where $R^d_{\theta,m}$ is a block-diagonal rotation matrix with frequencies $\theta_i = 10000^{-2(i-1)/d}$. The resulting attention score depends only on the relative position $j - i$:

$$a_{i,j} \propto \exp\!\left( \frac{q_i^\top R^d_{\theta,j-i}\, k_j}{\sqrt{d}} \right).$$

Empirically, as $|j-i|$ grows, $R^d_{\theta,j-i}$ rotates $q_i$ and $k_j$ increasingly out of alignment, so $a_{i,j}$ shrinks sharply ("long-term decay"). In LVLMs, where visual tokens (preprocessed image features) are concatenated ahead of textual tokens, the visual tokens farthest from the instruction sequence are the most affected, degrading visual grounding and producing object hallucination when relevant cues lie far away in the multimodal token stream (Xing et al., 2024).
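The long-term decay can be observed directly with a small numerical sketch (NumPy; the dimensions are hypothetical and `rope_rotate` is an illustrative helper, not a library API):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation R^d_{theta,pos} to a d-dim vector (d even)."""
    d = x.shape[-1]
    # Block frequencies theta_i = base^{-2(i-1)/d} for i = 1..d/2
    freqs = base ** (-2.0 * np.arange(d // 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d = 64
q = np.random.default_rng(0).standard_normal(d)
k = q.copy()  # identical content, so any score drop is purely positional

# Attention logit q_i^T R_{theta,j-i} k_j / sqrt(d) for growing |j - i|
scores = [rope_rotate(q, 0) @ rope_rotate(k, dist) / np.sqrt(d)
          for dist in (0, 8, 64, 512)]
```

With identical content vectors, the zero-distance logit is the maximum attainable; the logits at larger offsets are damped by the rotation misalignment alone.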

2. Concentric Causal Attention: Definition and Core Mechanisms

CCA addresses the RoPE-induced distance decay by reorganizing visual token ordering and redesigning the causal attention mask. The strategy consists of two principal components: Concentric Visual Token Re-organization and Concentric Causal Masking.

2.1 Concentric Visual Token Re-organization

The visual feature grid of size $\sqrt{V} \times \sqrt{V}$ is partitioned into $K \equiv \lceil \sqrt{V}/2 \rceil$ concentric rings, with each token assigned a ring index $\mathrm{ring}(x, y) = 1 + \min\{x,\ \sqrt{V}-1-x,\ y,\ \sqrt{V}-1-y\}$ for grid coordinates $(x, y)$. Tokens are reordered so that the sequence progresses from the outermost ring to the innermost, preserving clockwise locality within each ring. This reordering reduces the maximum relative distance from any visual token to the instruction segment from $V + T - 1$ (raster scan) to $K + T - 1 \approx \sqrt{V}/2 + T - 1$.
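A minimal sketch of the ring assignment and outer-to-inner reordering (Python; the clockwise spiral traversal is one consistent choice — the paper's exact within-ring starting point is assumed here):

```python
def ring_index(x, y, side):
    """Ring number (1 = outermost) for grid coordinate (x, y)."""
    return 1 + min(x, side - 1 - x, y, side - 1 - y)

def concentric_order(side):
    """Raster indices permuted from the outer ring to the innermost,
    walking each ring clockwise from its top-left corner."""
    order = []
    top, left, bottom, right = 0, 0, side - 1, side - 1
    while top <= bottom and left <= right:
        # top edge, left -> right
        order += [top * side + c for c in range(left, right + 1)]
        # right edge, downward
        order += [r * side + right for r in range(top + 1, bottom + 1)]
        if top < bottom:
            # bottom edge, right -> left
            order += [bottom * side + c for c in range(right - 1, left - 1, -1)]
        if left < right:
            # left edge, upward
            order += [r * side + left for r in range(bottom - 1, top, -1)]
        top, left, bottom, right = top + 1, left + 1, bottom - 1, right - 1
    return order

perm = concentric_order(4)  # 16 visual tokens on a 4x4 grid, 2 rings
```

On a 4×4 grid, the 12 perimeter tokens (ring 1) precede the 4 central tokens (ring 2).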

2.2 Concentric Causal Masking

CCA modifies the 1-D causal mask to align with the ring-based permutation. With $\pi$ the permutation mapping raster-scan indices to ring order, the mask is

$$M_{i,j}^{\mathrm{CCA}} = \begin{cases} 0 & \text{if } \pi(j) \leq \pi(i) \\ -\infty & \text{otherwise.} \end{cases}$$

This keeps attention causal within the new concentric ordering: outer (earlier) rings cannot attend to inner (later) rings, so visual context propagates inward during decoding, respecting 2-D locality.
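Given a concentric permutation, the mask follows directly from the definition above (a sketch; `perm` maps concentric rank to raster index, and the 4-token example is hypothetical):

```python
import numpy as np

def cca_mask(perm):
    """Build M^CCA: entry (i, j) is 0 iff pi(j) <= pi(i), else -inf,
    where pi maps a raster index to its rank in the concentric order."""
    n = len(perm)
    pi = np.empty(n, dtype=int)
    pi[np.asarray(perm)] = np.arange(n)  # invert: raster index -> rank
    return np.where(pi[None, :] <= pi[:, None], 0.0, -np.inf)

# Toy 4-token example: raster indices visited in the order 0, 1, 3, 2
mask = cca_mask([0, 1, 3, 2])
```

Adding this mask to the attention logits before the softmax reproduces standard causal decoding, but in the concentric rather than raster order.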

3. Integration into LVLM Architectures

In models such as LLaVA, the baseline pipeline feeds image features through a vision encoder and MLP, flattens the spatial grid into a linear sequence (raster scan), and concatenates this sequence with text tokens prior to Transformer-based decoding with RoPE. CCA is incorporated by:

  • Reshaping post-MLP visual tokens to a grid.
  • Applying the concentric ring permutation.
  • Constructing the CCA mask according to the new order.
  • Concatenating with instruction tokens (positions $V+1$ to $V+T$).
  • Feeding the resulting sequence through the original Transformer decoder (with unchanged weight parameters).

No new embedding layers, learned positional parameters, or architectural modifications are necessary; only the token sequencing and the mask bitmap are altered (Xing et al., 2024).
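The bulleted pipeline above can be sketched end to end (NumPy; the shapes and the clockwise spiral traversal are illustrative assumptions, and a real LVLM would pass the result to its unchanged decoder):

```python
import numpy as np

def apply_cca(visual_tokens, text_len):
    """Reorder a (side*side, d) visual token grid into concentric order and
    build the joint causal mask for [visual | text] decoding. Plain causal
    masking over the reordered sequence equals M^CCA in the new order."""
    V, d = visual_tokens.shape
    side = int(round(V ** 0.5))
    # Concentric permutation: spiral from the outer ring inward (clockwise)
    order, top, left, bottom, right = [], 0, 0, side - 1, side - 1
    while top <= bottom and left <= right:
        order += [top * side + c for c in range(left, right + 1)]
        order += [r * side + right for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [bottom * side + c for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [r * side + left for r in range(bottom - 1, top, -1)]
        top, left, bottom, right = top + 1, left + 1, bottom - 1, right - 1
    reordered = visual_tokens[order]           # reshape + concentric permute
    n = V + text_len                           # concat with instruction tokens
    idx = np.arange(n)
    mask = np.where(idx[None, :] <= idx[:, None], 0.0, -np.inf)
    return reordered, mask

tokens = np.arange(18.0).reshape(9, 2)         # toy 3x3 grid, d = 2
reordered, mask = apply_cca(tokens, text_len=4)
```

Only the token sequencing and the mask change; the decoder weights are untouched, matching the parameter-free claim above.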

4. Theoretical Properties

CCA reduces the maximum and average relative distances from visual to instruction tokens. Let $\Delta_{bs}$ denote the set of relative distances under raster scan and $\Delta_{cca}$ those under CCA. The maximum under CCA is $\lceil \sqrt{V}/2 \rceil + (T-1)$, compared to $V + (T-1)$ under raster scan (Section 2.1). Since the RoPE attention kernel decays monotonically as $|\Delta|$ grows, this compression raises the attention weights available to previously distant, weakly attended visual tokens, enabling more effective multimodal integration (Xing et al., 2024).
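For concreteness, under a hypothetical 24×24 visual grid ($V = 576$) with $T = 32$ instruction tokens — sizes assumed for illustration, not taken from the text above — the two maxima work out as:

```python
import math

V, T = 576, 32              # hypothetical grid and instruction sizes
side = math.isqrt(V)        # 24
K = math.ceil(side / 2)     # number of concentric rings: 12
max_raster = V + T - 1      # farthest visual-to-instruction offset, raster scan
max_cca = K + T - 1         # farthest offset after concentric reordering
```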

5. Empirical Evaluation

CCA demonstrates substantial improvements over previous hallucination mitigation strategies across diverse LVLM benchmarks. The following tables summarize key results reported in (Xing et al., 2024):

Object Hallucination Benchmarks

Method   POPE acc (%)   POPE F1 (%)   CHAIR_S ↓   CHAIR_I ↓   CHAIR recall ↑   MME (total)
LLaVA    79.1           78.1          46.2        12.9        80.3             565.3
VCD      84.5           84.3          —           —           —                604.7
CCA      86.9           85.5          43.0        11.5        80.4             641.7

Human and Multiple-Choice Evaluations

Method   GPT-4 Overall   Multi-choice Overall
LLaVA    58.9            58.6
OPERA    61.3            —
VCD      58.3            —
CCA      64.3            61.7

CCA consistently outperforms LLaVA (the raster-order baseline) and the prior technique VCD across POPE, CHAIR, MME, GPT-4 ratings, and multiple-choice suites. Notably, CHAIR hallucination rates drop (lower CHAIR_S and CHAIR_I) while recall of ground-truth objects is preserved.

Ablation studies confirm that the chosen $\lceil \sqrt{V}/2 \rceil$ ring decomposition is both minimal for reducing the maximum distance and more locality-preserving than fixed-width $k \times k$ subblocks for $k > 2$.

6. Discussion and Limitations

CCA specifically targets the long-term decay in RoPE by minimizing the attention distance between visual and instruction tokens. However, it does not address other sources of hallucination (e.g., biases in data, generative overconfidence). CCA yields minimal benefit when $V$ is small, since long-distance decay effects are then negligible. Applicability is limited to fixed square-grid image representations; highly variable or non-square grids would require adaptation.

Future research directions include extending positional reorganization to additional modalities (audio, video) where standard flattening disrupts locality, exploring learnable or hybrid positioning strategies, and integrating CCA with data-centric and post-hoc hallucination correction procedures.

CCA represents a parameter-free, architecture-agnostic intervention, effecting a substantial reduction in RoPE-induced vision–language grounding decay solely by modifying token order and mask structure, and demonstrating broad empirical efficacy on standard hallucination metrics (Xing et al., 2024).
