Pyramid Position Encoding Generator (PPEG)

Updated 11 March 2026
  • PPEG is a visual position encoding method that uses concentric ring indexing from the image periphery to the center, reducing attention decay in VLMs.
  • It dynamically adjusts encoding parameters with descent intervals across transformer layers to balance fine- and coarse-grained performance.
  • Empirical results show gains on multimodal benchmarks, with enhanced visual perception, more rational cross-modal attention, and reduced anchor-token over-aggregation.

Pyramid-descent Visual Position Encoding (PyPE) is a position encoding scheme developed for vision-language models (VLMs) that aims to address shortcomings in traditional visual token position encodings, specifically the long-term decay and position misalignment induced by methods such as Rotary Position Embedding (RoPE) combined with raster-scan indexing. PyPE employs a concentric "ring index" approach, assigning visual token positions from the image periphery toward the center and dynamically expanding the central receptive field as model depth increases. This method mitigates attention decay across large visual patches and enables improved multi-granularity visual perception, facilitating more rational allocation of cross-modal attention and reducing the over-aggregation of anchor tokens in large-scale multimodal models (Chen et al., 19 Jan 2025).

1. Peripheral-to-center Visual Position Indexing

PyPE begins by mapping the linear sequence of visual tokens—produced by a traditional raster scan—back onto their two-dimensional $H \times W$ patch grid. Visual positions are then indexed using integer-valued rings, denoted $p = 1, 2, \ldots, P_{\text{max}}$, where all tokens on the outer border receive $p = 1$, tokens on the next inner border $p = 2$, continuing inward to the central region, which receives $p = P_{\text{max}}$. For a patch at location $(i, j)$ (with $i, j$ zero-based), the assignment is formalized as:

$$P(i, j) = p \iff \min(i,\; j,\; H - 1 - i,\; W - 1 - j) = p - 1,$$

Peripheral patches are thus indexed with low $p$, and central patches with high $p$. This ring indexing captures hierarchical spatial relationships: semantically related regions in the same or adjacent rings receive similar indices, reducing their relative positional distance in later computations.
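The assignment above can be sketched as a small helper. This is a minimal illustration, assuming the closed form $\min(i, j, H-1-i, W-1-j) + 1$ for the 1-based ring index; the function name is hypothetical:

```python
def ring_index(i, j, H, W):
    """1-based concentric ring index: 1 on the image border, growing inward.

    Closed form of the peripheral-to-center assignment: the ring number is
    the patch's distance to its nearest image border, plus one.
    """
    return min(i, j, H - 1 - i, W - 1 - j) + 1

# On a 5x5 grid the border is ring 1 and the single center patch is ring 3.
for i in range(5):
    print([ring_index(i, j, 5, 5) for j in range(5)])
```

For the 5×5 grid this prints ring 1 along the border, ring 2 one step in, and ring 3 at the center patch.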

2. Mathematical Formulation and Integration with Attention

The encoding begins by setting $P_{\text{max}}^{(0)} = \left\lfloor \min(H, W) / 2 \right\rfloor$, representing the number of concentric rings admitted by the patch grid shape. For each transformer layer $\ell$, the maximum ring index shrinks according to

$$P_{\text{max}}^{(\ell)} = \max\left(1,\; P_{\text{max}}^{(0)} - \left\lfloor \ell / t \right\rfloor\right),$$

where $t$ is the descent interval (the number of layers per shrinkage step). At each layer, the ring-index matrix $\mathbb{P}^{(\ell)} \in \mathbb{N}^{H \times W}$ is constructed so that every patch at a given ring $p$ satisfies the border constraint above, with ring indices exceeding $P_{\text{max}}^{(\ell)}$ clamped to $P_{\text{max}}^{(\ell)}$—this merging of inner rings is what expands the central receptive field with depth.

For attention computation, $\mathbb{P}^{(\ell)}$ is flattened to a length-$HW$ vector, and the positions of the text tokens are appended to form the full sequence for the RoPE mechanism. Attention between a query at position $m$ and a key at position $n$ utilizes RoPE's complex rotation:

$$A_{m, n} = \mathrm{Re}\left\{ (e^{im\Theta} q_m)^{\top} (e^{in\Theta} k_n) \right\} = \mathrm{Re}\left\{ q_m^{\top} e^{i(m-n)\Theta} k_n \right\},$$

with $m, n$ drawn from the ring-indexed sequence. Since the relative distance $|m - n|$ remains small for semantically proximate patches, the attenuation induced by $e^{i(m-n)\Theta}$ is largely eliminated for related visual regions.
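The relative-position property underlying this argument can be checked numerically. The sketch below uses a single rotary frequency on 2-D vectors—an illustrative simplification of the full multi-frequency RoPE—to show that the rotated dot product depends only on the offset $m - n$, which is exactly the quantity ring indexing keeps small:

```python
import numpy as np

def rotate(x, pos, theta=0.5):
    """Rotate a 2-D vector by the angle pos * theta (one RoPE frequency)."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

q, k = np.array([1.0, 0.3]), np.array([0.7, -0.2])

# The rotated dot product depends only on the relative offset m - n:
a = rotate(q, 42) @ rotate(k, 21)   # offset 21
b = rotate(q, 23) @ rotate(k, 2)    # also offset 21
assert np.isclose(a, b)

# Ring indexing places the same patch pair at adjacent indices, offset 1:
c_ = rotate(q, 2) @ rotate(k, 1)
```

With the full multi-frequency $\Theta$, large offsets additionally attenuate the score (RoPE's long-term decay); small ring offsets sidestep that attenuation.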

3. Comparison with Raster-scan and Other Position Encodings

Traditional raster-scan indexing allocates positions in row-major order:

$$m_{\text{raster}}(i, j) = i \cdot W + j,$$

yielding large absolute positional differences for spatially distant patches, which under RoPE leads to pronounced attention decay. PyPE, in contrast, bases $m_{\text{pyp}}(i, j)$ on the ring assignment:

$$m_{\text{pyp}}(i, j) = \mathbb{P}(i, j),$$

so that patches in the same ring have $|m - n| = 0$ and patches in adjacent rings have $|m - n| = 1$. This keeps $|m - n| = O(\min(H, W)/2) \ll HW$, preserving uniform attention across the entire image.
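Concretely, the two schemes' worst-case relative distances can be compared for the 21 × 21 grid discussed below (the closed-form ring index and the clamp at $\lfloor \min(H, W)/2 \rfloor$ are assumptions consistent with the surrounding text):

```python
H = W = 21  # patch grid used as the running example

# Raster-scan: top-left vs. bottom-right corner patch
m = 0 * W + 0
n = (H - 1) * W + (W - 1)
print(abs(m - n))   # 440 = HW - 1

# Ring indexing (closed-form assumption), clamped at floor(min(H, W)/2)
def ring(i, j):
    return min(min(i, j, H - 1 - i, W - 1 - j) + 1, min(H, W) // 2)

print(abs(ring(0, 0) - ring(H // 2, W // 2)))   # 9: border vs. center
```

The worst-case rotary offset drops from 440 to 9, which is the source of the reduced attention decay.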

The following table summarizes the difference:

Encoding      | Position Assignment  | RoPE Relative Distance
Raster-scan   | $i \cdot W + j$      | grows as $H \cdot W$
PyPE (ring)   | $\mathbb{P}(i, j)$   | $O(\min(H, W)/2)$

PyPE thus enables the model to distribute attention more uniformly, allowing regions in corresponding spatial zones to interact without the penalty of position-induced decay.

4. Algorithmic Implementation

The PyPE algorithm proceeds as follows:

P_max = floor(min(H, W) / 2)
for l in 1 to L:
    if (l mod t == 0) and P_max > 1:
        P_max = P_max - 1
    # ring index: 1 on the border, growing inward, clamped at P_max
    for i in 0 to H - 1:
        for j in 0 to W - 1:
            P[i, j] = min(min(i, j, H - 1 - i, W - 1 - j) + 1, P_max)
    # flatten P to a length-HW vector, append text-token positions
    # apply RoPE using indices from this sequence

Integration into a standard transformer layer involves substituting the current layer's $\mathbb{P}^{(\ell)}$-based indices when computing rotary embeddings for queries and keys.
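A minimal sketch of that substitution—building the per-layer position ids that would be handed to the rotary-embedding computation. The function name, the closed-form ring index, and the convention that text positions continue after the last ring are all assumptions here:

```python
def pype_position_ids(H, W, layer, t, num_text_tokens):
    """Ring-indexed positions for the H*W visual tokens of one layer,
    followed by consecutive positions for the trailing text tokens."""
    # per-layer maximum ring index, shrinking every t layers (floor at 1)
    p_max = max(1, min(H, W) // 2 - layer // t)
    visual = [min(min(i, j, H - 1 - i, W - 1 - j) + 1, p_max)
              for i in range(H) for j in range(W)]
    start = max(visual) + 1          # text positions continue after the rings
    return visual + list(range(start, start + num_text_tokens))

ids = pype_position_ids(21, 21, layer=0, t=2, num_text_tokens=4)
# 441 visual ids in 1..10, then text ids 11..14
```

These ids would replace the default consecutive positions when computing the cos/sin rotation tables for queries and keys in that layer's attention.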

5. Key Hyperparameters and Design Considerations

Several key parameters govern the effectiveness of PyPE:

  • Descent interval $t$: Controls how frequently $P_{\text{max}}$ is decremented. Ablations identified $t = 2$ ("PyPE 2×") as providing the optimal trade-off between fine- and coarse-grained performance.
  • Initial $P_{\text{max}}^{(0)}$: Set to $\left\lfloor \min(H, W) / 2 \right\rfloor$, ensuring the innermost ring encompasses a $1 \times 1$ or $2 \times 2$ patch region.
  • Minimum $P_{\text{max}}$: Floored at 1, at which point all visual positions coalesce into a single ring (the degenerate case).
  • Patch grid size $H \times W$: Standard configurations include $21 \times 21$ for ViT-L/14 at $336 \times 336$ input resolution.
  • RoPE frequencies $\Theta$: Remain unchanged from standard usage, typically with base 10,000.

These parameters balance multi-granularity sensitivity, receptive field growth over depth, and the maintenance of positional discriminability.
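As an illustration of the interplay between $t$ and depth, the per-layer schedule implied by the update rule can be tabulated (the 32-layer depth is an assumed example):

```python
# P_max per layer for a 21x21 grid with descent interval t = 2 ("PyPE 2x"),
# following P_max^(l) = max(1, P_max^(0) - floor(l / t)).
P0, t, L = 21 // 2, 2, 32        # P0 = 10; 32 decoder layers as an example
schedule = [max(1, P0 - l // t) for l in range(L)]
print(schedule)
# Decrements by one every t layers: 10, 10, 9, 9, ..., reaching 1 at layer 18
# and staying there, i.e. all visual tokens eventually share one ring.
```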

6. Empirical Results and Observed Benefits

Quantitative results demonstrate the efficacy of PyPE across a range of VLM scales and benchmarks:

  • MME perception benchmark: PyPE improved combined fine- and coarse-grained scores by +12.3 for the 3B model (1500.7 vs. 1488.4), +32.5 for the 7B model (1542.2 vs. 1510.7), and +48.0 for the 13B model (1629.4 vs. 1581.5).
  • VQAv2: Accuracy increased from 78.93→79.22 (3B), 78.56→79.15 (7B), and 79.14→79.95 (13B).
  • General-multimodal benchmarks (MMBench, POPE, SEED-Bench, MMStar, etc.): PyPE consistently matched or outperformed both raster-scan and concentric PE baselines. For example, MMStar exhibited a +1.52 gain on the 13B model.

Qualitatively, cross-attention heatmaps indicate that PyPE distributes attention more broadly over relevant visual regions, mitigates biased aggregation on anchor tokens, reduces object hallucination, and enhances fine detail recall.

The optimal overall performance was obtained with $t = 2$, whereas overly long intervals ($t > 4$) or overly rapid descent ($t = 1$) degraded either fine- or coarse-grained capabilities.

7. Context and Significance

PyPE replaces the conventional 1D raster position encoding with a dynamic, shrinking, concentric “ring index” scheme that expands the model’s effective central receptive field layer by layer. By maintaining smaller relative positional distances between related tokens, PyPE alleviates RoPE’s long-term decay, permits more rational cross-modal attention allocation, and curbs the over-aggregation of LLM anchor tokens. This enables VLMs to achieve improved performance across diverse perceptual and multimodal reasoning tasks and establishes PyPE as an effective alternative for position encoding in large vision-language transformer models (Chen et al., 19 Jan 2025).
