Polyline Path Masking for ViTs

Updated 13 June 2026
  • The paper presents a novel 2D polyline scanning mask for ViTs that preserves Manhattan spatial adjacency and enhances performance on classification, detection, and segmentation.
  • It details a structured mask construction using vertical-then-horizontal and horizontal-then-vertical paths with learnable decay factors predicted by small MLPs.
  • The study establishes a matrix factorization framework that reduces computational complexity from O(N^2) to O(N) under optimized conditions, enabling efficient attention operations.

Polyline Path Masked Attention (PPMA) is an architectural mechanism designed for Vision Transformers (ViTs) to explicitly encode spatial adjacency priors in 2D image feature maps through a structured masking strategy. PPMA integrates and enhances principles from both self-attention and state-space sequence models, particularly leveraging the structured mask philosophy of Mamba2. By introducing a 2D polyline (L-shaped) scanning mask, PPMA preserves the two-dimensional spatial relationships between patches more effectively than conventional “flatten-row” or “snake” 1D orderings. Empirical benchmarks demonstrate performance gains in classification, object detection, and segmentation over previous state-of-the-art models using both attention and state-space approaches (Zhao et al., 19 Jun 2025).

1. 2D Polyline Path Scanning: Motivation and Construction

Traditional 1D scan orderings for image patches, such as row-major flattening or snake scans, disrupt local adjacency: two spatially adjacent patches in an image can become distant in the 1D sequence, weakening the inductive bias for spatial continuity. PPMA addresses this by defining, for any pair of tokens $(i,j)$ and $(k,\ell)$ on a 2D feature map of size $H \times W$, an L-shaped polyline (a shortest Manhattan path) connecting source and target. There are two such routes:

  • Vertical then horizontal (V2H): $(i,j) \to (k,j) \to (k,\ell)$
  • Horizontal then vertical (H2V): $(i,j) \to (i,\ell) \to (k,\ell)$

The respective contributions from both V2H and H2V are summed to maintain symmetry. Decay factors $\alpha_{i,j}$ (horizontal) and $\beta_{i,j}$ (vertical) are predicted per token using small MLPs followed by a ReLU and exponential activation, enabling learnable, spatially-varying locality control.
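As an illustration of the two routes, the following minimal sketch enumerates the cells visited by each L-shaped path between two tokens. The function name `polyline_paths` and the 0-based (row, column) convention are assumptions made for this example; they do not come from the paper.

```python
def polyline_paths(src, dst):
    """Return the V2H and H2V paths (lists of (row, col) cells) from src to dst."""
    (i, j), (k, l) = src, dst

    def steps(a, b):
        # Integers from a to b inclusive, walking in whichever direction is needed.
        step = 1 if b >= a else -1
        return list(range(a, b + step, step))

    # V2H: move along column j from row i to row k, then along row k to column l.
    v2h = [(r, j) for r in steps(i, k)] + [(k, c) for c in steps(j, l)][1:]
    # H2V: move along row i from column j to column l, then along column l to row k.
    h2v = [(i, c) for c in steps(j, l)] + [(r, l) for r in steps(i, k)][1:]
    return v2h, h2v


v2h, h2v = polyline_paths((0, 0), (2, 3))
print(v2h)  # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
print(h2v)  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```

Both paths visit the same number of cells (the Manhattan distance plus one), which is why either can serve as a shortest route between the two tokens.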

2. Formal Definition of the Polyline Path Mask

The core mask is represented as a 4D tensor $\mathcal{L} \in \mathbb{R}^{H \times W \times H \times W}$, where the element $\mathcal{L}_{i,j,k,\ell}$ aggregates the decay factors along both polyline routes:

$$\mathcal{L}_{i,j,k,\ell} = \left(\prod_{n=j+1}^{\ell} \alpha_{i,n}\right) \left(\prod_{m=i+1}^{k} \beta_{m,\ell}\right) + \left(\prod_{m=i+1}^{k} \beta_{m,j}\right) \left(\prod_{n=j+1}^{\ell} \alpha_{k,n}\right)$$

After computation, $\mathcal{L}$ is unfolded into a standard 2D mask $L \in \mathbb{R}^{HW \times HW}$: $L_{(i-1)W + j,\,(k-1)W + \ell} = \mathcal{L}_{i,j,k,\ell}$

This construction directly preserves Manhattan distance-based adjacency and can be interpreted as a highly structured input-dependent positional bias.
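The definition can be checked numerically with a direct, entry-by-entry fill. The sketch below is only a reference construction in NumPy, not the paper's implementation. It assumes 0-based indices, a row-major unfolding, and that when the target lies above or to the left of the source the product runs over the factors between the two positions (a symmetric convention adopted here, since the formula as written covers $k \ge i$ and $\ell \ge j$). The names `decay_product` and `naive_polyline_mask` are hypothetical.

```python
import numpy as np

def decay_product(factors, a, b):
    """Product of the factors strictly after min(a, b) up to max(a, b) (assumed symmetric)."""
    lo, hi = min(a, b), max(a, b)
    return float(np.prod(factors[lo + 1 : hi + 1]))  # an empty product is 1.0

def naive_polyline_mask(alpha, beta):
    """Fill the 4D mask entry by entry, summing the contributions of both routes."""
    H, W = alpha.shape
    mask = np.zeros((H, W, H, W))
    for i in range(H):
        for j in range(W):
            for k in range(H):
                for l in range(W):
                    # First term: along row i (alpha), then along column l (beta).
                    h2v = decay_product(alpha[i, :], j, l) * decay_product(beta[:, l], i, k)
                    # Second term: along column j (beta), then along row k (alpha).
                    v2h = decay_product(beta[:, j], i, k) * decay_product(alpha[k, :], j, l)
                    mask[i, j, k, l] = h2v + v2h
    return mask

rng = np.random.default_rng(0)
H, W = 3, 4
alpha = rng.uniform(0.5, 1.0, size=(H, W))  # horizontal decay factors
beta = rng.uniform(0.5, 1.0, size=(H, W))   # vertical decay factors

mask4d = naive_polyline_mask(alpha, beta)
mask2d = mask4d.reshape(H * W, H * W)  # row-major unfolding of (i, j) and (k, l)
print(mask2d.shape)  # (12, 12)
print(mask2d[0, 0])  # 2.0: both routes reduce to empty products on the diagonal
```

Under the formula as written, every diagonal entry equals 2 because each route contributes an empty product; entries shrink as the Manhattan distance between the two tokens grows.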

3. Theoretical Analysis and Complexity

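PPMA’s mask admits a matrix-decomposition theorem. Consider the contribution of a single route, and write $A^{(i)} \in \mathbb{R}^{W \times W}$ for a per-row matrix and $B^{(\ell)} \in \mathbb{R}^{H \times H}$ for a per-column matrix (notation adopted here). If these can be constructed such that $\mathcal{L}_{i,j,k,\ell} = A^{(i)}_{j,\ell}\, B^{(\ell)}_{i,k}$, for example with $A^{(i)}_{j,\ell} = \prod_{n=j+1}^{\ell} \alpha_{i,n}$ and $B^{(\ell)}_{i,k} = \prod_{m=i+1}^{k} \beta_{m,\ell}$, then the unfolded mask factorizes as the product of a matrix assembled from the $A^{(i)}$ and a matrix assembled from the $B^{(\ell)}$ or, entrywise, as a Hadamard product $\odot$ of the two families of entries. The full polyline mask is the sum of two such terms, one per route.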

This factorization enables a two-stage matrix-vector product in $O(HW(H+W))$ time, that is, $O(N^{3/2})$ for $N = HW$ tokens on a square feature map: the input is reshaped to $H \times W$, each column is multiplied by its per-column matrix, each row is multiplied by its per-row matrix, and the results are stacked. When the constituent matrices are 1-semiseparable and chunkwise Mamba2 algorithms are employed, the total complexity can be further reduced to $O(N)$.

| Operation | Naive complexity | Efficient (factorized) complexity |
|---|---|---|
| Fill the mask entrywise | $O(N^{5/2})$ | $O(N^2)$ |
| Matrix-vector product | $O(N^2)$ | $O(N^{3/2})$ / $O(N)$ (chunkwise) |

The explicit construction provides not only accuracy benefits but also provably efficient computational routines suitable for high-resolution vision.
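The two-stage product can be sketched in NumPy and compared against a dense reference. This is an illustrative sketch under stated assumptions rather than the authors' code: the per-row and per-column matrices are built densely for clarity (which is itself more expensive than the two einsum stages), products toward smaller indices use the symmetric convention that the factors between the two positions are multiplied, and all function names are hypothetical.

```python
import numpy as np

def semiseparable(factors):
    """Dense matrix M with M[a, b] = product of factors[min(a,b)+1 .. max(a,b)]."""
    n = len(factors)
    M = np.ones((n, n))
    for a in range(n):
        for b in range(n):
            lo, hi = min(a, b), max(a, b)
            M[a, b] = np.prod(factors[lo + 1 : hi + 1])
    return M

def row_col_matrices(alpha, beta):
    H, W = alpha.shape
    A = np.stack([semiseparable(alpha[i, :]) for i in range(H)])  # (H, W, W), one per row
    B = np.stack([semiseparable(beta[:, l]) for l in range(W)])   # (W, H, H), one per column
    return A, B

def factorized_matvec(alpha, beta, x):
    """Apply the unfolded mask to x without materializing the HW x HW matrix."""
    H, W = alpha.shape
    X = x.reshape(H, W)
    A, B = row_col_matrices(alpha, beta)
    # First route: multiply each column l by B[l], then each row i by A[i].
    Z1 = np.einsum("lik,kl->il", B, X)
    Y1 = np.einsum("ijl,il->ij", A, Z1)
    # Second route: multiply each row k by A[k], then each column j by B[j].
    Z2 = np.einsum("kjl,kl->kj", A, X)
    Y2 = np.einsum("jik,kj->ij", B, Z2)
    return (Y1 + Y2).reshape(H * W)

def dense_mask(alpha, beta):
    """Reference only: mask[i,j,k,l] = A[i][j,l] * B[l][i,k] + B[j][i,k] * A[k][j,l]."""
    H, W = alpha.shape
    A, B = row_col_matrices(alpha, beta)
    mask = np.einsum("ijl,lik->ijkl", A, B) + np.einsum("jik,kjl->ijkl", B, A)
    return mask.reshape(H * W, H * W)

rng = np.random.default_rng(0)
H, W = 3, 5
alpha = rng.uniform(0.5, 1.0, size=(H, W))
beta = rng.uniform(0.5, 1.0, size=(H, W))
x = rng.standard_normal(H * W)

y_fast = factorized_matvec(alpha, beta, x)
y_ref = dense_mask(alpha, beta) @ x
print(np.allclose(y_fast, y_ref))
```

Each einsum stage touches $H$ rows of length $W$ with a $W \times W$ matrix, or $W$ columns of length $H$ with an $H \times H$ matrix, which is where the $HW(H+W)$ operation count comes from.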

4. Efficient Computation Algorithm

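The efficient computation proceeds through per-row and per-column scans leveraging the decomposition. Given input decay maps $\alpha, \beta \in \mathbb{R}^{H \times W}$ and a vector $x \in \mathbb{R}^{HW}$, the masked product is obtained by reshaping $x$ to $H \times W$, applying a decay-weighted scan along one axis, applying a second scan along the other axis, repeating this for each of the two routes, and summing the two results.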

This structure reduces space and runtime overhead relative to naive $O(N^2)$ attention masking.
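The linear-time behaviour of a single scan can be illustrated with the standard recurrence for a 1-semiseparable matrix. The sketch below is a plain sequential scan, not the chunkwise Mamba2 algorithm, and it assumes a symmetric matrix whose entry $(t, s)$ is the product of the decay factors between positions $s$ and $t$. The function names are hypothetical.

```python
import numpy as np

def symmetric_scan(a, x):
    """Multiply the symmetric 1-semiseparable matrix defined by a with x in O(n)."""
    n = len(x)
    fwd = np.empty(n)
    bwd = np.empty(n)
    acc = 0.0
    for t in range(n):
        # Forward recurrence: contributions from positions s <= t.
        acc = a[t] * acc + x[t]
        fwd[t] = acc
    acc = 0.0
    for t in range(n - 1, -1, -1):
        # Backward recurrence: contributions from positions s >= t.
        # a[t + 1] links position t to t + 1; the last position has no successor.
        acc = x[t] + (a[t + 1] * acc if t + 1 < n else 0.0)
        bwd[t] = acc
    # The diagonal term x[t] is counted in both passes, so subtract it once.
    return fwd + bwd - x

def dense_semiseparable(a):
    """Reference only: M[t, s] = product of a[min(t,s)+1 .. max(t,s)]."""
    n = len(a)
    M = np.ones((n, n))
    for t in range(n):
        for s in range(n):
            lo, hi = min(t, s), max(t, s)
            M[t, s] = np.prod(a[lo + 1 : hi + 1])
    return M

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, size=8)
x = rng.standard_normal(8)
print(np.allclose(symmetric_scan(a, x), dense_semiseparable(a) @ x))
```

Replacing each dense row or column multiplication with such a scan costs time proportional to the row or column length, so a full pass over the feature map is linear in the number of tokens.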

5. Integration into Vision Transformer Attention

The unfolded polyline mask $L$ is incorporated into the standard self-attention formula, yielding Polyline Path Masked Vanilla Attention (PPMVA), where $Q$, $K$, and $V$ are the query, key, and value projections of the input $X$:

$$\mathrm{PPMVA}(X) = \left(\mathrm{softmax}(QK^\top) \odot L\right) V$$

Alternatively, $L$ can be absorbed as an additive positional bias, compatible with various attention formulations, including softmax, linear kernel, criss-cross (sparse), and decomposed attention mechanisms. Efficient masked mat-vec operations are directly applied in all such cases.
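The two ways of combining a mask with attention can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: the scaling by $\sqrt{d}$, the placement of the elementwise product after the softmax, and the use of the log of the mask as the additive bias are all assumptions made for this example, and the mask used here is a simple distance-decay stand-in rather than a polyline path mask.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_multiplicative(Q, K, V, mask):
    """Elementwise (Hadamard) weighting of the softmax attention map by the mask."""
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (weights * mask) @ V

def masked_attention_additive(Q, K, V, mask):
    """Mask absorbed as an additive bias: log(mask) is added to the logits (assumed form)."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1]) + np.log(mask)
    return softmax(logits) @ V

rng = np.random.default_rng(2)
N, d = 6, 4
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Stand-in mask (hypothetical): decays with the distance between token indices.
idx = np.arange(N)
mask = 0.9 ** np.abs(idx[:, None] - idx[None, :])

out_mul = masked_attention_multiplicative(Q, K, V, mask)
out_add = masked_attention_additive(Q, K, V, mask)
print(out_mul.shape, out_add.shape)  # (6, 4) (6, 4)
```

The multiplicative form leaves each row of attention weights unnormalized after masking, whereas the additive form renormalizes through the softmax; which behaviour is preferable depends on the attention variant being used.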

6. Empirical Performance and Ablations

PPMA’s empirical evaluation encompasses ImageNet-1K (classification), COCO-2017 (object detection/segmentation), and ADE20K (semantic segmentation) using standard ViT backbones. Key metrics include top-1 classification accuracy, box/mask average precision (AP), and mean intersection-over-union (mIoU). Variants are defined by model scale:

| Variant | Params | FLOPs | ImageNet Top-1 | COCO Box AP / Mask AP | ADE20K mIoU |
|---|---|---|---|---|---|
| PPMA-T | 14M | 2.7G | 82.6% | 47.1% / 42.4% | 48.7% |
| PPMA-S | 27M | 4.9G | 84.2% | | 51.1% |
| PPMA-B | 54M | 10.6G | 85.0% | 51.1% / 45.5% | 52.3% |
| RMT-B | 54M | 10.6G | 84.9% | 50.7% / 45.1% | 52.0% |

Ablation studies reveal compounded benefits through hierarchical enhancements, with maximal performance achieved by combining polyline masking with RMT decay, cross-scan, and both V2H/H2V paths. For example, on PPMA-T, combining all factors yields 82.60% accuracy and 48.73 mIoU.

7. Advantages and Limitations

PPMA explicitly encodes a 2D Manhattan adjacency prior, augmenting the standard global context modeling of ViTs with strong local continuity inductive bias. This structured mask is agnostic to the underlying attention type and can be integrated into diverse attention variants with minor computational adaptation. Theoretical decomposability underpins both correctness and computational efficiency.

Limitations include increased GPU memory demand and reduced throughput relative to 1D-masked state-space models (approximately 20–40% slower than RMT). The current implementation is in PyTorch; fused CUDA or Triton kernels are suggested as a route to acceleration.

In summary, PPMA combines global self-attention with explicit, learnable spatial priors, supported by a suite of structural theorems and efficient algorithms, yielding empirically robust improvements in standard visual recognition tasks (Zhao et al., 19 Jun 2025).
