Polyline Path Masking for ViTs
- The paper presents a novel 2D polyline scanning mask for ViTs that preserves Manhattan spatial adjacency and enhances performance on classification, detection, and segmentation.
- It details a structured mask construction using vertical-then-horizontal and horizontal-then-vertical paths with learnable decay factors predicted by small MLPs.
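- The study establishes a matrix factorization framework that reduces the computational complexity of applying the mask from $O(N^2)$ to $O(N)$ for $N$ tokens when the factor matrices are 1-semiseparable and chunkwise Mamba2 algorithms are used, enabling efficient attention operations.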
Polyline Path Masked Attention (PPMA) is an architectural mechanism designed for Vision Transformers (ViTs) to explicitly encode spatial adjacency priors in 2D image feature maps through a structured masking strategy. PPMA integrates and enhances principles from both self-attention and state-space sequence models, particularly leveraging the structured mask philosophy of Mamba2. By introducing a 2D polyline (L-shaped) scanning mask, PPMA preserves the two-dimensional spatial relationships between patches more effectively than conventional “flatten-row” or “snake” 1D orderings. Empirical benchmarks demonstrate performance gains in classification, object detection, and segmentation over previous state-of-the-art models using both attention and state-space approaches (Zhao et al., 19 Jun 2025).
1. 2D Polyline Path Scanning: Motivation and Construction
Traditional 1D scan orderings for image patches, such as row-major flattening or snake scans, disrupt local adjacency: two spatially adjacent patches in an image can become distant in the 1D sequence, weakening the inductive bias for spatial continuity. PPMA addresses this by defining, for any pair of tokens $(i, j)$ and $(k, l)$ on a 2D feature map of size $H \times W$, an L-shaped polyline (a shortest Manhattan path) connecting source and target. There are two such routes:
- Vertical then horizontal (V2H): $(i, j) \to (k, j) \to (k, l)$
- Horizontal then vertical (H2V): $(i, j) \to (i, l) \to (k, l)$
The contributions from V2H and H2V are summed to maintain symmetry. Decay factors $\alpha_{i,j}$ (horizontal) and $\beta_{i,j}$ (vertical) are predicted per token using small MLPs followed by a ReLU and exponential activation, enabling learnable, spatially-varying locality control.
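The two routes can be made concrete by listing the grid cells each one visits. The following is a minimal sketch, not code from the paper: the function name `polyline_path`, the zero-based `(row, col)` indexing, and the choice to include both endpoints are assumptions made for illustration.

```python
def polyline_path(src, dst, order="v2h"):
    """Return the grid cells on an L-shaped path from src to dst.

    src and dst are (row, col) tuples; both endpoints are included.
    order="v2h" changes the row first, order="h2v" changes the column first.
    (Hypothetical helper for illustration only.)
    """
    (i, j), (k, l) = src, dst
    row_step = 1 if k >= i else -1
    col_step = 1 if l >= j else -1
    rows = list(range(i, k + row_step, row_step))  # i, ..., k
    cols = list(range(j, l + col_step, col_step))  # j, ..., l
    if order == "v2h":
        # Move along column j from row i to row k, then along row k to column l.
        return [(r, j) for r in rows] + [(k, c) for c in cols[1:]]
    if order == "h2v":
        # Move along row i from column j to column l, then along column l to row k.
        return [(i, c) for c in cols] + [(r, l) for r in rows[1:]]
    raise ValueError("order must be 'v2h' or 'h2v'")
```

Both routes visit $|i - k| + |j - l| + 1$ cells, so they have the same Manhattan length and differ only in which corner of the bounding rectangle they pass through.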
2. Formal Definition of the Polyline Path Mask
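The core mask is represented as a 4D tensor $L \in \mathbb{R}^{H \times W \times H \times W}$, where the element $L_{i,j,k,l}$ aggregates the decay factors along both polyline routes between tokens $(i, j)$ and $(k, l)$:
$$L_{i,j,k,l} = \underbrace{\beta_{i:k,\,j}\;\alpha_{k,\,j:l}}_{\text{V2H}} \;+\; \underbrace{\alpha_{i,\,j:l}\;\beta_{i:k,\,l}}_{\text{H2V}}$$
Here $\alpha_{i,\,j:l}$ denotes the product of the horizontal decay factors traversed on row $i$ between columns $j$ and $l$, and $\beta_{i:k,\,l}$ the product of the vertical decay factors traversed on column $l$ between rows $i$ and $k$; an empty product equals 1.
After computation, $L$ is unfolded into a standard 2D mask $M \in \mathbb{R}^{HW \times HW}$: $M_{(i-1)W + j,\;(k-1)W + l} = L_{i,j,k,l}$.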
This construction directly preserves Manhattan distance-based adjacency and can be interpreted as a highly structured input-dependent positional bias.
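A brute-force construction helps to check what the mask contains. The sketch below is illustrative and not the authors' implementation. It assumes zero-based indices, and it assumes that the decay product between positions $a$ and $b$ runs over the indices in $(\min(a,b), \max(a,b)]$, a convention borrowed from 1-semiseparable matrices that the text above does not confirm. The names `segment_product` and `naive_polyline_mask` are hypothetical.

```python
import numpy as np

def segment_product(values, a, b):
    """Product of values over indices (min(a, b), max(a, b)]; empty product is 1.

    The half-open index range is an assumption made for this sketch.
    """
    lo, hi = min(a, b), max(a, b)
    return float(np.prod(values[lo + 1 : hi + 1]))

def naive_polyline_mask(alpha, beta):
    """Build the 4D mask L and its (HW x HW) unfolding M entry by entry.

    alpha, beta: (H, W) arrays of horizontal and vertical decay factors.
    """
    H, W = alpha.shape
    L = np.zeros((H, W, H, W))
    for i in range(H):
        for j in range(W):
            for k in range(H):
                for l in range(W):
                    # V2H: along column j from row i to k, then along row k from j to l.
                    v2h = segment_product(beta[:, j], i, k) * segment_product(alpha[k, :], j, l)
                    # H2V: along row i from column j to l, then along column l from i to k.
                    h2v = segment_product(alpha[i, :], j, l) * segment_product(beta[:, l], i, k)
                    L[i, j, k, l] = v2h + h2v
    # Row-major unfolding: entry (i, j, k, l) lands at (i * W + j, k * W + l).
    return L, L.reshape(H * W, H * W)
```

Under these assumptions, a constant decay $\gamma$ in both directions gives $L_{i,j,k,l} = 2\gamma^{|i-k| + |j-l|}$, a pure function of Manhattan distance, and the V2H term from source to target equals the H2V term from target to source, which is why the sum is symmetric.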
3. Theoretical Analysis and Complexity
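PPMA’s mask admits a matrix-decomposition theorem. For any such mask $M \in \mathbb{R}^{HW \times HW}$, if one can construct per-row matrices $A^i \in \mathbb{R}^{W \times W}$ and per-column matrices $B^l \in \mathbb{R}^{H \times H}$ such that $M_{(i,j),(k,l)} = [A^i]_{j,l}\,[B^l]_{i,k}$, then $M$ factorizes as $M = M^A M^B$, with $M^A_{(i,j),(k,l)} = \delta_{i,k}\,[A^i]_{j,l}$ and $M^B_{(i,j),(k,l)} = \delta_{j,l}\,[B^l]_{i,k}$, or, entrywise, as $L = \mathcal{A} \odot \mathcal{B}$ with the broadcast tensors $\mathcal{A}_{i,j,k,l} = [A^i]_{j,l}$ and $\mathcal{B}_{i,j,k,l} = [B^l]_{i,k}$, where $\odot$ denotes the Hadamard product.
This factorization enables a two-stage matrix-vector product in $O(N^{3/2})$ time for $N = HW$ tokens: reshape the input to $X \in \mathbb{R}^{H \times W}$, multiply each column $l$ by $B^l$, then multiply each row $i$ by $A^i$, and stack the results. When the constituent matrices are 1-semiseparable and chunkwise Mamba2 algorithms are employed, the total complexity can be further reduced to $O(N)$.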
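The two-stage product and its cost can be spelled out in a short derivation. It is a sketch under stated assumptions: the mask entry is written as $M_{(i,j),(k,l)} = [A^i]_{j,l}\,[B^l]_{i,k}$ with per-row matrices $A^i$ and per-column matrices $B^l$, $X$ is the $H \times W$ reshaping of the input vector, the intermediate $Z$ is a name introduced here, and dense multiplication is used in both stages.

```latex
\begin{aligned}
y_{(i,j)} &= \sum_{k=1}^{H}\sum_{l=1}^{W} [A^{i}]_{j,l}\,[B^{l}]_{i,k}\,X_{k,l}
           = \sum_{l=1}^{W} [A^{i}]_{j,l}\,Z_{i,l},
\qquad Z_{i,l} = \sum_{k=1}^{H} [B^{l}]_{i,k}\,X_{k,l}, \\
\text{cost} &= \underbrace{W \cdot H^{2}}_{\text{column stage}}
             + \underbrace{H \cdot W^{2}}_{\text{row stage}}
             = HW\,(H + W) = O\!\left(N^{3/2}\right)
\quad \text{for } H = W = \sqrt{N}.
\end{aligned}
```

The inner sum over $k$ depends only on the column $l$ and the target row $i$, which is why the column stage can be computed once and reused by every target column $j$ in the row stage.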
| Operation | Naive Complexity | Efficient (Factorized) Complexity |
|---|---|---|
| Fill $M$ entrywise | $O(N^{5/2})$ | $O(N^2)$ |
| Matrix-vector product $y = Mx$ | $O(N^2)$ | $O(N^{3/2})$ / $O(N)$ (chunkwise) |
The explicit construction provides not only accuracy benefits but also provably efficient computational routines suitable for high-resolution vision.
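The drop to linear time rests on the fact that a matrix built from cumulative decay products can be applied with a running recurrence instead of a dense product. The sketch below shows the sequential form of that idea for a single row or column. It is not the chunkwise Mamba2 algorithm, which is a block-parallel variant of the same recurrence. The function name and the index convention (products over $(\min(t,s), \max(t,s)]$) are assumptions made for illustration.

```python
import numpy as np

def semiseparable_matvec(a, x):
    """Apply S[t, s] = prod(a[min(t, s) + 1 : max(t, s) + 1]) to x in O(n) time.

    a: length-n decay factors, x: length-n input. S is never materialized.
    """
    n = len(x)
    fwd = np.zeros(n)
    bwd = np.zeros(n)
    # Forward scan covers sources s <= t: fwd[t] = a[t] * fwd[t - 1] + x[t].
    acc = 0.0
    for t in range(n):
        acc = a[t] * acc + x[t]
        fwd[t] = acc
    # Backward scan covers sources s > t: bwd[t] = a[t + 1] * (x[t + 1] + bwd[t + 1]).
    for t in range(n - 2, -1, -1):
        bwd[t] = a[t + 1] * (x[t + 1] + bwd[t + 1])
    return fwd + bwd
```

Each scan touches every element once, so applying it to all $H$ rows and all $W$ columns costs $O(HW) = O(N)$ in total.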
4. Efficient Computation Algorithm
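The efficient computation proceeds through per-row and per-column scans leveraging the decomposition. Given input decay maps $\alpha, \beta \in \mathbb{R}^{H \times W}$ and a vector $x \in \mathbb{R}^{HW}$, a mask of the form $M_{(i,j),(k,l)} = [A^i]_{j,l}\,[B^l]_{i,k}$ is applied as follows:
- Reshape $x$ into $X \in \mathbb{R}^{H \times W}$.
- Column scan: for each column $l$, compute $Z_{:,l} = B^l X_{:,l}$, where $B^l \in \mathbb{R}^{H \times H}$ is built from the vertical decay factors $\beta_{:,l}$.
- Row scan: for each row $i$, compute $Y_{i,:} = A^i Z_{i,:}$, where $A^i \in \mathbb{R}^{W \times W}$ is built from the horizontal decay factors $\alpha_{i,:}$.
- Stack the rows of $Y$ into the output $y \in \mathbb{R}^{HW}$.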
This structure reduces space and runtime overhead relative to naive $O(N^2)$ attention masking.
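The scan-based product can be checked against a dense mask on a small grid. The sketch below is illustrative rather than the paper's algorithm. It assumes the mask is the sum of an H2V term (columns first, then rows) and a V2H term (rows first, then columns), that decay products run over indices in $(\min, \max]$, and it uses dense per-row and per-column matrices for clarity. The names `decay_matrix` and `polyline_matvec` are hypothetical.

```python
import numpy as np

def decay_matrix(a):
    """Dense symmetric matrix S[t, s] = prod(a[min(t, s) + 1 : max(t, s) + 1])."""
    n = len(a)
    S = np.ones((n, n))
    for t in range(n):
        for s in range(n):
            lo, hi = min(t, s), max(t, s)
            S[t, s] = np.prod(a[lo + 1 : hi + 1])
    return S

def polyline_matvec(alpha, beta, x):
    """Compute y = M x without forming the (HW x HW) mask M.

    alpha, beta: (H, W) decay maps; x: length H * W vector in row-major order.
    """
    H, W = alpha.shape
    X = x.reshape(H, W)
    A = [decay_matrix(alpha[i, :]) for i in range(H)]  # per-row, W x W
    B = [decay_matrix(beta[:, l]) for l in range(W)]   # per-column, H x H
    # H2V term: scan every column with B[l], then every row with A[i].
    Z = np.stack([B[l] @ X[:, l] for l in range(W)], axis=1)
    Y_h2v = np.stack([A[i] @ Z[i, :] for i in range(H)], axis=0)
    # V2H term: scan every row with A[k], then every column with B[j].
    R = np.stack([A[k] @ X[k, :] for k in range(H)], axis=0)
    Y_v2h = np.stack([B[j] @ R[:, j] for j in range(W)], axis=1)
    return (Y_h2v + Y_v2h).reshape(H * W)
```

With dense factors this costs $O(HW(H + W))$; replacing each dense product with a linear-time scan over the decay factors would bring the total to $O(HW)$.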
5. Integration into Vision Transformer Attention
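The polyline mask $M$ is incorporated into the standard self-attention formula, yielding Polyline Path Masked Vanilla Attention (PPMVA):
$$\mathrm{PPMVA}(Q, K, V) = \left(\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) \odot M\right) V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, $d$ is the head dimension, and $\odot$ is the Hadamard product.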
Alternatively, $M$ can be absorbed as an additive positional bias, compatible with various attention formulations, including softmax, linear kernel, criss-cross (sparse), and decomposed attention mechanisms. Efficient masked mat-vec operations are directly applied in all such cases.
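A dense reference version of masked softmax attention is useful for testing faster variants. The sketch below assumes the mask multiplies the softmax weights elementwise, which is one reading of the description above. It handles a single head, materializes the full $N \times N$ weight matrix, and uses the hypothetical name `masked_softmax_attention`.

```python
import numpy as np

def masked_softmax_attention(Q, K, V, M):
    """Compute (softmax(Q K^T / sqrt(d)) * M) @ V for one attention head.

    Q, K: (N, d) arrays; V: (N, d_v) array; M: (N, N) structured mask.
    The mask is applied after the softmax, so rows are not renormalized.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights * M) @ V
```

An all-ones mask recovers ordinary softmax attention. Because the mask is applied after normalization, entries of $M$ below 1 shrink the total weight a token assigns to distant positions rather than redistributing it.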
6. Empirical Performance and Ablations
PPMA’s empirical evaluation encompasses ImageNet-1K (classification), COCO-2017 (object detection/segmentation), and ADE20K (semantic segmentation) using standard ViT backbones. Key metrics include top-1 classification accuracy, box/mask average precision (AP), and mean intersection-over-union (mIoU). Variants are defined by model scale:
| Variant | Params | FLOPs | ImageNet Top-1 | COCO Box AP/Mask AP | ADE20K mIoU |
|---|---|---|---|---|---|
| PPMA-T | 14M | 2.7G | 82.6% | 47.1% / 42.4% | 48.7% |
| PPMA-S | 27M | 4.9G | 84.2% | — | 51.1% |
| PPMA-B | 54M | 10.6G | 85.0% | 51.1% / 45.5% | 52.3% |
| RMT-B | 54M | 10.6G | 84.9% | 50.7% / 45.1% | 52.0% |
Ablation studies reveal compounded benefits through hierarchical enhancements, with maximal performance achieved by combining polyline masking with RMT decay, cross-scan, and both V2H/H2V paths. For example, on PPMA-T, combining all factors yields 82.60% accuracy and 48.73 mIoU.
7. Advantages and Limitations
PPMA explicitly encodes a 2D Manhattan adjacency prior, augmenting the standard global context modeling of ViTs with strong local continuity inductive bias. This structured mask is agnostic to the underlying attention type and can be integrated into diverse attention variants with minor computational adaptation. Theoretical decomposability underpins both correctness and computational efficiency.
Limitations include increased GPU memory demand and reduced throughput relative to 1D-masked state-space models (approximately 20–40% slower than RMT). The current implementation is in PyTorch; fused CUDA or Triton kernels are suggested as a route to further acceleration.
In summary, PPMA combines global self-attention with explicit, learnable spatial priors, supported by a suite of structural theorems and efficient algorithms, yielding empirically robust improvements in standard visual recognition tasks (Zhao et al., 19 Jun 2025).