Selective Stripe Position Encoder (S²PE)
- The paper introduces S²PE, which injects vertical positional information via lightweight 1D dilated convolution to regain lost 2D context in high-resolution whole-slide images.
- S²PE efficiently addresses the vertical context loss inherent in flattening WSIs by operating on stripe-wise segments, thereby reducing bias in row-major sequence models.
- Empirical studies show that integrating S²PE improves diagnostic AUC, ACC, and F1 scores in MIL frameworks, with ablation tests confirming its additive benefit over standard encoders.
The Selective Stripe Position Encoder (S²PE) is a positional encoding module designed for integration in state-space sequence models used for gigapixel whole-slide image (WSI) analysis under the multiple instance learning (MIL) paradigm. S²PE explicitly addresses the loss of vertical spatial context that occurs when high-resolution WSIs are flattened into long 1D feature sequences for processing by models such as Mamba, which inherently learn horizontal correlations but ignore the vertical structure of the original 2D data. S²PE employs lightweight stripe-wise 1D dilated convolution in the vertical direction, efficiently restoring vertical spatial continuity without incurring the redundancy or conflict of traditional 2D position encoding schemes.
1. Motivation and Background
High-resolution WSIs are typically decomposed into hundreds to tens of thousands of fixed-size patches and then converted into 1D sequences of embeddings for MIL frameworks. State-space architectures such as Mamba process these ultra-long sequences efficiently but operate under a row-major scanning assumption: sequential token updates follow horizontal (row-wise) continuity, so vertical context awareness is limited. Standard 2D positional encoding techniques, such as PEG, PPEG, and EPEG, either introduce excessive computation or degrade performance by clashing with the model’s inductive bias, as evidenced by a 0.8–2.1% AUC drop when used with MambaMIL+ (see Fig. 12 in Zeng et al., 19 Dec 2025).
S²PE was conceived to inject lightweight, vertical stripe-based positional information. By targeting only the vertical axis, S²PE aims to preserve and reconstitute lost 2D spatial dependencies and mitigate the artificial temporal gaps induced by flattening, without adding the overhead and redundancy of full 2D positional encoding.
2. Architectural Overview
S²PE operates within a multi-stage sequence reconstitution and encoding process, with key stages summarized in the following table.
| Stage | Operation | Output Shape |
|---|---|---|
| Overlapping Scanning | Decompose and flatten WSI patches | |
| Masking | Apply instance-level binary mask | |
| 1D-2D Reshaping | Map sequence to padded 2D grid | |
| Stripe-wise 1D Dilated Conv | Apply grouped 1D dilated convolution along the vertical axis | |
| Flatten Back | Convert 2D map back to 1D sequence | |
| Fusion | Add (or concatenate+project) with original features | |
- Input: Overlapping-scanned patch embedding sequence with mask from Contextual Token Selection (CTS).
- Reshape: Map the masked 1D sequence to a 2D feature map (zero-padding as needed).
- Stripe Encoding: Apply a single stripe-wise 1D dilated convolution (typically kernel size 3, dilation 2) along each vertical column independently, with weights shared across channels.
- Output: Flatten the convolved 2D output back to 1D and combine with the original embeddings via addition (or concatenation and projection) before feeding to Mamba blocks.
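The stages listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stripe_position_encode`, the explicit `width` argument, the "same"-style zero padding at the top and bottom of each column, and the use of one kernel shared by all columns and channels are all choices made here for readability. A practical version would use a grouped dilated convolution in a deep learning framework.

```python
import math


def stripe_position_encode(tokens, mask, width, kernel, dilation=2):
    """Hypothetical sketch of the S2PE stages on plain Python lists.

    tokens: list of L feature vectors (each a list of D floats)
    mask:   list of L ints (1 = keep token, 0 = suppress it)
    width:  number of columns in the 2D grid (assumed to be given)
    kernel: list of k weights, shared by all columns and channels (assumption)
    """
    length, dim = len(tokens), len(tokens[0])

    # Masking: zero out tokens rejected by the binary mask.
    masked = [[v * m for v in tok] for tok, m in zip(tokens, mask)]

    # 1D -> 2D reshaping: row-major grid, zero-padded at the end.
    height = math.ceil(length / width)
    padded = masked + [[0.0] * dim for _ in range(height * width - length)]
    grid = [padded[r * width:(r + 1) * width] for r in range(height)]

    # Stripe-wise 1D dilated convolution down each column.
    half = (len(kernel) // 2) * dilation
    encoded = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for c in range(width):
        for r in range(height):
            for j, weight in enumerate(kernel):
                src = r + j * dilation - half  # row read by kernel tap j
                if 0 <= src < height:          # rows outside the grid count as zero
                    for ch in range(dim):
                        encoded[r][c][ch] += weight * grid[src][c][ch]

    # Flatten back to 1D and drop the padding cells.
    flat = [encoded[r][c] for r in range(height) for c in range(width)][:length]

    # Fusion by addition with the original (unmasked) embeddings.
    return [[x + p for x, p in zip(tok, pos)] for tok, pos in zip(tokens, flat)]
```

With a kernel of size 3 and dilation 2, each token in this sketch receives signal from the tokens two rows above and two rows below it in the same column, which are `2 * width` positions away in the flattened sequence.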
3. Mathematical Formulation and Algorithm
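The formulation takes two inputs: the overlapping-scanned embeddings and the binary mask. It uses four operators:
- a reshape operator, which maps the 1D sequence to the padded 2D grid;
- an inverse flattening operator, which maps the 2D grid back to a 1D sequence;
- an element-wise product along the sequence dimension, which applies the mask;
- a 1D dilated convolution over the vertical dimension, grouped by columns.
S²PE masks the embeddings, reshapes them to the 2D grid, applies the dilated convolution (typically kernel size 3, dilation 2), flattens the result, and fuses it with the original embeddings.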
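One consistent way to write this computation is given below. The symbol names ($X$, $M$, $\mathcal{R}$, $\mathcal{R}^{-1}$, $\mathrm{DConv}$, $P$, $X'$, $H$, $W$, $D$) are assumptions chosen for illustration and may differ from the notation of the original paper. The sequence length $4N$ follows the token count mentioned for the reshape step, and the additive fusion variant is shown.

```latex
\begin{aligned}
X &\in \mathbb{R}^{4N \times D}, \qquad M \in \{0,1\}^{4N}, \qquad HW \ge 4N, \\
\mathcal{R} &: \mathbb{R}^{4N \times D} \to \mathbb{R}^{H \times W \times D}, \qquad
\mathcal{R}^{-1} : \mathbb{R}^{H \times W \times D} \to \mathbb{R}^{4N \times D}, \\
P &= \mathcal{R}^{-1}\!\Big(\mathrm{DConv}_{k,d}\big(\mathcal{R}(M \odot X)\big)\Big), \qquad k = 3,\ d = 2, \\
X' &= X + P .
\end{aligned}
```

Here $\mathcal{R}$ is assumed to zero-pad the sequence to $HW$ entries before reshaping, $\mathcal{R}^{-1}$ to discard those padding entries, and $\mathrm{DConv}_{k,d}$ to act along the $H$ axis only.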
4. Bias Mitigation and Spatial Modeling
Sequential row-major scanning in Mamba induces a strong inductive bias, causing temporally distant treatment of vertically adjacent patches. S²PE’s stripe-wise vertical convolutions “short-circuit” this bias by establishing connections between vertical neighbors, thus faithfully modeling 2D spatial continuity otherwise lost in the 1D representation.
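The size of this gap follows directly from row-major index arithmetic. The short sketch below uses a hypothetical helper `sequence_gap` and a hypothetical grid width of 200 patches to compare how far apart horizontal and vertical neighbors end up after flattening.

```python
def sequence_gap(pos_a, pos_b, width):
    """Distance between two grid cells after row-major flattening."""
    (row_a, col_a), (row_b, col_b) = pos_a, pos_b
    return abs((row_a * width + col_a) - (row_b * width + col_b))


GRID_WIDTH = 200  # hypothetical number of patches per row

# Horizontal neighbors stay adjacent in the flattened sequence.
horizontal_gap = sequence_gap((10, 50), (10, 51), GRID_WIDTH)

# Vertical neighbors are separated by a full row of tokens.
vertical_gap = sequence_gap((10, 50), (11, 50), GRID_WIDTH)

print(horizontal_gap, vertical_gap)
```

In this setting the horizontal neighbor is 1 step away while the vertical neighbor is `GRID_WIDTH` steps away, which is the distance a recurrent scan would otherwise have to bridge and that a vertical convolution covers in a single operation.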
Pre-encoding masking with the CTS mask ensures that only foreground (tissue-relevant) tokens contribute positional signals, filtering out background and preventing spurious context propagation. This selectivity yields a more robust positional representation concentrated on diagnostically salient regions.
5. Implementation Details and Hyperparameterization
- Embedding dimension: Matches the output of the feature extractor (e.g., $2048$ for ResNet-50, $1024$ for PLIP, $512$ for CONCH).
- Reshape map dimensions: Selected to form a nearly square grid accommodating all $4N$ tokens (zero-padding when the grid has more cells than tokens).
- Stripe convolution: Typical configuration is a single 1D dilated convolution layer (kernel size 3, dilation 2) per column over rows, implemented as grouped 1D convolution.
- Parameter initialization: Weights via Xavier initialization; biases to zero.
- Computational overhead: Negligible relative to Mamba’s sequence processing; only a lightweight convolution is applied per slide.
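Two of the quantities above are easy to compute by hand. The sketch below assumes one particular rule for the "nearly square" grid (height is the ceiling of the square root of the token count, width is the smallest value that fits all tokens); the source does not state the exact rule, and the helper names `grid_shape` and `effective_span` are hypothetical. The span formula for a dilated kernel, `dilation * (kernel_size - 1) + 1`, is standard.

```python
import math


def grid_shape(num_tokens):
    """Return (height, width, padding) for a near-square grid.

    Assumed rule: height = ceil(sqrt(num_tokens)), width = ceil(num_tokens / height).
    """
    height = math.isqrt(num_tokens - 1) + 1  # integer ceil(sqrt(n)) for n >= 1
    width = math.ceil(num_tokens / height)
    return height, width, height * width - num_tokens


def effective_span(kernel_size, dilation):
    """Number of rows covered by one dilated 1D kernel."""
    return dilation * (kernel_size - 1) + 1


# Example: N = 1000 patches, so 4N = 4000 overlapping-scanned tokens.
print(grid_shape(4000))      # (64, 63, 32)
print(effective_span(3, 2))  # 5
```

Under these assumptions, 4000 tokens fit a 64 x 63 grid with 32 zero-padded cells, and the typical configuration (kernel size 3, dilation 2) spans 5 rows while reading only 3 of them.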
6. Empirical Impact and Ablation Evidence
Empirical benchmarks on 20 WSI datasets demonstrate substantial benefits of S²PE within the MambaMIL+ framework:
- Positional Encoder Benchmarking: S²PE alone improves AUC, ACC, and F1 on diagnostic classification tasks, with the gains reported as statistically significant. It outperforms PEG, PPEG, and EPEG, which degrade AUC when combined with MambaMIL+ [Fig. 12, (Zeng et al., 19 Dec 2025)].
- Ablation Study: Removing S²PE (with overlap+CTS present) reduces average AUC from $89.9$% to $89.0$%, a drop of $0.9$ percentage points, confirming the additive impact of the module. The combination “Overlap + CTS + S²PE” yields gains in AUC, ACC, and F1 over the baseline [Table 10, (Zeng et al., 19 Dec 2025)].
- Downstream Tasks: Across diagnostic, molecular, and survival prediction, S²PE delivers $1.4$–$3.4$% AUC gains and raises average C-Index by $0.6$–$1.5$% in survival analysis when used with all three tested feature extractors.
7. Limitations and Prospective Extensions
While S²PE confers statistically significant improvements with minimal overhead, its design choices encode only vertical axis information. This may under-represent diagonal or global 2D patterns, especially in non-square or elongated tissue regions. Additionally, its kernel size and dilation are fixed, potentially limiting adaptability across diverse datasets.
Potential extensions include multi-axis stripe encoders (capturing diagonal or horizontal stripes), learnable stripe partitioning and dilation via backpropagation or attention, dynamic mask integration for adaptive positional filtering, and hybrid schemes that combine S²PE with Fourier-based or polynomial positional encodings for enhanced 2D context coverage.
For further technical exposition and empirical studies, see "MambaMIL+: Modeling Long-Term Contextual Patterns for Gigapixel Whole Slide Image" (Zeng et al., 19 Dec 2025).