Omni Selective Scan (OSS) for Vision SSMs
- Omni Selective Scan (OSS) is a mechanism that enhances the spatial modeling of visual state space models by enabling efficient, multi-directional scans.
- It performs independent directional scans—horizontal, vertical, diagonal, and channel-wise—to enable robust global and local feature propagation with linear computational complexity.
- OSS integrates a directional scan module with an O-Attention fusion mechanism, significantly boosting performance in applications like image restoration and semantic segmentation.
Omni Selective Scan (OSS) is a mechanism for enhancing the spatial modeling capacity of visual state space models (SSMs). OSS addresses the critical limitation of unidirectional or causally sequenced SSMs by enabling efficient, bidirectional, and multi-directional information flow across two-dimensional grid structures and channel dimensions, while maintaining linear computational complexity. OSS underpins recent vision architectures such as VmambaIR and OCTOPUS, facilitating strong global and local feature propagation in a computationally efficient manner and resulting in state-of-the-art performance across various low-level and high-level vision tasks (Shi et al., 2024, Mahatha et al., 31 Jan 2026).
1. Foundations: State Space Models and Visual Sequence Modeling
State space models (SSMs) are rooted in control theory and are defined by continuous- or discrete-time dynamics that map an input sequence $x_t$ through a hidden state $h_t$ to an output $y_t$. The discretized evolution for standard SSMs is governed by

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$$

where the matrices $\bar{A}$, $\bar{B}$, and $C$ are learned. Although SSMs such as S4, S5, and Mamba provide efficient long-range sequence modeling with linear time and memory complexity, naïvely applying them to images via rasterization undermines local spatial relationships and fails to propagate information isotropically across the 2D grid. This causal, 1D formulation links non-adjacent pixels while ignoring direct neighbors, impeding the spatial coherence crucial for vision tasks (Shi et al., 2024, Mahatha et al., 31 Jan 2026).
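The discretized recurrence above can be sketched directly as a sequential scan. The matrices below are arbitrary toy values, not learned parameters:

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Run the discretized SSM recurrence h_t = A_bar h_{t-1} + B_bar x_t,
    y_t = C h_t over a 1D input sequence x of length T."""
    h = np.zeros(A_bar.shape[0])        # hidden state, size N
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t     # state update
        ys.append(C @ h)                # readout
    return np.array(ys)

# toy example: scalar input, 2-dimensional hidden state
A_bar = np.array([[0.9, 0.0], [0.0, 0.5]])
B_bar = np.array([1.0, 1.0])
C = np.array([1.0, -1.0])
y = ssm_scan(np.ones(4), A_bar, B_bar, C)
```

The loop is written sequentially for clarity; practical SSMs such as Mamba compute the same recurrence with a parallel scan primitive.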
2. Multi-Directional and Omni-Directional Feature Propagation
Omni Selective Scan (OSS) generalizes the recurrence mechanism of SSMs by performing independent, discrete scans in multiple directions. In VmambaIR, OSS performs six bidirectional scans: horizontal forward/backward, vertical forward/backward, and channel-wise forward/backward (Shi et al., 2024). In OCTOPUS, OSS extends this further to eight principal spatial orientations: right (→), left (←), down (↓), up (↑), southeast (↘), northwest (↖), southwest (↙), and northeast (↗) (Mahatha et al., 31 Jan 2026).
Each scan processes a set of independent 1D lines (rows, columns, or diagonals for spatial dimensions; channels for depth), applying SSM recurrences of the form

$$h_t^{(d,\ell)} = \bar{A}^{(d)} h_{t-1}^{(d,\ell)} + \bar{B}^{(d)} x_t^{(d,\ell)}, \qquad y_t^{(d,\ell)} = C^{(d)} h_t^{(d,\ell)} \odot g^{(d)},$$

where $d$ indexes the direction, $\ell$ indexes the scan-line, and $g^{(d)}$ is a learned gate (Mahatha et al., 31 Jan 2026). All directions are processed independently in parallel, preserving strict $O(HW)$ complexity.
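A minimal sketch of the multi-directional idea: run one causal scan routine over flipped or transposed copies of the feature map, then undo the flip so every output is aligned to pixel locations. Here an exponential moving average stands in for the learned SSM recurrence, and only four of the eight OCTOPUS directions are shown:

```python
import numpy as np

def directional_scan(x, decay=0.9):
    """Causal scan along axis 0 of a 2D map x (H, W); an exponential
    moving average is used as a stand-in for the SSM recurrence."""
    out = np.zeros_like(x)
    acc = np.zeros(x.shape[1])
    for i in range(x.shape[0]):
        acc = decay * acc + x[i]
        out[i] = acc
    return out

def omni_scan(x):
    """Four independent directional scans (down/up/right/left); each output
    is flipped back so features align with the original pixel grid."""
    return {
        "down":  directional_scan(x),
        "up":    np.flip(directional_scan(np.flip(x, 0)), 0),
        "right": directional_scan(x.T).T,
        "left":  np.flip(directional_scan(np.flip(x, 1).T).T, 1),
    }

outs = omni_scan(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Diagonal and channel-wise directions follow the same pattern: extract the relevant 1D lines, scan them causally, and scatter the results back.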
3. OSS Block Structure, Traversal Selection, and Fusion
The OSS block comprises a directional scan module and an efficient feature fusion scheme. Each directional scan outputs a set of features aligned to 2D pixel locations. After all directions are processed, a traversal selection (O-Attention) mechanism fuses the multi-directional context at each spatial location. Specifically, for each pixel $(i, j)$:
- The outputs from all scanned directions $\{y^{(d)}_{i,j}\}_{d=1}^{D}$ are stacked into $Y_{i,j} \in \mathbb{R}^{D \times C}$, where $D$ is the number of directions.
- Two $1 \times 1$ convolutions (or linear layers) compute scores $s^{(d)}_{i,j}$, followed by a softmax normalization over directions, yielding attention weights $\alpha^{(d)}_{i,j}$.
- The fused output is $\hat{y}_{i,j} = \sum_{d=1}^{D} \alpha^{(d)}_{i,j}\, y^{(d)}_{i,j}$ (Mahatha et al., 31 Jan 2026).
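The three steps above amount to a softmax-weighted convex combination over the direction axis at every pixel. A minimal single-channel sketch, where plain $(D, D)$ linear maps stand in for the two $1 \times 1$ convolutions (an illustrative assumption about their shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

def o_attention_fuse(Y, W1, W2):
    """Traversal-selection (O-Attention) fusion sketch for one channel.
    Y:      (D, H, W) stacked per-direction scan outputs.
    W1, W2: (D, D) weights standing in for the two 1x1 convolutions."""
    s1 = np.maximum(np.einsum('ed,dhw->ehw', W1, Y), 0)   # first map + ReLU
    s = np.einsum('ed,dhw->ehw', W2, s1)                  # scores per direction
    e = np.exp(s - s.max(axis=0, keepdims=True))          # stable softmax
    alpha = e / e.sum(axis=0, keepdims=True)              # weights over directions
    return (alpha * Y).sum(axis=0)                        # per-pixel weighted sum

D, H, W = 8, 4, 4
Y = rng.normal(size=(D, H, W))
fused = o_attention_fuse(Y, rng.normal(size=(D, D)), rng.normal(size=(D, D)))
```

Because the softmax weights sum to one, the fused value at each pixel always lies inside the range spanned by the directional outputs there.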
In VmambaIR, additional channel-wise SSM scans are incorporated after spatial fusion, followed by a $1 \times 1$ convolutional projection (Shi et al., 2024).
Alongside the OSS module, the Efficient Feed-Forward Network (EFFN) operates on the output, comprising a $1 \times 1$ expansion, depthwise convolution, gated linear unit, and final $1 \times 1$ projection. This structure enables nonlinear and cross-channel mixing at low computational cost (Shi et al., 2024).
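A minimal sketch of this pipeline, with illustrative (assumed) layer shapes: the expansion doubles the width so the gated linear unit can split it into value and gate halves, and a $3 \times 3$ kernel is assumed for the depthwise step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def depthwise3x3(x, k):
    """Depthwise 3x3 convolution with zero padding; x: (H, W, E), k: (3, 3, E)."""
    H, W, _ = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + H, dx:dx + W] * k[dy, dx]
    return out

def effn(x, W_up, k_dw, W_down):
    """EFFN sketch: 1x1 expansion -> depthwise conv -> GLU -> 1x1 projection.
    x: (H, W, C); W_up: (C, 2E); k_dw: (3, 3, 2E); W_down: (E, C)."""
    u = x @ W_up                      # 1x1 expansion to 2E channels
    u = depthwise3x3(u, k_dw)         # cheap spatial mixing per channel
    v, g = np.split(u, 2, axis=-1)    # gated linear unit: value * sigmoid(gate)
    return (v * sigmoid(g)) @ W_down  # 1x1 projection back to C channels

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
out = effn(x, rng.normal(size=(8, 32)),
           rng.normal(size=(3, 3, 32)), rng.normal(size=(16, 8)))
```

Both $1 \times 1$ layers reduce to per-pixel matrix multiplies, which is what keeps the EFFN's cost linear in the number of pixels.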
4. Computational Complexity and Efficiency
Unlike the quadratic complexity of transformer self-attention ($O(H^2W^2)$ for an $H \times W$ image), OSS's total complexity is linear in the number of patches: $O(D \cdot HW)$, with the number of directions $D$ and the SSM state size held constant ($D = 6$ for VmambaIR, $D = 8$ for OCTOPUS). All components (scan, gating, O-Attention) scale as $O(HW)$ (Shi et al., 2024, Mahatha et al., 31 Jan 2026). Empirically, OSS in VmambaIR reported only a marginal FLOP increase over a single-direction SSM while substantially expanding the model's 2D and channel context (Shi et al., 2024).
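A rough operation count makes the gap concrete. The grid size and the assumed state size $N = 16$ below are illustrative, and constant factors are ignored:

```python
# Rough per-layer operation-count comparison (constants ignored).
H, W = 64, 64        # patch grid
D, N = 8, 16         # scan directions (OCTOPUS) and assumed SSM state size
tokens = H * W

attn_ops = tokens ** 2        # self-attention: quadratic in token count
oss_ops = D * tokens * N      # OSS: D independent linear scans, O(HW) each

ratio = attn_ops / oss_ops
print(ratio)
```

At this resolution the attention count is already 32x larger, and the gap widens quadratically as the grid grows while OSS stays linear.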
5. Architectural Integration and Practical Deployment
OSS blocks are modular and readily integrated into hierarchical architectures. In VmambaIR, a four-stage U-Net variant is used:
- Encoder: sequential OSS blocks at progressively reduced spatial resolutions.
- Decoder: upsampling, additional OSS blocks, and skip concatenations.
- Refinement: multiple OSS blocks at full resolution followed by a pixel-shuffle or convolutional output module, depending on the task (Shi et al., 2024).
In OCTOPUS, OSS is the foundational layer for vision SSMs, replacing standard raster-scan or unidirectional recurrence with true multi-directional propagation. Traversal selection is key to adaptively fusing the multi-orientation outputs at each pixel (Mahatha et al., 31 Jan 2026).
6. Empirical Performance and Analysis
OSS enables state-of-the-art results in both image restoration and semantic segmentation:
- VmambaIR achieves 29.99 dB (Urban100, 4× SR), outperforming BebyGAN (29.19 dB), with LPIPS 0.0496 vs 0.0529, and demonstrates significant efficiency gains: 27.06 dB (NTIRE2020, 4× real SR) using 10.5 M parameters and 20.5 G FLOPs, compared to MM-RealSR’s 25.19 dB/26.13 M/78.6 G (Shi et al., 2024).
- On Rain100H deraining, VmambaIR attains 31.66 dB/0.909 SSIM, exceeding Restormer’s 31.46 dB/0.904 with lower computational cost (Shi et al., 2024).
- Ablations confirm the importance of both planar and channel-wise OSS scanning; removing planar or channel scanning reduces PSNR by 0.43 dB and 0.14 dB, respectively (Shi et al., 2024).
- OCTOPUS demonstrates substantial improvements on segmentation (ADE20K single-scale mIoU: 37.93% for Octopus-T vs 22.77% for VMamba-T), cleaner object boundaries, and improved region consistency. Classification accuracy on miniImageNet also increases compared to previous vision SSMs (Octopus-T: 86.60% Top-1 vs 85.82% for VMamba-T) (Mahatha et al., 31 Jan 2026).
An analysis of the effective receptive field in OCTOPUS indicates the emergence of isotropic, eight-spoked coverage, superior to the window-based localities of Swin transformer and anisotropy of VMamba, reflecting OSS’s enhancement of 2D spatial awareness (Mahatha et al., 31 Jan 2026).
7. Significance and Perspectives
By overcoming the causality and locality constraints of standard SSMs, OSS establishes a path for scalable, spatially-aware, and efficient vision architectures. Its ability to tightly couple global context modeling and local spatial coherence, while maintaining strict linear complexity and plug-and-play architectural integration, positions OSS as a foundational operator for next-generation visual SSMs. The demonstrated empirical gains in restoration and segmentation, together with interpretability through effective receptive field analyses, underscore OSS’s impact in both theoretical modeling and practical system performance (Shi et al., 2024, Mahatha et al., 31 Jan 2026).
| Aspect | VmambaIR (6 directions) | OCTOPUS (8 directions) |
|---|---|---|
| Spatial scan directions | H/W ±, Channels ± | All axes ±, diagonals ± |
| Fusion mechanism | Addition and projection | Traversal selection (O-Attention) |
| Core SSM type | Mamba | Mamba |
| Complexity per pass | $O(HW)$ | $O(HW)$ |
| Empirical improvement | SR/Derain SOTA, efficient | Segmentation/classification boost |