
High-Order SS2D Extension

Updated 21 February 2026
  • SS2D extension is a high-order generalization of the 2D selective-scan operator that recursively applies gated SS2D layers for enhanced spatial mixing.
  • The methodology integrates a Local-SS2D module combining 3×3 convolution and SS2D paths to refine representations while maintaining linear computational complexity.
  • Empirical evaluations, such as those in H-vmunet, demonstrate improved metrics like a +1–2% Dice gain and reduced parameter count in segmentation tasks.

A variety of research communities employ the term "SS2D Extension" to refer to the enhancement, adaptation, or high-order generalization of the two-dimensional Selective-Scan operator (SS2D) and closely related spectral or data assimilation frameworks. The term's usage is notably prominent in vision state-space modeling, high-order spectral-difference algorithms, computational electromagnetics, and wave-propagation inverse problems. Across these domains, “extension” denotes both strict mathematical generalizations and modular architectural augmentations, enabling increased expressivity, higher accuracy, or improved computational efficiency.

1. Core Definition: SS2D and Its High-Order Extensions

The SS2D operator originates in the state-space modeling (SSM) paradigm for 2D data, most notably in vision backbones such as Vision Mamba and its UNet instantiations. The vanilla SS2D operator transforms an input tensor $X \in \mathbb{R}^{H \times W \times C}$ along four principal spatial scan directions, applies an SSM block per direction, and merges the results. Formally,

$$\mathrm{SS2D}(X) = \frac{1}{4} \sum_{d=1}^{4} \mathrm{InvScan}_d \,\circ\, \mathrm{S6} \,\circ\, \mathrm{Scan}_d(X)$$

where $\mathrm{S6}$ denotes the Mamba SSM block.
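As a concrete sketch of this definition, the following toy NumPy implementation realizes the four scan directions and their inverses; `s6_stub` is an assumption, a causal exponential moving average standing in for the learned S6 selective-SSM block:

```python
import numpy as np

def scan(x, d):
    """Flatten [H, W, C] into a [H*W, C] sequence along direction d in {0,1,2,3}:
    row-major, reversed row-major, column-major, reversed column-major."""
    H, W, C = x.shape
    if d in (2, 3):                 # column-major: swap spatial axes first
        x = x.transpose(1, 0, 2)
    seq = x.reshape(-1, C)
    if d in (1, 3):                 # reversed directions
        seq = seq[::-1]
    return seq

def inv_scan(seq, d, H, W):
    """Inverse of scan: restore the [H, W, C] layout from a [H*W, C] sequence."""
    C = seq.shape[-1]
    if d in (1, 3):
        seq = seq[::-1]
    if d in (2, 3):
        return seq.reshape(W, H, C).transpose(1, 0, 2)
    return seq.reshape(H, W, C)

def s6_stub(seq, a=0.9):
    """Placeholder for the Mamba S6 block (an assumption, not the real selective
    SSM): a causal exponential moving average along the sequence axis."""
    out = np.zeros_like(seq)
    h = np.zeros(seq.shape[-1])
    for t in range(len(seq)):
        h = a * h + (1 - a) * seq[t]
        out[t] = h
    return out

def ss2d(x):
    """SS2D(X) = (1/4) * sum_d InvScan_d(S6(Scan_d(X)))."""
    H, W, C = x.shape
    return sum(inv_scan(s6_stub(scan(x, d)), d, H, W) for d in range(4)) / 4

x = np.random.default_rng(0).standard_normal((8, 8, 4))
y = ss2d(x)
assert y.shape == x.shape
```

Because each direction is a plain flatten-process-unflatten, the whole operator stays linear in the number of pixels; only the recurrent block differs between this sketch and a real implementation.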

A high-order SS2D extension ("H-SS2D") applies SS2D recursively in an $n$-stage cascade. Each stage gates the input with local enhancements via a Local-SS2D module, suppressing redundancy and incrementally refining the representation. The $k$-th stage is

$$X_{k+1} = \mathrm{SS2D}\bigl(X_k \odot \mathrm{LSD}(Y_k)\bigr).$$

Here, $Y_k$ is an auxiliary stream, $\odot$ denotes element-wise multiplication, and $\mathrm{LSD}$ is a local enhancement submodule combining $3\times3$ convolution and SS2D paths with normalization. This process scales linearly in $n$, retaining overall $O(NC)$ complexity in spatial size $N = H \cdot W$ and channel count $C$, as shown in "H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation" (Wu et al., 2024).

2. Mathematical Formalism and Architectural Mechanics

The high-order SS2D extension can be described as follows:

  • Project the input into a main stream $X_0$ and $n$ auxiliary gates $Y_0, \ldots, Y_{n-1}$.
  • For each order $k = 0, \ldots, n-1$:
    • Apply Local-SS2D gating: $X_k \odot \mathrm{LSD}(Y_k)$.
    • Run SS2D for global spatial mixing.
  • Project back to the output via a learned projection.

The Local-SS2D module takes $U \in \mathbb{R}^{N \times C}$, splits channels, applies $3\times3$ convolution and SS2D to the respective halves, concatenates, and normalizes:

$$U_1, U_2 = \mathrm{Split}(\mathrm{LN}(U)), \quad V_1 = \mathrm{Conv}_{3\times3}(U_1), \quad V_2 = \mathrm{SS2D}(U_2), \quad \mathrm{LSD}(U) = \mathrm{LN}(\mathrm{Concat}(V_1, V_2)).$$
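A minimal sketch of this submodule, under two stated assumptions: a depthwise 3×3 box filter stands in for the learned convolution, and a causal running mean stands in for the real SS2D path:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-position normalization over the channel axis (no learned scale/shift)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv3x3_mean(x):
    """Stand-in for a learned 3x3 conv: depthwise 3x3 box filter, zero-padded."""
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    return sum(p[i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9

def ss2d_stub(x):
    """Placeholder for the SS2D path (an assumption, not the real operator):
    a causal running mean over the row-major flattened sequence."""
    H, W, C = x.shape
    seq = x.reshape(-1, C)
    out = np.cumsum(seq, 0) / np.arange(1, len(seq) + 1)[:, None]
    return out.reshape(H, W, C)

def lsd(u):
    """LSD(U) = LN(Concat(Conv3x3(U1), SS2D(U2))), with U1, U2 = Split(LN(U))."""
    u = layer_norm(u)
    C = u.shape[-1]
    u1, u2 = u[..., :C // 2], u[..., C // 2:]
    return layer_norm(np.concatenate([conv3x3_mean(u1), ss2d_stub(u2)], -1))
```

The design point this illustrates: one half of the channels sees a strictly local receptive field, the other a global (scan-ordered) one, so the concatenated gate carries both kinds of context at unchanged width.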

For $n=1$, this construction reduces to first-order gating; for $n>1$, the recursive structure incrementally filters background or redundant activations, enhancing region discriminability and local-detail preservation (Wu et al., 2024).

3. Computational Complexity and Empirical Performance

Both vanilla and high-order SS2D implementations scale linearly with spatial size.

  • Vanilla SS2D: $O(NC + N f(C))$ per forward pass, where $f(C)$ is quadratic or cubic in $C$, depending on the SSM implementation.
  • $n$-order H-SS2D: $O(n \cdot \mathrm{Time}(\mathrm{SS2D}) + n \cdot NC)$ due to $n$ cascaded SS2D layers and the corresponding LSD modules.

Memory footprint remains $O(NC)$. Using $n = 2, 3, 4, 5$ yields a computational cost roughly $n\times$ the baseline SS2D but remains below that of quadratic-attention Transformers. Empirical evaluation in "H-vmunet" demonstrates a ~67% parameter reduction and a +1–2% Dice coefficient improvement over Vision-Mamba U-Net and other competitive U-Net variants across the ISIC2017, Spleen, and CVC-ClinicDB segmentation benchmarks (Wu et al., 2024).
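The linear-vs-quadratic gap can be made concrete with nominal operation counts; the constant factors below are arbitrary illustrative assumptions, not measured costs:

```python
def h_ss2d_ops(N, C, n):
    """Nominal multiply-accumulate count for n cascaded SS2D passes
    (4 directional scans each) plus n LSD gating modules.
    The per-pass constant is an illustrative assumption."""
    return n * 4 * N * C + n * N * C

def attention_ops(N, C):
    """Nominal count for quadratic self-attention over N tokens."""
    return N * N * C

# At 128x128 resolution the quadratic term dominates: attention is
# over 1000x costlier than 3rd-order H-SS2D under these nominal counts.
N, C = 128 * 128, 64
ratio = attention_ops(N, C) / h_ss2d_ops(N, C, n=3)
```

The ratio grows linearly with $N$ (here $N/15$ under the assumed constants), which is why the $n\times$ constant-factor overhead of cascading stays cheap relative to attention at high resolution.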

4. SS2D Extensions in Other Domains

The "SS2D extension" concept is not restricted to visual SSMs. High-order and hybrid 2D scanning concepts also appear in:

  • Sliding-mesh spectral difference methods (“SSD/SS2D methods”): High-order accurate curved-mortar interfaces for rotating–stationary grid coupling in CFD, with strict conservation and parallel efficiency (Zhang et al., 2015).
  • Surface-integral equation solvers in 2D electromagnetics (SS-SIE/SS2D): Modular extensions for generalized complex media, supporting arbitrary connections, nonconformal meshes, and robust field equivalence (Zhu et al., 2021).
  • Inverse wave-propagation problems: “2D SS extension” refers to extension-operator regularizations and preconditioners (spatially distributed, soft-constrained surface sources), with fast Krylov solvers leveraging time reversal for efficient minimization (Symes, 2022).
  • Cross-modal SS2D: In cross-modal state-space modeling (e.g., RGB-thermal segmentation), "CM-SS2D" interleaves and couples multiple feature streams, generalizing the scan, parameter generation, and hidden-state updates to fuse modalities with linear complexity (Guo et al., 2025).

5. Practical Integration: Pseudocode and Layer Deployment

The extension is typically deployed in modular architectures, e.g., inside U-Net encoder/decoder blocks (as in H-vmunet). A generic high-order SS2D block applies nn sequential gated SS2D passes, with LSD applied to each auxiliary stream. An illustrative Python-like pseudocode is:

def H_SS2D_Block(x, n):  # x: [B, H, W, C]
    B, H, W, C = shape(x)
    x_flat = reshape(x, [B, H * W, C])
    streams = LinearProj(x_flat)  # [B, N, 2C]
    # Channel widths C/2^(n-1) (main stream) plus gate widths
    # C/2^(n-1), C/2^(n-2), ..., C sum to 2C (HorNet-style split)
    widths = [C // 2 ** (n - 1)] + [C // 2 ** (n - k - 1) for k in range(n)]
    X, *Y = Split(streams, widths, axis=-1)  # main stream and n gates
    for k in range(n):
        Gk = Local_SS2D(Y[k])        # local-detail gate, width C // 2**(n-k-1)
        X = SS2D(ProjUp[k](X) * Gk)  # widen X to match Gk, gate, then mix
    out_flat = LinearProjOut(X)      # project back to C channels
    return reshape(out_flat, [B, H, W, C])
Each U-Net stage may employ a different $n$. LSD is typically realized with a split-normalization-conv/SS2D-concat-normalization sequence (Wu et al., 2024).

6. Distinct Advantages and Information-Refinement Guarantees

High-order SS2D extensions offer several technical advantages:

  • Redundancy suppression: Recursively gated feature streams emphasize salient structures, reducing spurious activations and background noise.
  • Local–global balance: LSD modules restore spatial detail that pure SSM global scanning may overlook.
  • Linear scaling: All variants are $O(N)$ in spatial resolution; only the constant factor increases with $n$.
  • Parameter/memory economy: Compared to quadratic-attention networks and non-modular SSMs, H-SS2D significantly reduces model size and computational overhead.
  • Empirical and theoretical refinement: H-SS2D inherits incremental information-refinement properties from analogous high-order gating designs (e.g., HorNet) (Wu et al., 2024).

7. Comparative Table: High-Order SS2D Extension vs. Baseline SS2D

| Property | Vanilla SS2D | High-order H-SS2D |
|---|---|---|
| Number of scan passes | 4 directions | $n \times 4$ directions |
| Local detail preservation | Weak (global scan only) | Strong (via LSD gate) |
| Complexity (per stage) | $O(NC)$ | $n \cdot O(NC)$ |
| Parameter count (typical) | baseline | ≈ baseline / 3 |
| Empirical Dice gain | – | +1–2% (medical seg.) |

This table summarizes the key architectural, computational, and empirical differences as established in "H-vmunet" (Wu et al., 2024).


The notion of "SS2D extension" thus encapsulates a family of advances wherein the 2D selective-scan construction is generalized to higher order, more expressive, or more computationally scalable forms. Such extensions leverage recursive multi-stream gating, modular assembly, and hybrid local-global processing to overcome the limitations of both classical SSMs and contemporary quadratic-complexity attention architectures in visual and scientific computing.
