Cross-Plane Mamba Blocks Overview
- Cross-plane Mamba blocks are advanced architectural units using linear-time state-space models (SSMs) to achieve efficient, context-aware modeling across multiple axes.
- They employ pooling, unpooling, and selective scanning to exchange information among spatial, temporal, and spectral dimensions while substantially lowering computational complexity.
- Empirical results demonstrate higher throughput, reduced FLOPs, and improved performance on tasks such as image synthesis, video super-resolution, and hyperspectral classification.
Cross-plane Mamba blocks are advanced architectural units leveraging structured state-space models (SSMs) to enable efficient, global, and context-aware modeling across multiple axes or “planes” in various data modalities, including images, volumes, videos, hyperspectral cubes, and spectro-temporal signals. By using linear-time state updates and selective scan mechanisms, cross-plane Mamba blocks facilitate information exchange between spatial, spectral, temporal, and channel dimensions with substantially reduced computational cost compared to quadratic-complexity self-attention. Modern cross-plane Mamba designs systematically interleave 1D SSM operations along different axes (e.g., rows and columns, slices and channels, time and frequency), use pooling and unpooling to mediate dimensionality, and, in some cases, propagate state summaries across sequential blocks or modalities.
1. Mathematical Foundations and Core Mechanisms
Cross-plane Mamba blocks are built on linear, input-dependent, discrete-time SSMs. For a length-$L$ sequence $x_{1:L}$, the prototypical update is

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

where all coefficients may be learned or dynamically selected per timestep (token-dependent, “selective scan” as in Mamba). This scan admits a linear-time implementation and can be parallelized via the Blelloch prefix sum for further acceleration.
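As a concrete reference, the following minimal NumPy sketch implements the sequential form of such an input-dependent scan. The function name and the diagonal parameterization are illustrative assumptions, not code from any of the cited papers, and a production kernel would replace the Python loop with a parallel prefix formulation.

```python
import numpy as np

def selective_scan(x, A, B, C, h0=None):
    """Sequential reference for a diagonal, input-dependent SSM.

    x  : (L, D)    input sequence
    A  : (L, D, N) per-timestep state transition (discretized, diagonal)
    B  : (L, D, N) per-timestep input projection
    C  : (L, D, N) per-timestep output projection
    h0 : (D, N)    optional initial state (zeros by default)

    Returns the outputs y (L, D) and the terminal state h_L (D, N).
    """
    L, D = x.shape
    N = A.shape[-1]
    h = np.zeros((D, N)) if h0 is None else h0.copy()
    y = np.empty((L, D))
    for t in range(L):
        # h_t = A_t * h_{t-1} + B_t * x_t  (elementwise over the N state channels)
        h = A[t] * h + B[t] * x[t][:, None]
        # y_t = <C_t, h_t>, contracting the state dimension
        y[t] = (C[t] * h).sum(axis=-1)
    return y, h

# Toy usage with random, token-dependent coefficients.
L, D, N = 8, 4, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
A = np.exp(-rng.uniform(0.1, 1.0, (L, D, N)))  # stable decay in (0, 1)
B = 0.1 * rng.standard_normal((L, D, N))
C = 0.1 * rng.standard_normal((L, D, N))
y, h_last = selective_scan(x, A, B, C)
```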
The cross-plane extension manipulates the axis along which the sequence is constructed. For images, tokens are grouped and pooled along columns or rows; for 3D volumes, neighboring slices are stacked; for video and spectro-temporal signals, spatial and temporal axes or frequency bands are independently reordered and processed.
Incorporating cross-block or cross-layer “memory” can be achieved by initializing the state of each subsequent block with the terminal state of its predecessor, $h_0^{(\ell+1)} = \phi\big(h_L^{(\ell)}\big)$, where $\phi$ is a boundary map (typically the identity), enabling differentiable gradient flow across blocks and enforcing architectural recurrence (Chavan et al., 14 Nov 2025).
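A minimal sketch of this handoff, reusing the hypothetical `selective_scan` helper from the previous snippet: each block is seeded with its predecessor's terminal state through a boundary map that defaults to the identity (the parameter-free case described above).

```python
def run_stack(x, blocks, phi=lambda h: h):
    """Chain SSM blocks, initializing each block's state with its
    predecessor's terminal state via the boundary map `phi`
    (identity by default)."""
    h = None
    for A, B, C in blocks:              # each block carries its own coefficients
        x, h_last = selective_scan(x, A, B, C, h0=h)
        h = phi(h_last)                 # state handoff to the next block
    return x

out = run_stack(x, [(A, B, C)] * 3)     # three blocks sharing the toy coefficients above
```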
2. Variants: Design Patterns and Data Modalities
The cross-plane Mamba concept has been instantiated for diverse data types, with each mapping the “plane” abstraction to relevant dimensions:
- Vision (Images and Volumes): Fast Vision Mamba alternates pooling over image rows and columns in successive blocks, processes the pooled row/column vectors with an SSM, and repeats the process to cover the 2D grid (Kapse et al., 1 Feb 2025). TranSamba applies cross-plane Mamba to aggregate context among contiguous medical slices by interleaving tokens from adjacent slices, applying the Mamba SSM, and redistributing the outputs to each individual slice (Lyu et al., 11 Dec 2025).
- Video: VSRM alternates “Spatial-to-Temporal” and “Temporal-to-Spatial” Mamba blocks by flattening spatial positions within frames and temporal positions across frames (see the reshaping sketch after this list), then using forward/backward SSM scans to mix features globally in each direction. This achieves joint spatial and temporal fusion in linear time (Tran et al., 28 Jun 2025).
- Hyperspectral Imaging: SS-Mamba blocks process spatial tokens (patches across an image) and spectral tokens (across frequency bands), updating each stream with separate Mamba SSMs and injecting cross-stream (spatial-to-spectral and vice versa) modulation (Huang et al., 29 Apr 2024).
- Speech (Spectro-Temporal): BiCrossMamba-ST applies separate bidirectional Mamba stacks to spectral sub-bands and temporal intervals, then fuses information via mutual cross-attention, which is crucial for robust speech deepfake detection (Kheir et al., 20 May 2025).
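To make the video-variant axis reordering concrete, the sketch below shows how a toy (T, H, W, C) feature tensor can be flattened so that a 1D scan runs either over space within each frame or over time at each spatial position. The function names are illustrative, not taken from the VSRM codebase, and the real blocks additionally apply forward/backward scans and learned projections.

```python
import numpy as np

def spatial_sequences(feat):
    """(T, H, W, C) -> (T, H*W, C): one spatial sequence per frame."""
    T, H, W, C = feat.shape
    return feat.reshape(T, H * W, C)

def temporal_sequences(feat):
    """(T, H, W, C) -> (H*W, T, C): one temporal sequence per pixel position."""
    T, H, W, C = feat.shape
    return feat.transpose(1, 2, 0, 3).reshape(H * W, T, C)

feat = np.random.randn(5, 8, 8, 32)      # T=5 frames, an 8x8 grid, 32 channels
seq_spatial = spatial_sequences(feat)    # fed to a spatial Mamba scan
seq_temporal = temporal_sequences(feat)  # fed to a temporal Mamba scan
```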
These variants frequently employ pooling (downsampling) along one axis to reduce sequence length prior to SSM application—subsequently unpooling (repeating) the results—thereby drastically limiting the number of necessary recurrent steps.
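The pool/scan/unpool pattern can be sketched as follows. Features are mean-pooled along one axis, a 1D operation over the shortened sequence stands in for the SSM scan (here a simple running average, purely as a placeholder), and the result is broadcast back and fused with the input; the names and the residual fusion are illustrative assumptions.

```python
import numpy as np

def pooled_axis_scan(feat, axis, scan_fn):
    """Pool an (H, W, C) feature map along `axis`, scan the pooled
    1D sequence, then unpool (broadcast) back onto the original grid."""
    pooled = feat.mean(axis=axis)                  # (W, C) if axis=0, (H, C) if axis=1
    scanned = scan_fn(pooled)                      # linear-time 1D mixing over the remaining axis
    unpooled = np.expand_dims(scanned, axis=axis)  # broadcast along the pooled axis
    return feat + unpooled                         # residual fusion with the input

# Placeholder scan: a running average; a real block would use a selective-scan SSM.
toy_scan = lambda s: np.cumsum(s, axis=0) / (np.arange(len(s))[:, None] + 1)

x = np.random.randn(16, 16, 8)
x = pooled_axis_scan(x, axis=0, scan_fn=toy_scan)  # pool rows, mix along columns
x = pooled_axis_scan(x, axis=1, scan_fn=toy_scan)  # pool columns, mix along rows
```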
3. Computational Complexity and Efficiency
A principal advantage of cross-plane Mamba blocks is the reduction in computational and memory complexity compared to conventional attention-based architectures. Key trade-offs can be summarized as follows:
| Architecture | Per-Block Complexity | Memory | Key Feature |
|---|---|---|---|
| Standard Transformer (Attention) | $\mathcal{O}(L^2 d)$ | $\mathcal{O}(L^2)$ | Quadratic global token interaction |
| Mamba (Single Plane) | $\mathcal{O}(L d N)$ | $\mathcal{O}(L d)$ | Linear, local/global context depending on scan |
| Cross-Plane Mamba (with pooling) | $\mathcal{O}((H{+}W)\, d N)$ for an $H \times W$ grid | $\mathcal{O}((H{+}W)\, d)$ | Linear, global mixing across projected axis |

Here $L$ denotes the token count, $d$ the embedding width, $N$ the SSM state size, and $H \times W$ the spatial grid over which pooling is applied.
In Fast Vision Mamba, the number of sequential state-space scan steps drops from the full token count to the length of a single pooled axis by alternating pooling axes, delivering throughput increases of over 70% for high-resolution image processing and FLOPs savings of 35% or more (Kapse et al., 1 Feb 2025).
When cross-plane blocks are embedded within hybrid architectures (e.g., TranSamba), the overall layer cost becomes linear in the volumetric depth (number of slices), with total batch memory independent of that depth, outperforming cross-plane self-attention designs, which would scale quadratically in the number of slices (Lyu et al., 11 Dec 2025).
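The following back-of-envelope comparison illustrates these scaling regimes using the generic per-block estimates from the table above; the token counts and widths are arbitrary, constants and the cost of pooling itself are ignored, and the printed numbers are not measurements from any cited paper.

```python
def attention_cost(L, d):
    """Rough per-block cost of self-attention: O(L^2 * d)."""
    return L * L * d

def ssm_cost(L, d, N=16):
    """Rough per-block cost of a selective scan: O(L * d * N)."""
    return L * d * N

H = W = 64              # 64 x 64 token grid
d = 192                 # embedding width
L = H * W               # 4096 tokens in the full plane

print(f"attention        : {attention_cost(L, d):,}")
print(f"single-plane SSM : {ssm_cost(L, d):,}")
print(f"cross-plane SSM  : {ssm_cost(H, d) + ssm_cost(W, d):,}")  # one pooled scan per axis
```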
4. Functional and Architectural Roles
Cross-plane Mamba blocks serve several critical roles across applications:
- Efficient Cross-Context Fusion: By reshaping and scanning along orthogonal axes, these blocks exchange information globally among otherwise independent sub-sequences, yielding long-range dependencies with linear complexity.
- Directional Prior and State Propagation: In recurrence-enabled variants such as Arcee, terminal state-space representations are propagated across depth to encode a mild directional prior, which is particularly effective in generative vision models (Chavan et al., 14 Nov 2025).
- Hybridization with Attention: Many architectures alternate cross-plane Mamba (SSM) layers with Transformer self-attention modules, using the former for global context along hard-to-model directions (e.g., slice, band, or frame) and the latter for in-plane, quadratic-complexity mixing, thereby capturing both volumetric and planar dependencies at optimal cost (Lyu et al., 11 Dec 2025); a schematic sketch follows this list.
- Parameter Efficiency and Plug-and-Play: Cross-plane memory mechanisms such as Arcee’s state handoff are parameter-free and demand only constant overhead per block, facilitating rapid adoption across a wide spectrum of Vision Mamba variants (Chavan et al., 14 Nov 2025).
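As referenced in the hybridization item above, the following schematic shows one way such a hybrid layer could be organized: quadratic self-attention mixes tokens within each slice, and a linear-time recurrent module mixes pooled summaries across slices. The class name is hypothetical, and the `nn.GRU` is only a stand-in for the cross-plane selective-scan SSM used in the actual architectures.

```python
import torch
import torch.nn as nn

class HybridLayer(nn.Module):
    """In-plane self-attention per slice plus cross-plane mixing along the
    slice axis (sketch; the recurrent stand-in replaces a real SSM)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross = nn.GRU(dim, dim, batch_first=True)  # linear-time stand-in for the SSM

    def forward(self, x):                    # x: (S, T, D) = slices, tokens per slice, channels
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # quadratic mixing within each slice
        summary = self.norm2(x).mean(dim=1)  # (S, D): one pooled token per slice
        mixed, _ = self.cross(summary.unsqueeze(0))   # scan along the slice axis
        return x + mixed.squeeze(0).unsqueeze(1)      # broadcast cross-plane context back

layer = HybridLayer(dim=64)
vol = torch.randn(12, 196, 64)   # 12 slices, 14 x 14 tokens per slice, 64 channels
out = layer(vol)                 # (12, 196, 64)
```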
5. Empirical Results and Domain-Specific Applications
Cross-plane Mamba blocks underpin performance improvements across vision, medical imaging, video, speech, and remote sensing:
- Image Synthesis (Arcee): Incorporating cross-block state reuse in Mamba reduces FID on CelebA-HQ for unconditional generation compared to strictly causal, zero-initialized block variants (Chavan et al., 14 Nov 2025).
- High-Resolution Vision: FastVim attains substantial throughput gains (>70%) and FLOPs reductions (35% or more) with neutral or improved ImageNet Top-1 accuracy, and comparable or superior results on semantic segmentation, detection, and masking tasks (Kapse et al., 1 Feb 2025).
- Volumetric Medical Segmentation: TranSamba, leveraging cross-plane Mamba blocks, achieves absolute improvements in DSC and IoU over in-plane-only Mamba/ViT baselines, establishing new state-of-the-art results across multiple datasets, with gains saturating beyond an optimal slice window size (Lyu et al., 11 Dec 2025).
- Video Super-Resolution: In VSRM, the spatial-to-temporal and temporal-to-spatial arrangement delivers superior PSNR on REDS4 and Vimeo-90K-T compared to IART/windowed Transformers, with linear time complexity and global receptive field (Tran et al., 28 Jun 2025).
- HSI Classification: SS-Mamba outperforms self-attention Transformers on widely used hyperspectral datasets due to efficient joint modeling of spectral and spatial structures (Huang et al., 29 Apr 2024).
- Speech Deepfake Detection: BiCrossMamba-ST achieves dramatic reductions in Equal Error Rate (EER) versus prior state-of-the-art baselines (AASIST, RawBMamba) by separately and jointly modeling spectro-temporal axes via bidirectional Mamba and mutual cross-attention (Kheir et al., 20 May 2025).
6. Implementation Strategies and Architectural Trade-offs
Key practical considerations when deploying cross-plane Mamba blocks include:
- Pooling/Unpooling Patterns: To balance information flow with compute efficiency, blocks alternate pooling axes and grid transposition, ensuring every pair of blocks covers both spatial dimensions in vision stacks (Kapse et al., 1 Feb 2025).
- Boundary Map Selection: Simple choices such as the identity handoff ($\phi = \mathrm{id}$) confer end-to-end recurrence without learnable parameters, but more general differentiable maps are possible (Chavan et al., 14 Nov 2025).
- Parallelization: Forward and backward SSM scans, bidirectional stacking, and Blelloch-style parallel prefix sums are all used to maximize hardware utilization and minimize latency; a bidirectional-scan sketch follows this list.
- Hybridization: Cross-plane blocks are used in conjunction with or as efficient surrogates for quadratic self-attention, typically in axes or modalities where quadratic cost is prohibitive.
- Empirical Tuning: The number of cross-plane tokens (slices, patches, etc.) is a key hyperparameter; optimal values vary by task but are empirically shown to maximize cross-context modeling at moderate context widths (Lyu et al., 11 Dec 2025).
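A minimal illustration of the forward/backward fusion mentioned in the parallelization item above, reusing the hypothetical `selective_scan` sketch from Section 1; for simplicity the same coefficients are reused in both directions, whereas an input-dependent model would recompute them on the reversed sequence.

```python
def bidirectional_scan(x, A, B, C):
    """Run the scan in both directions and fuse the results; the two
    passes are independent and can execute in parallel."""
    y_fwd, _ = selective_scan(x, A, B, C)
    y_bwd, _ = selective_scan(x[::-1], A, B, C)
    return y_fwd + y_bwd[::-1]
```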
7. Comparison and Connections to Related Methodologies
The cross-plane Mamba paradigm differentiates itself from alternative sequence modeling techniques by its efficient use of input-adaptive state-space models, its plug-and-play interface for cross-context aggregation, and its ability to scale to high-dimensional, high-resolution settings otherwise infeasible for Transformers. It generalizes concepts found in channel-mixing, slice-wise and patch-wise architectures, and hybrid SSM-Attention models, while leveraging unique strengths in parallelization and memory efficiency.
While quadratic self-attention provides model-agnostic full-context interaction, its unfavorable scaling is prohibitive beyond modest input sizes. CNNs remain local unless deeply stacked. Variant SSM approaches without adaptive or recurrent cross-block propagation lose the generative and context propagation gains observed in Mamba-based designs (Chavan et al., 14 Nov 2025, Kapse et al., 1 Feb 2025). Cross-plane Mamba blocks, through block-wise memory and global scan, yield a favorable blend of context, efficiency, and flexibility across domains.