
Channel Mamba Block (CMB) Overview

Updated 16 December 2025
  • Channel Mamba Block (CMB) is a neural network module that uses selective state-space models to perform global channel mixing with linear complexity, enabling efficient long-range feature interaction.
  • CMBs replace traditional channel-mixing operators, such as convolution, MLP, and self-attention, by employing data-adaptive SSM transformations across the channel axis for improved performance.
  • Empirical studies show that CMB implementations boost metrics like Dice score, PSNR, and accuracy in tasks including segmentation, super-resolution, and forecasting while maintaining favorable computational efficiency.

A Channel Mamba Block (CMB) is a neural network module that performs global or long-sequence mixing across the channel dimension of a tensor using variants of the Mamba family of selective state space models (SSMs). CMBs have emerged as a powerful alternative to convolution- and attention-based channel mixers, enabling linear-complexity cross-channel coupling and global receptive fields for each feature, primarily in vision, time series, and sequence modeling. The term covers both general channel-axis SSM designs built on Mamba and specific algorithmic instantiations developed for semantic segmentation, image super-resolution, long-range sequence processing, time series forecasting, and image restoration.

1. Core Principles and Architectural Patterns

The defining principle of the Channel Mamba Block is to replace or augment traditional channel-mixing operators—such as convolution, MLP, or self-attention—with an SSM-based transformation acting along the channel axis or flattened channel-major sequences.

Typical CMB construction follows three stages:

  • Reshape (or permute): the tensor is rearranged so the SSM can act along the channel axis; for example, a feature tensor of shape $(B, C, H, W)$ may have its spatial dimensions flattened to give $(B, L, C)$ with sequence length $L = H \times W$, or be transposed so the channels themselves form the scan sequence.
  • Channel-wise SSM application: a Mamba or bidirectional SSM, often parameterized with input-dependent dynamics, is applied along this channel axis or channel sequence, enabling long-range and data-adaptive channel interactions.
  • Fusion and residual paths: Output from the channel SSM may be combined with pointwise linear or convolutional projections and further processed with gating or feed-forward subnetworks; residual connections are standard for stability and expressivity.

Distinct CMB instantiations may employ additional mechanisms such as block-wise gating, channel-specific biases, bidirectional sweeps, or hybrid spatial-channel decompositions.
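To make the three-stage pattern concrete, here is a minimal, hedged sketch in PyTorch. It is not taken from any of the cited papers: the toy selective SSM, the gating branch, and every module and argument name (ToyChannelSSM, ChannelMambaBlock, spatial_size) are illustrative assumptions, and production CMBs replace the Python loop with optimized, often bidirectional, parallel-scan Mamba kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyChannelSSM(nn.Module):
    """Toy input-dependent (selective) diagonal SSM scanned along the channel axis.
    Illustrative only: real CMBs use optimized, often bidirectional, Mamba scan kernels."""

    def __init__(self, d_feat, d_state=16):
        super().__init__()
        self.in_B = nn.Linear(d_feat, d_state)
        self.in_C = nn.Linear(d_feat, d_state)
        self.in_delta = nn.Linear(d_feat, 1)
        self.A_log = nn.Parameter(torch.zeros(d_state))   # A = -exp(A_log) keeps the recurrence stable
        self.out = nn.Linear(d_state, d_feat)

    def forward(self, u):                                 # u: (B, N, D) with N = number of channels
        bsz, n_tokens, _ = u.shape
        delta = F.softplus(self.in_delta(u))              # (B, N, 1) data-adaptive step size
        A = -torch.exp(self.A_log)                        # (S,) negative-real diagonal dynamics
        Bu, Cc = self.in_B(u), self.in_C(u)               # (B, N, S) input and readout projections
        h = u.new_zeros(bsz, A.shape[0])
        ys = []
        for k in range(n_tokens):                         # sequential scan over channel tokens
            A_bar = torch.exp(delta[:, k] * A)            # (B, S) discretized transition per sample
            h = A_bar * h + delta[:, k] * Bu[:, k]        # selective state update
            ys.append(self.out(Cc[:, k] * h))             # per-channel output, (B, D)
        return torch.stack(ys, dim=1)                     # (B, N, D)


class ChannelMambaBlock(nn.Module):
    """Generic CMB pattern: reshape -> channel-axis SSM -> gated fusion + residual."""

    def __init__(self, spatial_size, d_state=16):
        super().__init__()
        d_feat = spatial_size                             # each channel token carries its L = H*W values
        self.norm = nn.LayerNorm(d_feat)
        self.ssm = ToyChannelSSM(d_feat, d_state)
        self.gate = nn.Linear(d_feat, d_feat)             # pointwise branch, used here as a gate
        self.proj = nn.Linear(d_feat, d_feat)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.reshape(b, c, h * w)                   # channel axis becomes the scan sequence
        z = self.norm(tokens)
        mixed = self.ssm(z) * torch.sigmoid(self.gate(z)) # SSM branch gated by a pointwise branch
        return (tokens + self.proj(mixed)).reshape(b, c, h, w)  # residual, back to (B, C, H, W)


x = torch.randn(2, 32, 8, 8)
print(ChannelMambaBlock(spatial_size=8 * 8)(x).shape)     # torch.Size([2, 32, 8, 8])
```

The sequential loop stands in for the parallel scan used in practice; the cost remains linear in the number of channels, consistent with the complexity discussion in Section 3.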

2. Mathematical Formulations and Block Variants

Channel Mamba Blocks can be rigorously formulated in the SSM framework, with both discrete and input-dependent parameterizations. Below are representative equations and definitions extracted from the literature:

Given $x_{in}\in\mathbb R^{B\times C\times H\times W}$:

  1. Reshape: $f_1(x_{in})\in\mathbb R^{B\times L\times C}$, $L = H\cdot W$.
  2. Two-branch processing:
    • Branch A: LIFM_Block, a two-stage factorized Mamba with an intermediate SiLU nonlinearity.
    • Branch B: Linear projection $W_c$.
  3. Elementwise fusion and reshape:

$$x_{out} = \overline{f_1}\!\left[\, \mathrm{LIFM\_Block}(f_1(x_{in})) + W_c(f_1(x_{in})) \,\right]$$

where $\overline{f_1}$ denotes the inverse reshape back to $\mathbb R^{B\times C\times H\times W}$.

Each NC-MambaBlock applies:

$$\text{NC-MambaBlock}(x) = W_3\left[\, \text{SiLU}(W_2 x) \odot \text{SSM}_D\!\big(\text{SiLU}(W_1 x)\big) \,\right]$$

where $\text{SSM}_D$ is a bidirectional sweep and $W_1, W_2, W_3$ are linear layers.
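A hedged sketch of this gated structure, assuming any bidirectional sequence module with a (B, N, dim) to (B, N, dim) signature can stand in for $\text{SSM}_D$; module and variable names here are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCMambaBlock(nn.Module):
    """Gated SSM block: W3 [ SiLU(W2 x) * SSM_D(SiLU(W1 x)) ]."""
    def __init__(self, dim, ssm: nn.Module = None):
        super().__init__()
        self.W1 = nn.Linear(dim, dim)            # projection feeding the SSM branch
        self.W2 = nn.Linear(dim, dim)            # projection feeding the gating branch
        self.W3 = nn.Linear(dim, dim)            # output projection after fusion
        self.ssm = ssm if ssm is not None else nn.Identity()  # stand-in for the bidirectional SSM_D

    def forward(self, x):                        # x: (B, N, dim)
        gate = F.silu(self.W2(x))                # gating branch
        ssm_out = self.ssm(F.silu(self.W1(x)))   # SSM branch (bidirectional sweep in the paper)
        return self.W3(gate * ssm_out)           # elementwise fusion, then W3

y = NCMambaBlock(dim=64)(torch.randn(2, 32, 64))  # Identity stands in for SSM_D in this toy example
print(y.shape)  # torch.Size([2, 32, 64])
```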

For $S\in\mathbb R^{B\times k\times C}$:

  • First LayerNorm.
  • Two-branch subblock:
    • DDBM: bidirectional Mamba SSM over the channel or spatial sequence.
    • Linear + depthwise convolution.
  • Fusion, residual, LayerNorm, and Spatial-Gate FFN.

Mathematically,

$$S_{tmp} = \mathrm{LP}\!\left[\, \mathrm{DDBM}(\mathrm{LN}(S)) + \mathrm{DWConv}\big(\mathrm{LP}(\mathrm{LN}(S))\big) \,\right] + S$$

$$S_{cmb} = \mathrm{SGFN}\big(\mathrm{LN}(S_{tmp})\big) + S_{tmp}$$
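A minimal sketch of this two-branch layout, under the assumptions that DDBM can be any bidirectional sequence module mapping (B, k, C) to (B, k, C) and that a plain gated FFN stands in for the Spatial-Gate FFN; class and attribute names are illustrative only:

```python
import torch
import torch.nn as nn

class TwoBranchCMB(nn.Module):
    """LN -> (bidirectional SSM branch + Linear/DWConv branch) -> fusion -> residual -> LN -> FFN."""
    def __init__(self, dim, ssm: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ddbm = ssm                                   # placeholder for the bidirectional Mamba (DDBM)
        self.lp_in = nn.Linear(dim, dim)                  # linear projection feeding the conv branch
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise conv over the sequence
        self.lp_out = nn.Linear(dim, dim)                 # fusion projection LP
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))  # stand-in for SGFN

    def forward(self, s):                                 # s: (B, k, C)
        z = self.norm1(s)
        conv = self.dwconv(self.lp_in(z).transpose(1, 2)).transpose(1, 2)
        s_tmp = self.lp_out(self.ddbm(z) + conv) + s      # S_tmp = LP[DDBM(LN(S)) + DWConv(LP(LN(S)))] + S
        return self.ffn(self.norm2(s_tmp)) + s_tmp        # S_cmb = SGFN(LN(S_tmp)) + S_tmp

block = TwoBranchCMB(dim=48, ssm=nn.Identity())           # Identity stands in for DDBM in this toy example
print(block(torch.randn(2, 64, 48)).shape)                # torch.Size([2, 64, 48])
```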

Partition $d$ channels into $h$ blocks of size $p$ (so $d = hp$). For each block $j$ and channel $i$:

$${}^{j}\Delta_{k}^{(i)} = \mathrm{softplus}\!\left( ({}^{j}w)^\top\, {}^{j}u_k + {}^{j}b^{(i)} \right)$$

$$h_k = \overline{A}_k^{(i)}\, h_{k-1} + \overline{B}_k^{(i)}\, u_k^{(i)}, \qquad y_k = \overline{C}_k\, h_k$$

with bias, gating, and blockwise SSM structure.

This block restores universal approximation by adding a channel-specific bias and blockwise decoupling, mitigating the expressivity and inductive-bias limitations of the original Mamba S6.
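As a concrete illustration of the block-biased step-size computation, the following hedged sketch assumes $d$ channels split into $h$ contiguous blocks of size $p$, one shared projection vector per block, and one scalar bias per channel; the function and tensor names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def block_biased_delta(u, w, b):
    """Block-biased step sizes.
    u: (B, N, d) input sequence; w: (h, p) per-block projection vectors;
    b: (h, p) channel-specific biases. Returns Delta of shape (B, N, h, p)."""
    B, N, d = u.shape
    h, p = w.shape
    assert d == h * p, "channels must split evenly into h blocks of size p"
    u_blocks = u.view(B, N, h, p)                      # split the d channels into h blocks of size p
    proj = torch.einsum('bnhp,hp->bnh', u_blocks, w)   # (j_w)^T (j_u_k) for each block j
    return F.softplus(proj.unsqueeze(-1) + b)          # add the per-channel bias j_b^(i), then softplus

delta = block_biased_delta(torch.randn(2, 128, 64), torch.randn(8, 8), torch.zeros(8, 8))
print(delta.shape)  # torch.Size([2, 128, 8, 8]): one step size per block and per channel
```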

After a temporal M-Mamba SSM:

  • Channel mixing is performed by a “global data-dependent MLP” (GDD-MLP), which generates per-sample weights and biases for each channel via global pooling; no self-attention or fixed MLP is used (a sketch follows this list).
  • Channel Mixup is applied as a data augmentation that randomly mixes channels during training to encourage robustness.
  • Feature tensor $X\in\mathbb R^{H\times W\times C}$ is rearranged to treat the channel axis as the SSM “time” axis: $\hat X^{T}\in\mathbb R^{C\times L}$.
  • SelectiveSSM: discretized SSM with learned, input-dependent parameters for each channel position, fused via gating and convolutional post-processing.
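A hedged sketch of such a global data-dependent channel mixer, interpreting the description above as a shared channel-mixing MLP modulated by per-sample scale and bias generated from a pooled descriptor; class and parameter names are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn

class GlobalDataDependentMLP(nn.Module):
    """Channel mixer whose per-channel scale and bias are generated from a global descriptor."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mix = nn.Sequential(                        # shared channel-mixing MLP along the channel axis
            nn.Linear(channels, hidden), nn.GELU(), nn.Linear(hidden, channels))
        self.gen = nn.Sequential(                        # pooled descriptor -> per-channel (scale, bias)
            nn.Linear(channels, hidden), nn.GELU(), nn.Linear(hidden, 2 * channels))

    def forward(self, x):                                # x: (B, T, C), e.g. time steps x variates
        desc = x.mean(dim=1)                             # global pooling over the sequence axis
        scale, bias = self.gen(desc).chunk(2, dim=-1)    # data-dependent, per-sample modulation
        return self.mix(x) * scale.unsqueeze(1) + bias.unsqueeze(1)

y = GlobalDataDependentMLP(channels=8)(torch.randn(4, 96, 8))
print(y.shape)  # torch.Size([4, 96, 8])
```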

3. Computational Complexity and Scaling Properties

Channel Mamba Blocks are linear in both feature sequence length and channel dimension, i.e., $O(LC)$. This contrasts with the $O(C^2)$ (channel attention) and $O(L^2)$ (spatial attention) complexity of standard self-attention. For example:

  • In CAMS-Net, each CMB stage—including both Mamba-SSM and residual projections—maintains $O(LC)$ cost, as all state-space recurrences and linear layers are efficiently parallelizable (Khan et al., 9 Jun 2024).
  • In MPSI, the bidirectional Mamba and convolutional branches both have linear time and memory in $k = H\cdot W$ and $C$ (He et al., 10 Dec 2024).
  • In Block-Biased Mamba ($B_2S_6$), each block is parallelizable, and total work is $O(Ldn)$, with $n$ the SSM width (Yu et al., 13 May 2025).

All surveyed CMB variants avoid quadratic blow-up in any axis and do not require explicit position encodings to achieve long-range dependence modeling.
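To make these scaling claims concrete, the following back-of-the-envelope counts compare the three regimes for one assumed configuration; the dimensions are arbitrary and the expressions are asymptotic operation counts, not measured FLOPs.

```python
# Illustrative asymptotic operation counts for one assumed configuration.
L, C, n = 64 * 64, 256, 16                           # spatial tokens, channels, SSM state width
print(f"channel SSM  ~ L*C*n  = {L * C * n:,}")      # linear in both L and C
print(f"channel attn ~ L*C**2 = {L * C ** 2:,}")     # quadratic in the channel dimension
print(f"spatial attn ~ C*L**2 = {C * L ** 2:,}")     # quadratic in the spatial sequence length
```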

4. Empirical Performance and Ablation Analysis

Multiple studies provide quantitative ablations isolating the impact of the Channel Mamba Block:

  • CAMS-Net (Cardiac segmentation): Adding the channel Mamba aggregator alone gives +1.5% average Dice over the no-aggregator baseline; combining spatial and channel Mamba blocks yields an even higher aggregation gain (total 4.8% over the vanilla architecture) (Khan et al., 9 Jun 2024).
  • MPSI (Super-resolution): Grafting CMB onto a spatial attention baseline increases Urban100 ×4 PSNR from 26.00 dB (STB only) to 26.16 dB, with visible improvements in local detail and a larger SSIM jump versus channel self-attention (CA) (He et al., 10 Dec 2024).
  • Block-Biased Mamba ($B_2S_6$): Achieves 87.71% mean accuracy on Long-Range Arena, outperforming S4 and S4D; ablation shows a drop of more than 10 accuracy points when the bias term is omitted, pinpointing the necessity of channel-specific bias for expressiveness (Yu et al., 13 May 2025).
  • CMamba (Multivariate time series forecasting): CMamba is consistently top-2 on seven benchmarks; GDD-MLP for channel mixing plus channel mixup lowers validation loss versus MLP or Mixup-only baselines (Zeng et al., 8 Jun 2024).
  • CU-Mamba (Image Restoration): Use of the channel SSM block in isolation provides a +0.62 dB PSNR boost over deep ResBlock baseline on GoPro deblurring, and further improvements accrue when combined with the spatial SSM (Deng et al., 17 Apr 2024).

A common thread is that CMBs yield performance improvements over both non-mixing and standard channel-mixing operators, with per-task gains ranging from a few tenths of a dB in PSNR to several points in Dice or accuracy.

5. Comparison to Convolutional, Self-Attention, and Other Channel-Mixing Modules

Several CMB studies highlight direct advantages or design tradeoffs with respect to established channel-mixing modules:

  • Self-Attention: While self-attention captures global channel correlations, it incurs $O(C^2)$ cost, and ablations in time series and super-resolution tasks show CMBs match or exceed CA module accuracy and visual quality at substantially reduced computational cost (He et al., 10 Dec 2024, Zeng et al., 8 Jun 2024).
  • MLP: Vanilla, position-independent MLPs are competitive in simplicity but lack data adaptivity and generalize poorly when input distributions shift; CMBs’ use of SSMs, or of MLPs parameterized by global descriptors (GDD-MLP), regains data dependence and long-range coupling (Zeng et al., 8 Jun 2024).
  • Convolution: CMBs replace convolutional channel mixing—common in ResNets and U-Nets—with SSMs, achieving nonlocal coupling and global receptive fields and enabling convolution- and attention-free architectures such as CAMS-Net (Khan et al., 9 Jun 2024).

This suggests that CMBs are especially advantageous in settings where inter-channel dependencies cannot be reliably modeled with static weights or local kernels.

6. Theoretical Properties: Expressiveness, Bias, and Stability

Recent work formalizes the expressiveness and optimization regularity of CMBs with input-dependent and blockwise SSM parameterizations:

  • Universal Approximation: S4D is universal for sequence-to-real tasks; vanilla Mamba (S6), in contrast, is not, due to its lack of per-channel parameter width. The Block-Biased architecture ($B_2S_6$) regains universality by introducing blockwise gating and channel-specific biases, avoiding collapse to non-expressive encoders (Yu et al., 13 May 2025).
  • Inductive Bias: In vanilla S6, large-magnitude inputs may induce exponential decay or memory locking, while the blockwise structure in $B_2S_6$ yields only polynomial dropoff in sequence gradients, favoring more robust long-range information integration.
  • Stability: Gradient magnitudes in S6 (Mamba) scale superlinearly with input magnitude and sequence length, whereas S4D and $B_2S_6$ scale more gently; this can be mitigated in practice by lowering the learning rate for SSM hyperparameters (Yu et al., 13 May 2025).

7. Application Domains and Integration Patterns

CMBs have been integrated into a variety of architectures and tasks:

| Paper/Model | Application Domain | CMB Integration Pattern |
|---|---|---|
| CAMS-Net | Cardiac image segmentation | Channel aggregator in U-Net |
| MPSI | Single image super-resolution | Channel-level module in SAMB |
| Block-Biased Mamba | Long-range sequence modeling | Channel block in Mamba SSMs |
| CMamba | Multivariate time series | GDD-MLP channel mixer after SSM |
| CU-Mamba | Image restoration (denoising, deblurring) | Channel SSM in U-shaped net |

In all cases, CMBs are placed after or in parallel with spatial/contextual feature extraction stages, typically at every encoder and decoder level. Channel SSMs are especially complementary to their spatial SSM counterparts, resulting in efficient modeling along both the spatial and channel dimensions.


In summary, the Channel Mamba Block is a general channel-mixing architecture that leverages state space models, most often of the Mamba or block-biased SSM form, to achieve global, data-adaptive, and efficient channel correlation learning. By doing so, CMBs overcome limitations of convolutional, MLP, and self-attention-based mixers, delivering both theoretical and empirical advances in vision, time series, and sequential data domains (Khan et al., 9 Jun 2024, He et al., 10 Dec 2024, Yu et al., 13 May 2025, Zeng et al., 8 Jun 2024, Deng et al., 17 Apr 2024).
