InceptionMamba Architecture Overview
- InceptionMamba Architecture is a hybrid neural design that combines multi-scale Inception convolution with selective state space modeling for robust local and global feature extraction.
- The segmentation variant employs an encoder–bottleneck–decoder setup with a Feature Calibration Module and Inception Mamba Module to enhance boundary precision and achieve high Dice scores with fewer GFLOPs.
- The backbone variant replaces traditional strip convolutions with orthogonal band convolutions and integrates a bottleneck GlobalMixer, boosting performance on classification, detection, and segmentation benchmarks.
InceptionMamba is a name applied to two distinct 2025 neural architectures that combine Inception-style convolutional decomposition with Mamba-based selective state space modeling. One formulation is an encoder–bottleneck–decoder network for microscopic medical image segmentation, built around a ResNet50 feature hierarchy, a Feature Calibration Module (FCM), and an Inception Mamba Module (IMM) (Kareem et al., 13 Jun 2025). The other is a general-purpose hierarchical backbone derived from InceptionNeXt, replacing strip convolutions with orthogonal band convolutions and introducing a bottleneck Mamba-based GlobalMixer for classification and downstream vision tasks (Wang et al., 10 Jun 2025). In both cases, the architectural premise is the same: local multi-scale spatial encoding is handled by efficient depth-wise convolutional branches, while long-range dependency modeling is delegated to a selective state space component.
1. Nomenclature, scope, and common design premise
The term InceptionMamba does not denote a single canonical architecture. In the June 2025 literature, it names two separate models with different task scopes and internal organizations: "InceptionMamba: Efficient Multi-Stage Feature Enhancement with Selective State Space Model for Microscopic Medical Image Segmentation" (Kareem et al., 13 Jun 2025) and "InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba" (Wang et al., 10 Jun 2025). The former is a task-specific segmentation network; the latter is a backbone family for ImageNet-1K classification, COCO detection and instance segmentation, and ADE20K semantic segmentation.
Despite that divergence, the two works share a common architectural logic. Each combines an Inception-style multi-branch token mixer with a Mamba-derived state-space component rather than using a purely convolutional or purely attention-based design. In the segmentation model, the hybridization occurs inside IMM, where identity, depth-wise convolution, and Mamba branches are concatenated after a channel split (Kareem et al., 13 Jun 2025). In the backbone model, the hybridization is factorized into a ConvMixer for local spatial aggregation and a bottleneck SS2D/Mamba GlobalMixer for global context and channel interaction (Wang et al., 10 Jun 2025).
This parallel use of the same name has a practical implication for the literature: references to InceptionMamba must be disambiguated by task domain and arXiv identifier. A plausible implication is that the name is better understood as a design motif—Inception-style efficient spatial mixing plus selective state-space modeling—than as a single architecture family.
2. Segmentation-oriented InceptionMamba for microscopic medical imaging
The segmentation formulation is an efficient encoder–bottleneck–decoder network designed for microscopic medical image segmentation, with the stated aim of preserving fine boundaries while handling irregular shapes, scale changes, overlap, clutter, and blurred cell or tissue edges (Kareem et al., 13 Jun 2025). Its components are a CNN backbone for hierarchical multi-stage features, a Feature Calibration Module to enrich stage-wise semantics and sharpen blurry boundaries, a hybrid Inception Mamba Module, a lightweight decoder without dense skip connections, and a final fusion with low-level stem features before segmentation prediction.
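Let the input image be
$$X \in \mathbb{R}^{H \times W \times 3}.$$
A ResNet50 backbone extracts
$$\{F_0, F_1, F_2, F_3\},$$
where $F_0$ is the stem-layer feature map and $F_1, F_2, F_3$ are the outputs of the first three ResNet stages. The architecture does not use the fourth ResNet stage, because the authors found the highest-level low-resolution features less beneficial for microscopic segmentation and more expensive (Kareem et al., 13 Jun 2025).
At a high level, the stage features $F_1, F_2, F_3$ are spatially aligned and calibrated by FCMs, the calibrated features are then concatenated and projected into a bottleneck representation, and the decoder processes that representation. The final decoder output is fused with a projected version of the stem feature $F_0$ before the segmentation head predicts the mask (Kareem et al., 13 Jun 2025).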
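Before bottleneck fusion, the three backbone stages are spatially aligned. Writing $F_i$ for the stage-$i$ feature map, stage-1 features are downsampled by a factor of 4, stage-2 features by a factor of 2, and stage-3 features are used directly:
$$\tilde{F}_1 = \mathrm{Down}_{4}(F_1), \qquad \tilde{F}_2 = \mathrm{Down}_{2}(F_2), \qquad \tilde{F}_3 = F_3.$$
Each aligned stage is passed through an FCM,
$$\hat{F}_i = \mathrm{FCM}(\tilde{F}_i), \qquad i \in \{1, 2, 3\},$$
and the calibrated features are concatenated and projected into a unified bottleneck representation (Kareem et al., 13 Jun 2025).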
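The alignment and concatenation step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: average pooling stands in for the downsampling operator, the channel counts and spatial sizes are hypothetical, and the FCM calibration and the final projection are omitted.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a (C, H, W) array by an integer factor k (assumed operator)."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def align_stages(f1, f2, f3):
    """Bring stage-1/2/3 features to the stage-3 resolution and concatenate them."""
    a1 = avg_pool(f1, 4)   # stage 1: downsample by a factor of 4
    a2 = avg_pool(f2, 2)   # stage 2: downsample by a factor of 2
    a3 = f3                # stage 3: used directly
    return np.concatenate([a1, a2, a3], axis=0)  # channel-wise concatenation

# Hypothetical channel counts and sizes, chosen only for illustration.
f1 = np.ones((4, 32, 32))
f2 = np.ones((8, 16, 16))
f3 = np.ones((16, 8, 8))
fused = align_stages(f1, f2, f3)
print(fused.shape)
```

After alignment all three stages share the stage-3 spatial size, so the channel dimension of the fused tensor is simply the sum of the three channel counts.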
The decoder is intentionally simple. It applies convolution and upsampling, then an IMM, then further convolution and upsampling. The paper's placement ablation compares inserting the IMM at the first, second, and third decoder positions and reports the highest Dice score for the middle position (Kareem et al., 13 Jun 2025). At the end, the decoder output $D$ is fused additively with a convolutional projection of the backbone stem feature $F_0$:
$$Y = D + \mathrm{Conv}(F_0).$$
The paper does not describe an explicit attention-based fusion formula; fusion is mainly done by concatenation followed by a convolutional projection, residual addition, and final addition with stem features (Kareem et al., 13 Jun 2025).
3. Feature Calibration Module, IMM, and selective state-space integration
A central idea of the segmentation model is that stage-wise features are enriched by simultaneously capturing low-frequency and high-frequency cues to better separate overlapping structures and blurred boundaries (Kareem et al., 13 Jun 2025). This is implemented by the Feature Calibration Module.
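For a stage feature map $F$, FCM first constructs a smoothed feature by a downsample–upsample path:
$$F_{s} = \mathrm{Up}(\mathrm{Down}(F)).$$
High-frequency detail is then extracted by subtraction,
$$F_{h} = F - F_{s},$$
while low-frequency or blob emphasis is obtained by multiplication,
$$F_{l} = F \odot F_{s}.$$
After convolutional transforms $\phi_h$ and $\phi_l$, concatenation, a convolutional projection, and residual addition, the calibrated output is
$$F_{\mathrm{out}} = F + \mathrm{Proj}\big(\mathrm{Concat}(\phi_h(F_h),\, \phi_l(F_l))\big).$$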
This formulation explicitly encodes the paper’s distinction between boundary-enhancing high-frequency cues and region-emphasizing low-frequency cues (Kareem et al., 13 Jun 2025).
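A minimal NumPy sketch of the two cues may help make the distinction concrete. It assumes 2x average pooling and nearest-neighbour upsampling for the smoothing path, assumes the multiplication is taken between the feature and its smoothed version, and leaves out the convolutional transforms, projection, and residual; none of these specifics are confirmed by the paper.

```python
import numpy as np

def down2(x):
    """2x average pooling on a (C, H, W) array (assumed downsampler)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def up2(x):
    """2x nearest-neighbour upsampling (assumed upsampler)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fcm_cues(f):
    """Return (smoothed, high-frequency, low-frequency) cues for one feature map."""
    smooth = up2(down2(f))   # downsample-upsample path
    high = f - smooth        # boundary-like detail by subtraction
    low = f * smooth         # region emphasis by element-wise multiplication
    return smooth, high, low

# Single-channel 4x4 map with a vertical edge between columns 0 and 1.
f = np.zeros((1, 4, 4))
f[:, :, 1:] = 1.0
smooth, high, low = fcm_cues(f)
print(high[0, 0])  # non-zero only in the two columns next to the edge
print(low[0, 0])   # largest inside the flat foreground region
```

In this toy input the subtraction responds only around the edge, while the product stays at full strength inside the uniform region, which is the qualitative behaviour the text attributes to the two cues.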
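The Inception Mamba Module is the contextual block of the network. Given an input feature map $X$, IMM splits it channel-wise into five subsets:
$$X \rightarrow (X_1, X_2, X_3, X_4, X_5).$$
The paper states that $X_1$ is an identity branch, $X_2, X_3, X_4$ are depth-wise convolution branches, and $X_5$ is a Mamba branch. The output is
$$Y = \mathrm{Concat}(Y_1, Y_2, Y_3, Y_4, Y_5),$$
with
$$Y_1 = X_1, \qquad Y_k = \mathrm{DWConv}_k(X_k)\ \ (k = 2, 3, 4), \qquad Y_5 = \mathrm{Mamba}(X_5).$$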
The depth-wise branches use square and rectangular convolution kernels to obtain smaller and larger receptive fields, but the text does not enumerate exact kernel sizes (Kareem et al., 13 Jun 2025). Unlike classical Inception modules, pooling is replaced by identity; the authors explicitly state they empirically found this gives better segmentation performance than using max pooling inside IMM.
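The split-process-concatenate pattern of IMM can be sketched as follows. This is a structural illustration only: the equal five-way split is an assumption (the paper does not give allocation ratios), and the branch functions are hypothetical shape-preserving stand-ins rather than real depth-wise convolutions or a real Mamba layer.

```python
import numpy as np

def imm_forward(x, dw_branches, global_branch):
    """Split (C, H, W) features into five channel groups, mix, and concatenate."""
    groups = np.array_split(x, 5, axis=0)        # equal split is an assumption
    outs = [groups[0]]                           # identity branch
    for g, branch in zip(groups[1:4], dw_branches):
        outs.append(branch(g))                   # three depth-wise conv branches
    outs.append(global_branch(groups[4]))        # Mamba branch (stand-in below)
    return np.concatenate(outs, axis=0)

# Hypothetical stand-ins that only preserve shape.
dw_stand_ins = [lambda g: 0.5 * g] * 3
global_stand_in = lambda g: g + g.mean(axis=(1, 2), keepdims=True)

x = np.arange(10 * 4 * 4, dtype=float).reshape(10, 4, 4)
y = imm_forward(x, dw_stand_ins, global_stand_in)
print(y.shape)
```

Because only one of the five groups reaches the global branch, the cost of the sequence model scales with a fraction of the channels rather than with the full feature width.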
The Mamba branch is introduced to encode global representations, retain low computational cost relative to self-attention, and reduce semantic redundancy through the selective scan mechanism (Kareem et al., 13 Jun 2025). For a flattened sequence $x = (x_1, \dots, x_L)$, the standard selective state-space formulation used by Mamba is
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$
with recurrence
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
The segmentation paper does not redefine Mamba internals; it explicitly relies on the standard Mamba design from Gu and Dao, so the equations here are the standard selective state-space equations associated with Mamba rather than a paper-specific variant (Kareem et al., 13 Jun 2025).
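For readers who prefer code, the discrete recurrence can be written as a plain sequential scan. The sketch below assumes a single channel, a diagonal state matrix, a zero-order-hold discretization of $A$, and the simplified $\bar{B} \approx \Delta B$ commonly used with Mamba; the function name and argument layout are illustrative, not taken from either paper, and real implementations use a fused parallel scan instead of a Python loop.

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Sequential selective scan for one channel with a diagonal state matrix.

    x: (L,) inputs; A: (N,) diagonal of the state matrix;
    B, C: (L, N) input-dependent projections; delta: (L,) step sizes.
    """
    L, N = B.shape
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        A_bar = np.exp(delta[t] * A)   # zero-order hold for the state matrix
        B_bar = delta[t] * B[t]        # simplified discretization of B
        h = A_bar * h + B_bar * x[t]   # state update
        y[t] = C[t] @ h                # readout
    return y

# Impulse response with a single decaying state (illustrative values).
x = np.array([1.0, 0.0, 0.0])
A = np.array([-1.0])
B = np.ones((3, 1))
C = np.ones((3, 1))
delta = np.ones(3)
y = selective_scan(x, A, B, C, delta)
print(y)
```

With a negative diagonal entry the impulse decays geometrically; the "selective" behaviour comes from letting `B`, `C`, and `delta` vary with the position `t`.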
An important misconception is that the segmentation model is Mamba-dominant. It is not. Mamba is used only as one branch inside IMM after channel splitting, while the remaining branches are identity and depth-wise convolutions. This selective placement is a primary efficiency mechanism, and the paper’s ablation reports that using only Mamba improves performance but increases GFLOPs, whereas using Inception depth-wise convolution plus Mamba improves more while keeping cost almost unchanged relative to baseline (Kareem et al., 13 Jun 2025).
4. Efficiency, ablations, and empirical results of the segmentation model
The segmentation paper reports state-of-the-art performance on SegPC21, GlaS, ISIC2017, and ISIC2018 while reducing computational cost by about 5 times compared to the previous best performing method (Kareem et al., 13 Jun 2025). On SegPC21, it reports parameter counts and GFLOPs for three configurations:
- GA2-Net
- InceptionMamba with ResNet50 backbone
- InceptionMamba with PVT-V2-B2 backbone
The paper states that, compared with GA2-Net, the ResNet50-based InceptionMamba has fewer parameters and substantially fewer GFLOPs. This is the basis for the “about 5 times” computational saving (Kareem et al., 13 Jun 2025).
The ablation trajectory on SegPC21 is reported as follows.
| Configuration | Params / GFLOPs | Dice |
|---|---|---|
| Baseline | 11.42M / 5.27 | 89.05 |
| Baseline + FCM + IMM | 11.54M / 5.30 | 91.5 |
| Baseline + FCM + IMM + decoder IMM | 11.67M / 6.09 | 92.05 |
| Final model | 11.92M / 6.72 | 92.56 |
These numbers show that the performance gain is obtained with modest additional cost over the baseline while remaining much cheaper than the previously best-performing method (Kareem et al., 13 Jun 2025). The paper also compares self-attention replacements, reporting parameters and GFLOPs for Baseline + Self-Attention and for Baseline + IMM with self-attention replacing Mamba. These comparisons support the claim that the selective state-space branch is more efficient than self-attention in the same context (Kareem et al., 13 Jun 2025).
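The cost and accuracy trade-off in the ablation table can be made explicit with a little arithmetic. The snippet below uses only the values listed in that table; the helper name is hypothetical.

```python
# (params in M, GFLOPs, Dice) copied from the SegPC21 ablation table.
ABLATION = {
    "Baseline": (11.42, 5.27, 89.05),
    "Baseline + FCM + IMM": (11.54, 5.30, 91.5),
    "Baseline + FCM + IMM + decoder IMM": (11.67, 6.09, 92.05),
    "Final model": (11.92, 6.72, 92.56),
}

def overhead_vs_baseline(name):
    """Return (% extra params, % extra GFLOPs, Dice gain) relative to the baseline row."""
    p0, f0, d0 = ABLATION["Baseline"]
    p, f, d = ABLATION[name]
    return 100 * (p - p0) / p0, 100 * (f - f0) / f0, d - d0

for name in ABLATION:
    dp, df, dd = overhead_vs_baseline(name)
    print(f"{name}: +{dp:.1f}% params, +{df:.1f}% GFLOPs, {dd:+.2f} Dice")
```

For the final model this works out to roughly 4.4% more parameters and 27.5% more GFLOPs than the baseline, in exchange for a 3.51-point Dice improvement.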
Several architectural decisions are explicitly tied to this efficiency profile. The model uses only the first three backbone stages, excludes dense skip connections because they increase computational cost while yielding negligible gains, employs depth-wise convolutions in IMM, applies Mamba only to a split of channels, and uses a lightweight decoder with a single effective IMM placement (Kareem et al., 13 Jun 2025). Training uses PyTorch 2.1.1+cu118 on a 32 GB Tesla V100 GPU, with random rotation and random flipping as data augmentation, together with the loss defined in the paper. SegPC21, ISIC2017, and ISIC2018 share one batch size, learning rate, and epoch budget and are optimized with Adam; GlaS uses its own batch size and initial learning rate, also with Adam, and follows a protocol of three times 5-fold cross-validation with ensemble averaging over the five models at inference (Kareem et al., 13 Jun 2025).
5. InceptionNeXt-derived InceptionMamba backbone
The second InceptionMamba is a hybrid CNN–SSM backbone built by taking InceptionNeXt as the starting point and improving two weaknesses identified by the authors: limited ability to capture spatial dependencies along different dimensions because of parallel one-dimensional strip convolutions, and limited global context modeling due to the locality of convolution operations (Wang et al., 10 Jun 2025). To address these, the architecture makes two core modifications to the InceptionNeXt block: ConvMixer replaces 1D strip convolutions with orthogonal band convolutions, and GlobalMixer inserts a bottleneck Mamba or SS2D module for global context and channel interaction.
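For an input feature map $X$, an InceptionMamba block first applies the ConvMixer for local spatial aggregation and the GlobalMixer for global context, then normalization and a channel MLP with a residual connection and a fixed expansion ratio (Wang et al., 10 Jun 2025).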
The ConvMixer splits the channels into three groups, processes each group with its own branch, and concatenates the branch outputs along the channel dimension. The branch allocation ratio reported in the text matches the conv group ratio in the architecture table (Wang et al., 10 Jun 2025). Compared with the strip convolutions of InceptionNeXt, the two orthogonal band kernels provide wider two-dimensional support around each principal axis.
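The GlobalMixer applies bottleneck Mamba in compressed channel space: the features are projected to a reduced channel dimension set by a bottleneck ratio, the SS2D/Mamba operator is applied there, and the result is projected back (Wang et al., 10 Jun 2025). The paper summarizes the underlying state-space model by the continuous-time equations
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$
their discretization
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B,$$
and the recurrence
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
An equivalent convolution form is also given through
$$\bar{K} = \big(C\bar{B},\ C\bar{A}\bar{B},\ \dots,\ C\bar{A}^{L-1}\bar{B}\big), \qquad y = x * \bar{K}.$$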
This is the theoretical basis for long-range, linear-complexity sequence modeling in the vision setting (Wang et al., 10 Jun 2025).
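The equivalence between the recurrent and convolutional views can be checked numerically for a small time-invariant system. The sketch below uses arbitrary illustrative matrices and hypothetical function names; it demonstrates the general state-space identity, not the paper's SS2D implementation.

```python
import numpy as np

def ssm_recurrent(x, A_bar, B_bar, C):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t step by step."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_kernel(A_bar, B_bar, C, L):
    """Build K[k] = C A_bar^k B_bar for k = 0..L-1."""
    K, v = [], B_bar.copy()
    for _ in range(L):
        K.append(C @ v)
        v = A_bar @ v
    return np.array(K)

def ssm_convolutional(x, A_bar, B_bar, C):
    """Compute the same output as a causal convolution with the SSM kernel."""
    L = len(x)
    K = ssm_kernel(A_bar, B_bar, C, L)
    return np.convolve(x, K)[:L]   # keep the first L (causal) outputs

# Arbitrary illustrative parameters.
A_bar = np.array([[0.9, 0.1], [0.0, 0.5]])
B_bar = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])
x = np.random.default_rng(0).standard_normal(16)

y_rec = ssm_recurrent(x, A_bar, B_bar, C)
y_conv = ssm_convolutional(x, A_bar, B_bar, C)
print(np.allclose(y_rec, y_conv))
```

The fixed kernel exists only when the discretized parameters do not depend on the input; once they become input-dependent, as in Mamba's selective scan, the model is evaluated through the recurrence (in practice with a parallel scan) rather than a single global convolution.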
The full backbone follows a 4-stage hierarchical design with progressively reduced spatial resolution, followed by global average pooling and an MLP classifier (Wang et al., 10 Jun 2025). The Tiny, Small, and Base variants (InceptionMamba-T, InceptionMamba-S, and InceptionMamba-B) differ in their stage depths, and therefore in total parameters and FLOPs (Wang et al., 10 Jun 2025).
6. Backbone ablations, downstream performance, and interpretive caveats
On ImageNet-1K, the paper reports top-1 accuracies for the Tiny, Small, and Base variants (Wang et al., 10 Jun 2025). Relative to InceptionNeXt, it reports direct gains at each of the corresponding scales, together with fewer parameters and lower FLOPs. On COCO with Mask R-CNN, it reports box AP and mask AP, along with parameters and FLOPs, for InceptionMamba-T, InceptionMamba-S, and InceptionMamba-B (Wang et al., 10 Jun 2025). On ADE20K with UperNet, it reports mIoU for Tiny, Small, and Base (Wang et al., 10 Jun 2025).
The ablation results separate the contributions of local and global components. For ConvMixer variants, Top-1 accuracy is compared across a plain DWConv, InceptionDWConv2d, strip convolution, and orthogonal band convolution at matched parameters and FLOPs (Wang et al., 10 Jun 2025). The gain of orthogonal band convolution over strip convolution is modest but consistent. For GlobalMixer variants, the larger improvement comes from SS2D: the comparison covers no Bottleneck Mamba, Bottleneck + GELU, Bottleneck + DWConv7×7, Bottleneck + attention, and Bottleneck + SS2D (Wang et al., 10 Jun 2025).
A separate efficiency table compares plain SS2D with Bottleneck + SS2D in terms of parameters, FLOPs, Top-1 accuracy, and throughput in imgs/s. The text also states a 13.8% parameter reduction and a 15.2% FLOPs reduction (Wang et al., 10 Jun 2025). Taken together, these numbers support the paper’s framing of bottleneck SS2D as the preferred accuracy-efficiency trade-off among the tested global mixers.
Two caveats are integral to an accurate reading. First, the GlobalMixer residual equation as written in the backbone paper appears inconsistent with the lines that precede it; in context, the intended residual should be the input plus the GlobalMixer-transformed feature (Wang et al., 10 Jun 2025). Second, the segmentation paper does not provide exact branch kernel sizes or exact channel allocation ratios inside IMM, whereas the backbone paper specifies its two band kernel sizes and its three-way channel split (Kareem et al., 13 Jun 2025).
This distinction helps resolve a common misconception. The segmentation InceptionMamba is not a direct transplant of the InceptionNeXt-derived backbone, and the backbone InceptionMamba is not merely the segmentation network generalized to classification. They share a hybrid principle but instantiate it differently: one uses multi-stage calibration and selective branch-wise Mamba inside a decoder-oriented segmentation pipeline, while the other uses orthogonal band convolutions and bottleneck SS2D inside a four-stage hierarchical backbone. A plausible implication is that InceptionMamba is best read as a convergent architectural label for efficient CNN–SSM hybrids rather than a single, unified model class.