
InceptionMamba Architecture Overview

Updated 4 July 2026
  • InceptionMamba Architecture is a hybrid neural design that combines multi-scale Inception convolution with selective state space modeling for robust local and global feature extraction.
  • The segmentation variant employs an encoder–bottleneck–decoder setup with a Feature Calibration Module and Inception Mamba Module to enhance boundary precision and achieve high Dice scores with fewer GFLOPs.
  • The backbone variant replaces traditional strip convolutions with orthogonal band convolutions and integrates a bottleneck GlobalMixer, boosting performance on classification, detection, and segmentation benchmarks.

InceptionMamba is a name applied to two distinct 2025 neural architectures that combine Inception-style convolutional decomposition with Mamba-based selective state space modeling. One formulation is an encoder–bottleneck–decoder network for microscopic medical image segmentation, built around a ResNet50 feature hierarchy, a Feature Calibration Module (FCM), and an Inception Mamba Module (IMM) (Kareem et al., 13 Jun 2025). The other is a general-purpose hierarchical backbone derived from InceptionNeXt, replacing strip convolutions with orthogonal band convolutions and introducing a bottleneck Mamba-based GlobalMixer for classification and downstream vision tasks (Wang et al., 10 Jun 2025). In both cases, the architectural premise is the same: local multi-scale spatial encoding is handled by efficient depth-wise convolutional branches, while long-range dependency modeling is delegated to a selective state space component.

1. Nomenclature, scope, and common design premise

The term InceptionMamba does not denote a single canonical architecture. In the June 2025 literature, it names two separate models with different task scopes and internal organizations: "InceptionMamba: Efficient Multi-Stage Feature Enhancement with Selective State Space Model for Microscopic Medical Image Segmentation" (Kareem et al., 13 Jun 2025) and "InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba" (Wang et al., 10 Jun 2025). The former is a task-specific segmentation network; the latter is a backbone family for ImageNet-1K classification, COCO detection and instance segmentation, and ADE20K semantic segmentation.

Despite that divergence, the two works share a common architectural logic. Each combines an Inception-style multi-branch token mixer with a Mamba-derived state-space component rather than using a purely convolutional or purely attention-based design. In the segmentation model, the hybridization occurs inside IMM, where identity, depth-wise convolution, and Mamba branches are concatenated after a channel split (Kareem et al., 13 Jun 2025). In the backbone model, the hybridization is factorized into a ConvMixer for local spatial aggregation and a bottleneck SS2D/Mamba GlobalMixer for global context and channel interaction (Wang et al., 10 Jun 2025).

This parallel use of the same name has a practical implication for the literature: references to InceptionMamba must be disambiguated by task domain and arXiv identifier. A plausible implication is that the name is better understood as a design motif—Inception-style efficient spatial mixing plus selective state-space modeling—than as a single architecture family.

2. Segmentation-oriented InceptionMamba for microscopic medical imaging

The segmentation formulation is an efficient encoder–bottleneck–decoder network designed for microscopic medical image segmentation, with the stated aim of preserving fine boundaries while handling irregular shapes, scale changes, overlap, clutter, and blurred cell or tissue edges (Kareem et al., 13 Jun 2025). Its components are a CNN backbone for hierarchical multi-stage features, a Feature Calibration Module to enrich stage-wise semantics and sharpen blurry boundaries, a hybrid Inception Mamba Module, a lightweight decoder without dense skip connections, and a final fusion with low-level stem features before segmentation prediction.

Let the input image be

$$I \in \mathbb{R}^{C \times H \times W}.$$

A ResNet50 backbone extracts

$$X_i,\quad i \in \{0,1,2,3\},$$

where $X_0$ is the stem-layer feature map and $X_1, X_2, X_3$ are the first three ResNet stages. The architecture does not use the fourth ResNet stage, because the authors found the highest-level low-resolution features less beneficial for microscopic segmentation and more expensive (Kareem et al., 13 Jun 2025).

The high-level flow is

$$\{X_1,X_2,X_3\}\xrightarrow{\text{stage alignment + FCM}}\{\bar F_1,\bar F_2,\bar F_3\}$$

followed by

$$X_r=\text{Conv}_{1\times1}\big(\text{Concat}[\bar F_1,\bar F_2,\bar F_3]\big),$$

then

$$X_c=\text{IMM}(X_r),\qquad X_b=X_r+X_c.$$

The decoder processes $X_b$, and the final decoder output is fused with a projected version of the stem feature $X_0$ before the segmentation head predicts the mask $M$ (Kareem et al., 13 Jun 2025).

Before bottleneck fusion, the three backbone stages are spatially aligned. Stage-1 features are downsampled by a factor of 4, stage-2 features by a factor of 2, and stage-3 features are used directly:

$$\tilde X_1 = \text{Down}_{4}(X_1),\qquad \tilde X_2 = \text{Down}_{2}(X_2),\qquad \tilde X_3 = X_3.$$

Each aligned stage is passed through an FCM,

$$\bar F_i = \text{FCM}(\tilde X_i),\quad i \in \{1,2,3\},$$

and the calibrated features are concatenated and projected into a unified bottleneck representation (Kareem et al., 13 Jun 2025).
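The alignment-and-fusion step above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: average pooling as the downsampling operator, the ResNet50-style channel widths (256, 512, 1024), the 64-channel bottleneck width, and the identity stand-in for FCM are all assumptions made for the example.

```python
import numpy as np

def avg_pool(x, k):
    """Downsample a (C, H, W) map by an integer factor k with average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def conv1x1(x, weight):
    """1x1 convolution as a per-pixel channel mix; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def fuse_stages(x1, x2, x3, fcm, weight):
    """Align three stages to the stage-3 resolution, calibrate, concatenate, project."""
    aligned = [avg_pool(x1, 4), avg_pool(x2, 2), x3]   # factors 4, 2, and 1
    calibrated = [fcm(a) for a in aligned]             # FCM is a caller-supplied function
    return conv1x1(np.concatenate(calibrated, axis=0), weight)

rng = np.random.default_rng(0)
# Assumed stage shapes for a 256x256 input (strides 4, 8, 16).
x1 = rng.standard_normal((256, 64, 64))
x2 = rng.standard_normal((512, 32, 32))
x3 = rng.standard_normal((1024, 16, 16))
w = rng.standard_normal((64, 256 + 512 + 1024)) * 0.01  # hypothetical projection weights
x_r = fuse_stages(x1, x2, x3, fcm=lambda t: t, weight=w)
print(x_r.shape)
```

Because all three stages are brought to the coarsest of the three resolutions before concatenation, the bottleneck operates on a single small feature map, which is consistent with the efficiency emphasis of the design.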

The decoder is intentionally simple. It uses convolution, upsampling, an IMM inserted in the decoder, and more convolution and upsampling. The paper compares Dice scores for placing the IMM at the first, second, and third decoder positions and finds the middle position best (Kareem et al., 13 Jun 2025). At the end, the decoder output, written here as $X_d$, is fused additively with a convolutional projection of the backbone stem feature:

$$X_{\text{out}} = X_d + \text{Conv}(X_0).$$

The paper does not describe an explicit attention-based fusion formula; fusion is mainly done by concatenation plus $1\times1$ convolution, residual addition, and final addition with stem features (Kareem et al., 13 Jun 2025).

3. Feature Calibration Module, IMM, and selective state-space integration

A central idea of the segmentation model is that stage-wise features are enriched by simultaneously capturing low-frequency and high-frequency cues to better separate overlapping structures and blurred boundaries (Kareem et al., 13 Jun 2025). This is implemented by the Feature Calibration Module.

For a stage feature map $X$, FCM first constructs a smoothed feature by a downsample–upsample path:

$$X_s = \text{Up}(\text{Down}(X)).$$

High-frequency detail is then extracted by subtraction,

$$X_h = X - X_s,$$

while low-frequency or blob emphasis is obtained by multiplication,

$$X_l = X \odot X_s.$$

After convolutional transforms $\phi_h$ and $\phi_l$, concatenation, $1\times1$ projection, and residual addition, the calibrated output is

$$\bar F = X + \text{Conv}_{1\times1}\big(\text{Concat}[\phi_h(X_h),\ \phi_l(X_l)]\big).$$
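A minimal NumPy sketch of the calibration steps described above (smoothing, subtraction, multiplication, concatenation, projection, residual). It is an illustration under stated assumptions rather than the paper's code: average pooling with nearest-neighbour upsampling stands in for the downsample–upsample path, the per-branch convolutional transforms are omitted, and `proj` is a hypothetical 1x1 projection matrix.

```python
import numpy as np

def down_up(x, k=2):
    """Smooth a (C, H, W) map: average-pool by k, then nearest-neighbour upsample by k."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))
    return pooled.repeat(k, axis=1).repeat(k, axis=2)

def fcm(x, proj):
    """Feature calibration sketch; proj has shape (C, 2C)."""
    smooth = down_up(x)
    high = x - smooth      # high-frequency detail (edges, boundaries)
    low = x * smooth       # low-frequency / blob emphasis
    both = np.concatenate([high, low], axis=0)
    # 1x1 projection back to C channels, followed by the residual connection.
    return x + np.einsum("oc,chw->ohw", proj, both)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
proj = rng.standard_normal((4, 8)) * 0.1   # hypothetical weights
print(fcm(x, proj).shape)
```

On a constant feature map the smoothed copy equals the input, so the high-frequency branch is exactly zero; only regions with local variation, such as boundaries, contribute to that branch.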

This formulation explicitly encodes the paper’s distinction between boundary-enhancing high-frequency cues and region-emphasizing low-frequency cues (Kareem et al., 13 Jun 2025).

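The Inception Mamba Module is the contextual block of the network. Given an input feature map $X$, IMM splits it channel-wise into five subsets:

$$X \rightarrow \{X_{\text{id}},\ X_{\text{dw}}^{(1)},\ X_{\text{dw}}^{(2)},\ X_{\text{dw}}^{(3)},\ X_{\text{m}}\}.$$

The paper states that $X_{\text{id}}$ is an identity branch, $X_{\text{dw}}^{(1)}, X_{\text{dw}}^{(2)}, X_{\text{dw}}^{(3)}$ are depth-wise convolution branches, and $X_{\text{m}}$ is a Mamba branch. The output is

$$\text{IMM}(X) = \text{Concat}\big[\,Y_{\text{id}},\ Y_{\text{dw}}^{(1)},\ Y_{\text{dw}}^{(2)},\ Y_{\text{dw}}^{(3)},\ Y_{\text{m}}\,\big],$$

with

$$Y_{\text{id}} = X_{\text{id}},\qquad Y_{\text{dw}}^{(k)} = \text{DWConv}_k\big(X_{\text{dw}}^{(k)}\big),\qquad Y_{\text{m}} = \text{Mamba}(X_{\text{m}}).$$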

The depth-wise branches use square and rectangular convolution kernels to obtain smaller and larger receptive fields, but the text does not enumerate exact kernel sizes (Kareem et al., 13 Jun 2025). Unlike classical Inception modules, pooling is replaced by identity; the authors explicitly state they empirically found this gives better segmentation performance than using max pooling inside IMM.
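The split-transform-concatenate pattern of IMM can be sketched as follows. Several details are assumptions for illustration only, since the paper does not enumerate them: the equal five-way channel split, the kernel sizes (3x3, 1x11, 11x1), the uniform averaging kernels standing in for learned depth-wise weights, and `global_mix_stub`, a hypothetical placeholder for the Mamba branch.

```python
import numpy as np

def dwconv(x, kh, kw):
    """Depth-wise 'same' convolution on (C, H, W) with a uniform kh x kw kernel (odd sizes)."""
    c, h, w = x.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += xp[:, i:i + h, j:j + w]   # each channel is filtered independently
    return out / (kh * kw)

def global_mix_stub(x):
    """Placeholder for the Mamba branch: adds each channel's global mean (not an SSM)."""
    return x + x.mean(axis=(1, 2), keepdims=True)

def imm(x, global_mix):
    """Split channels five ways: identity, three depth-wise convs, one global branch."""
    x_id, x_sq, x_h, x_v, x_g = np.array_split(x, 5, axis=0)
    branches = [
        x_id,                  # identity instead of pooling
        dwconv(x_sq, 3, 3),    # square kernel (assumed size)
        dwconv(x_h, 1, 11),    # horizontal rectangular kernel (assumed size)
        dwconv(x_v, 11, 1),    # vertical rectangular kernel (assumed size)
        global_mix(x_g),       # global-context branch
    ]
    return np.concatenate(branches, axis=0)

x = np.random.default_rng(0).standard_normal((10, 16, 16))
y = imm(x, global_mix_stub)
print(y.shape)
```

Only one fifth of the channels reach the global branch in this sketch, which illustrates why the module's cost stays close to that of a purely convolutional block.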

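The Mamba branch is introduced to encode global representations, retain low computational cost relative to self-attention, and reduce semantic redundancy through the selective scan mechanism (Kareem et al., 13 Jun 2025). For a flattened sequence $x = (x_1, \dots, x_L)$, the standard selective state-space formulation used by Mamba is

$$h'(t) = A\,h(t) + B\,x(t),\qquad y(t) = C\,h(t),$$

with recurrence

$$h_t = \bar A\,h_{t-1} + \bar B\,x_t,\qquad y_t = C\,h_t,$$

where $\bar A = \exp(\Delta A)$ and $\bar B = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B$ are the zero-order-hold discretizations with step size $\Delta$, and in the selective setting $B$, $C$, and $\Delta$ depend on the input.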

The segmentation paper does not redefine Mamba internals; it explicitly relies on the standard Mamba design from Gu and Dao, and the reconstruction therefore treats these as the standard selective state-space equations associated with Mamba (Kareem et al., 13 Jun 2025).
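To make the selective recurrence concrete, here is a small sequential NumPy sketch in which the step size and the B and C projections depend on the current input. It is a simplified illustration, not the Mamba reference kernel: the diagonal state matrix, the element-wise step-size weight `w_delta`, and the projection matrices `W_B` and `W_C` are assumptions, and the input matrix is discretized with the common simplification of multiplying B by the step size.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, w_delta):
    """Sequential selective SSM scan.

    x: (L, D) sequence; A: (D, N) diagonal state entries (negative);
    W_B, W_C: (D, N) input-to-parameter projections; w_delta: (D,) step-size weights.
    Returns y: (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.zeros((L, D))
    for t in range(L):
        delta = np.log1p(np.exp(x[t] * w_delta))   # softplus keeps the step size positive
        B = x[t] @ W_B                             # input-dependent B, shape (N,)
        C = x[t] @ W_C                             # input-dependent C, shape (N,)
        A_bar = np.exp(delta[:, None] * A)         # zero-order hold for diagonal A
        B_bar = delta[:, None] * B[None, :]        # simplified discretization of B
        h = A_bar * h + B_bar * x[t][:, None]      # state update
        y[t] = h @ C                               # readout
    return y

rng = np.random.default_rng(0)
L, D, N = 12, 4, 8
A = -np.exp(rng.standard_normal((D, N)))           # negative entries give a decaying state
out = selective_scan(rng.standard_normal((L, D)), A,
                     rng.standard_normal((D, N)) * 0.1,
                     rng.standard_normal((D, N)) * 0.1,
                     rng.standard_normal(D))
print(out.shape)
```

The loop costs O(L) in sequence length, in contrast to the O(L^2) pairwise interactions of self-attention; practical implementations replace the Python loop with a parallel scan.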

An important misconception is that the segmentation model is Mamba-dominant. It is not. Mamba is used only as one branch inside IMM after channel splitting, while the remaining branches are identity and depth-wise convolutions. This selective placement is a primary efficiency mechanism, and the paper’s ablation reports that using only Mamba improves performance but increases GFLOPs, whereas using Inception depth-wise convolution plus Mamba improves more while keeping cost almost unchanged relative to baseline (Kareem et al., 13 Jun 2025).

4. Efficiency, ablations, and empirical results of the segmentation model

The segmentation paper reports state-of-the-art performance on SegPC21, GlaS, ISIC2017, and ISIC2018 while reducing computational cost by about 5 times compared to the previous best-performing method (Kareem et al., 13 Jun 2025). On SegPC21, complexity is reported in millions of parameters and GFLOPs for three models:

  • GA2-Net
  • InceptionMamba with ResNet50 backbone
  • An InceptionMamba variant with PVT-V2-B2 backbone

The paper states that, compared with GA2-Net, the ResNet50-based InceptionMamba has fewer parameters and fewer GFLOPs. This comparison is the basis for the “about 5 times” computational saving (Kareem et al., 13 Jun 2025).

The ablation trajectory on SegPC21 is reported as follows.

Configuration                          Params (M)   GFLOPs   Dice
Baseline                               11.42        5.27     89.05
Baseline + FCM + IMM                   11.54        5.30     91.5
Baseline + FCM + IMM + decoder IMM     11.67        6.09     92.05
Final model                            11.92        6.72     92.56

These numbers show that the performance gain is obtained with modest additional cost over the baseline while remaining much cheaper than the previously best-performing method (Kareem et al., 13 Jun 2025). The paper also compares self-attention replacements, reporting parameters and GFLOPs for Baseline + Self-Attention and for Baseline + IMM with self-attention replacing Mamba. The reported costs support the claim that the selective state-space branch is more efficient than self-attention in the same context (Kareem et al., 13 Jun 2025).
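The incremental cost of each ablation step can be read directly off the SegPC21 ablation rows above. The short script below only does that arithmetic; the row values are copied from the table, and the helper name `deltas_vs_baseline` is introduced purely for illustration.

```python
# (configuration, params in M, GFLOPs, Dice) copied from the ablation table.
ABLATION = [
    ("Baseline", 11.42, 5.27, 89.05),
    ("Baseline + FCM + IMM", 11.54, 5.30, 91.5),
    ("Baseline + FCM + IMM + decoder IMM", 11.67, 6.09, 92.05),
    ("Final model", 11.92, 6.72, 92.56),
]

def deltas_vs_baseline(rows):
    """Return {configuration: (extra params, extra GFLOPs, Dice gain)} relative to row 0."""
    _, p0, g0, d0 = rows[0]
    return {
        name: (round(p - p0, 2), round(g - g0, 2), round(d - d0, 2))
        for name, p, g, d in rows[1:]
    }

deltas = deltas_vs_baseline(ABLATION)
for name, (dp, dg, dd) in deltas.items():
    print(f"{name}: +{dp:.2f}M params, +{dg:.2f} GFLOPs, +{dd:.2f} Dice")
```

Adding FCM and IMM to the baseline costs 0.12M parameters and 0.03 GFLOPs for a 2.45-point Dice gain, while the final model adds 0.50M parameters and 1.45 GFLOPs in total for a 3.51-point gain.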

Several architectural decisions are explicitly tied to this efficiency profile. The model uses only the first three backbone stages, excludes dense skip connections because they increase computational cost while yielding negligible gains, employs depth-wise convolutions in IMM, applies Mamba only to a split of channels, and uses a lightweight decoder with a single effective IMM placement (Kareem et al., 13 Jun 2025). Training details include PyTorch 2.1.1+cu118, a 32 GB Tesla V100 GPU, random rotation and random flipping as data augmentation, and a training loss specified in the paper.

For SegPC21, ISIC2017, and ISIC2018, the paper specifies the batch size, learning rate, and number of epochs, and the optimizer is Adam. For GlaS, the paper specifies the batch size and initial learning rate, the optimizer is again Adam, and the protocol follows three times 5-fold cross-validation with ensemble averaging over the five models at inference (Kareem et al., 13 Jun 2025).
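The GlaS protocol can be illustrated with a standard-library sketch of repeated k-fold splitting and probability averaging. This is one reading of "three times 5-fold cross-validation", namely three independently shuffled rounds of five folds; the function names, the seed, and the list-based probability maps are assumptions for the example.

```python
import random

def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, val_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                          # fresh shuffle for every repetition
        folds = [idx[i::k] for i in range(k)]     # k disjoint folds covering all samples
        for f in range(k):
            val = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train, val

def ensemble_average(prob_maps):
    """Average per-pixel probabilities predicted by several models."""
    n = len(prob_maps)
    return [sum(p) / n for p in zip(*prob_maps)]

splits = list(repeated_kfold(n_samples=10))
print(len(splits))                                # 3 repeats x 5 folds
print(ensemble_average([[0.2, 0.8], [0.4, 0.6]]))
```

Each fold's model is trained on four fifths of the data, and at inference the per-pixel probabilities of the five fold models are averaged before thresholding.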

5. InceptionNeXt-derived InceptionMamba backbone

The second InceptionMamba is a hybrid CNN–SSM backbone built by taking InceptionNeXt as the starting point and improving two weaknesses identified by the authors: limited ability to capture spatial dependencies along different dimensions because of parallel one-dimensional strip convolutions, and limited global context modeling due to the locality of convolution operations (Wang et al., 10 Jun 2025). To address these, the architecture makes two core modifications to the InceptionNeXt block: ConvMixer replaces 1D strip convolutions with orthogonal band convolutions, and GlobalMixer inserts a bottleneck Mamba or SS2D module for global context and channel interaction.

For an input feature map $X$, an InceptionMamba block first applies the ConvMixer and the GlobalMixer as token mixers, then normalization and a channel MLP with a residual connection. The MLP expansion ratio is specified in the paper (Wang et al., 10 Jun 2025).

The ConvMixer splits channels into three groups:

$$X \rightarrow \{X^{(1)}, X^{(2)}, X^{(3)}\},$$

then applies a branch-specific operator $f_k$ to each group,

$$Y^{(k)} = f_k\big(X^{(k)}\big),\quad k \in \{1,2,3\},$$

and concatenates

$$Y = \text{Concat}\big[Y^{(1)}, Y^{(2)}, Y^{(3)}\big].$$

The paper reports the branch allocation ratio, which matches the conv group ratio in its architecture table (Wang et al., 10 Jun 2025). Compared with the strip convolutions of InceptionNeXt, the two orthogonal band kernels provide wider two-dimensional support around each principal axis.
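The difference in spatial support between strip and band kernels can be counted directly. In the sketch below the strip sizes 1x11 and 11x1 follow the InceptionNeXt default, while the band sizes 3x11 and 11x3 are hypothetical values chosen only to illustrate the idea; the helper `support` is likewise introduced for this example.

```python
def support(kh, kw):
    """Set of (dy, dx) offsets covered by a centered kh x kw kernel (odd sizes)."""
    return {
        (dy, dx)
        for dy in range(-(kh // 2), kh // 2 + 1)
        for dx in range(-(kw // 2), kw // 2 + 1)
    }

# Combined footprint of the horizontal and vertical branches.
strip = support(1, 11) | support(11, 1)   # one-pixel-wide cross
band = support(3, 11) | support(11, 3)    # three-pixel-wide cross (assumed band width)

print(len(strip), len(band))
# Offsets reachable only with bands: neighbours just off the principal axes.
print(sorted(band - strip)[:4])
```

A strip kernel sees only the pixels lying exactly on one axis, whereas a band kernel also covers the rows or columns adjacent to that axis, at the price of proportionally more depth-wise weights per channel.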

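The GlobalMixer applies bottleneck Mamba in compressed channel space: the SS2D module operates on a channel-reduced representation whose width is set by a bottleneck ratio (Wang et al., 10 Jun 2025). The paper summarizes the underlying state-space model by the continuous-time equations

$$h'(t) = A\,h(t) + B\,x(t),\qquad y(t) = C\,h(t),$$

their discretization

$$\bar A = \exp(\Delta A),\qquad \bar B = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B,$$

and the recurrence

$$h_t = \bar A\,h_{t-1} + \bar B\,x_t,\qquad y_t = C\,h_t.$$

An equivalent convolution form is also given through

$$\bar K = \big(C\bar B,\ C\bar A\bar B,\ \dots,\ C\bar A^{L-1}\bar B\big),\qquad y = x * \bar K.$$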

This is the theoretical basis for long-range, linear-complexity sequence modeling in the vision setting (Wang et al., 10 Jun 2025).
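The equivalence between the recurrent and convolutional views of a discretized, time-invariant state-space model can be checked numerically. The sketch below assumes a diagonal state matrix and a scalar input channel for brevity, with arbitrary example values for A, B, C, and the step size; it illustrates the standard construction rather than the paper's SS2D module.

```python
import numpy as np

def discretize(A, B, delta):
    """Zero-order hold for a diagonal state matrix A (N,) and input vector B (N,)."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B     # (delta*A)^-1 (exp(delta*A) - 1) * delta*B, diagonal case
    return A_bar, B_bar

def ssm_recurrent(x, A_bar, B_bar, C):
    """h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t"""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(float(C @ h))
    return np.array(ys)

def ssm_convolutional(x, A_bar, B_bar, C):
    """y = x * K with K_k = C . (A_bar^k * B_bar), applied as a causal convolution."""
    L = len(x)
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    return np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

A = np.array([-1.0, -0.5, -2.0])      # example diagonal entries (assumed)
B = np.array([1.0, 0.5, -1.0])
C = np.array([0.3, -0.7, 1.2])
A_bar, B_bar = discretize(A, B, delta=0.1)
x = np.sin(np.arange(16) / 3.0)

y_rec = ssm_recurrent(x, A_bar, B_bar, C)
y_conv = ssm_convolutional(x, A_bar, B_bar, C)
print(np.allclose(y_rec, y_conv))
```

The convolutional form holds only when the discretized parameters are fixed across time steps; once they depend on the input, as in selective scanning, the model is evaluated with the recurrence or a parallel scan instead.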

The full backbone follows a 4-stage hierarchical design with spatial resolutions $\frac{H}{4}\times\frac{W}{4}$, $\frac{H}{8}\times\frac{W}{8}$, $\frac{H}{16}\times\frac{W}{16}$, and $\frac{H}{32}\times\frac{W}{32}$, followed by global average pooling and an MLP classifier (Wang et al., 10 Jun 2025). The Tiny, Small, and Base variants (InceptionMamba-T, InceptionMamba-S, and InceptionMamba-B) differ in their stage depths, and the paper reports total parameters and FLOPs for each (Wang et al., 10 Jun 2025).

6. Backbone ablations, downstream performance, and interpretive caveats

On ImageNet-1K, the paper reports top-1 accuracies for the Tiny, Small, and Base variants, each with a direct gain over InceptionNeXt at the corresponding scale, together with fewer parameters and lower FLOPs (Wang et al., 10 Jun 2025). On COCO with Mask R-CNN, box AP and mask AP are reported for InceptionMamba-T, InceptionMamba-S, and InceptionMamba-B alongside their parameter counts and FLOPs (Wang et al., 10 Jun 2025). On ADE20K with UperNet, mIoU is reported for Tiny, Small, and Base (Wang et al., 10 Jun 2025).

The ablation results separate the contributions of local and global components. The ConvMixer ablation compares a plain depth-wise convolution (DWConv), InceptionDWConv2d, strip convolution, and orthogonal band convolution at matched parameters and FLOPs (Wang et al., 10 Jun 2025). The gain of orthogonal band convolution over strip convolution is modest but consistent. The GlobalMixer ablation compares five settings: no Bottleneck Mamba, Bottleneck + GELU, Bottleneck + DWConv7×7, Bottleneck + attention, and Bottleneck + SS2D, with the larger improvement coming from SS2D (Wang et al., 10 Jun 2025).

A separate efficiency table compares two settings in terms of parameters, FLOPs, top-1 accuracy, and throughput (imgs/s):

  • SS2D
  • Bottleneck + SS2D

The text also states a 13.8% parameter reduction and a 15.2% FLOPs reduction (Wang et al., 10 Jun 2025). Taken together, these numbers support the paper’s framing of bottleneck SS2D as the preferred accuracy-efficiency trade-off among the tested global mixers.

Two caveats are integral to an accurate reading. First, the GlobalMixer residual equation as written in the backbone paper appears inconsistent with the lines that precede it; in context, the intended residual is the input plus the GlobalMixer-transformed feature (Wang et al., 10 Jun 2025). Second, the segmentation paper does not provide exact branch kernel sizes or exact channel allocation ratios inside IMM, whereas the backbone paper specifies both the band kernel sizes and the channel split (Kareem et al., 13 Jun 2025).

This distinction helps resolve a common misconception. The segmentation InceptionMamba is not a direct transplant of the InceptionNeXt-derived backbone, and the backbone InceptionMamba is not merely the segmentation network generalized to classification. They share a hybrid principle but instantiate it differently: one uses multi-stage calibration and selective branch-wise Mamba inside a decoder-oriented segmentation pipeline, while the other uses orthogonal band convolutions and bottleneck SS2D inside a four-stage hierarchical backbone. A plausible implication is that InceptionMamba is best read as a convergent architectural label for efficient CNN–SSM hybrids rather than a single, unified model class.
