- The paper introduces a hybrid CNN architecture combining Inception-style multi-branch design with a bottleneck Mamba module for enhanced spatial and global modeling.
- It employs orthogonal large-kernel band convolutions within a ConvMixer module to efficiently capture local spatial features.
- Experimental results on ImageNet, COCO, and ADE20K confirm state-of-the-art performance with reduced computational costs.
InceptionMamba: An Efficient Hybrid Network
The paper introduces InceptionMamba (2506.08735), a hybrid CNN architecture that combines the strengths of InceptionNeXt and Mamba to achieve state-of-the-art performance with improved parameter and computational efficiency. The key innovation lies in replacing traditional one-dimensional strip convolutions with orthogonal band convolutions for cohesive spatial modeling, and in incorporating a bottleneck Mamba module to facilitate inter-channel information fusion and enlarge the receptive field. This hybrid approach aims to leverage the efficient parallel structure of Inception-style networks while enhancing global contextual modeling capabilities through the Mamba architecture.
Architectural Overview
The InceptionMamba architecture, illustrated in Figure 1, maintains a four-stage hierarchical structure inspired by ConvNeXt and InceptionNeXt.
Figure 1: InceptionMamba employs a hierarchical architecture of four consecutive stages, each consisting of a patch embedding layer or a downsampling module, combined with N_i InceptionMamba blocks.
Each stage comprises a patch embedding layer or a downsampling module, followed by a series of InceptionMamba blocks. Within each block, the input tensor is processed by a ConvMixer module that encodes local spatial information and a GlobalMixer module that models global context and captures long-range dependencies; normalization and MLP modules then refine the aggregated features. The InceptionMamba block, the cornerstone of the architecture, pairs carefully designed orthogonal band convolutions in the ConvMixer with a bottleneck Mamba module in the GlobalMixer.
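This block-level data flow can be sketched in PyTorch as below. This is a minimal reconstruction rather than the authors' code: the use of BatchNorm, the MLP expansion ratio of 4, and the placement of the residual connection are assumptions, and the two mixers are stubbed with `nn.Identity` (concrete sketches follow in the next sections).

```python
import torch
import torch.nn as nn

class InceptionMambaBlock(nn.Module):
    """Hypothetical block layout: ConvMixer -> GlobalMixer -> Norm -> MLP."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.conv_mixer = nn.Identity()    # local spatial encoding (sketched below)
        self.global_mixer = nn.Identity()  # global context modeling (sketched below)
        self.norm = nn.BatchNorm2d(dim)    # normalization type is an assumption
        self.mlp = nn.Sequential(          # channel MLP implemented as 1x1 convolutions
            nn.Conv2d(dim, mlp_ratio * dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(mlp_ratio * dim, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.global_mixer(self.conv_mixer(x))  # local mixing, then global mixing
        return x + self.mlp(self.norm(x))          # residual refinement (assumed placement)

x = torch.randn(1, 96, 56, 56)                      # e.g. a stage-1 feature map
assert InceptionMambaBlock(96)(x).shape == x.shape  # the block preserves tensor shape
```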
Key Components and Implementation Details
ConvMixer with Large-Kernel Band Convolutions
The ConvMixer module addresses the limitations of single-direction large kernels by employing a multi-branch structure inspired by SLaK and InceptionNeXt. This parallel structure accelerates large-kernel depthwise convolutions for efficient local spatial modeling. The key innovation here is the replacement of the one-dimensional strip convolutions used in InceptionNeXt with orthogonal large-kernel band convolutions; because bands are wider than strips, they cover larger contiguous regions and improve local modeling.
The input features X are divided into three groups along the channel dimension: X_square, X_band, and X_identity. These groups are processed in parallel through different branches: a 3×3 depthwise convolution for X_square, a combination of 3×11 and 11×3 depthwise convolutions for X_band, and an identity mapping for X_identity. The branch outputs are then concatenated to form the output feature map.
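A minimal PyTorch sketch of this branching scheme follows. The equal three-way channel split and the sequential composition of the 3×11 and 11×3 band convolutions are illustrative assumptions; the paper ablates the branch ratio, and the exact configuration may differ.

```python
import torch
import torch.nn as nn

class ConvMixer(nn.Module):
    """Sketch of the multi-branch ConvMixer with orthogonal band convolutions."""
    def __init__(self, dim: int, band_kernel: int = 11):
        super().__init__()
        g = dim // 3                        # per-branch channels (equal split assumed)
        self.split_sizes = (g, g, dim - 2 * g)
        self.square = nn.Conv2d(g, g, kernel_size=3, padding=1, groups=g)  # 3x3 depthwise
        self.band = nn.Sequential(          # orthogonal 3x11 and 11x3 depthwise convs
            nn.Conv2d(g, g, kernel_size=(3, band_kernel),
                      padding=(1, band_kernel // 2), groups=g),
            nn.Conv2d(g, g, kernel_size=(band_kernel, 3),
                      padding=(band_kernel // 2, 1), groups=g),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_square, x_band, x_identity = torch.split(x, self.split_sizes, dim=1)
        # the identity branch passes through unchanged; outputs are re-concatenated
        return torch.cat([self.square(x_square), self.band(x_band), x_identity], dim=1)
```

Keeping the identity branch convolution-free is what makes the multi-branch layout cheaper than applying a single large kernel across all channels.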
GlobalMixer with Bottleneck Mamba
To enhance cross-channel interaction and reduce channel redundancy, a bottleneck Mamba structure is introduced. It uses 1×1 convolutions to compress and then expand the feature channels, enabling efficient cross-channel fusion in a low-dimensional space while retaining key information. A state-space module (SS2D) is integrated within the bottleneck to strengthen global modeling capacity while keeping computation low through channel compression. The GlobalMixer module transforms its input X′ as follows:
$$X'' = \mathrm{Conv}_{1\times 1}^{C \to C/r}(X')$$
$$X'' = \mathrm{SS2D}(X'')$$
$$X'' = \mathrm{Conv}_{1\times 1}^{C/r \to C}(X'')$$
$$Y = X' + X''$$
where r is a channel compression ratio.
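These equations translate directly into a small PyTorch module. In the sketch below, SS2D is stubbed with `nn.Identity` because the 2D selective-scan implementation (from VMamba) is not reproduced here, and the default compression ratio r = 4 is an assumption.

```python
from typing import Optional

import torch
import torch.nn as nn

class GlobalMixer(nn.Module):
    """Bottleneck Mamba sketch: compress -> SS2D -> expand, with a residual."""
    def __init__(self, dim: int, r: int = 4, ss2d: Optional[nn.Module] = None):
        super().__init__()
        self.compress = nn.Conv2d(dim, dim // r, kernel_size=1)  # C -> C/r
        self.ss2d = ss2d if ss2d is not None else nn.Identity()  # SS2D placeholder
        self.expand = nn.Conv2d(dim // r, dim, kernel_size=1)    # C/r -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.compress(x)  # cross-channel fusion in a low-dimensional space
        y = self.ss2d(y)      # global modeling on the compressed channels
        y = self.expand(y)
        return x + y          # residual connection (Y = X' + X'')
```

Since the state-space module's cost grows with channel width, running it on C/r channels rather than C cuts its compute and parameters roughly by the compression factor, which is the efficiency argument behind the bottleneck.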
Block Comparisons
Figure 2 compares the building blocks of ConvNeXt, InceptionNeXt, and InceptionMamba.
Figure 2: Comparison of ConvNeXt, InceptionNeXt, and InceptionMamba blocks, highlighting the orthogonal band convolutions and bottleneck Mamba structure for improved spatial and global modeling.
The InceptionMamba block inherits InceptionNeXt's parallel multi-branch structure, enabling efficient computation of large-kernel depthwise convolutions. On top of this, it adds orthogonal band convolutions and a GlobalMixer module with a bottleneck Mamba structure that captures long-range dependencies through cross-channel interaction. By contrast, ConvNeXt and InceptionNeXt lack a GlobalMixer-like component for sufficient information fusion and thus exhibit limited global modeling capacity.
Experimental Results
The effectiveness of InceptionMamba was validated through extensive experiments on image classification, object detection, and semantic segmentation tasks using benchmark datasets such as ImageNet-1K, MS-COCO, and ADE20K.
Image Classification on ImageNet
On ImageNet-1K, InceptionMamba-B achieved 84.7% top-1 accuracy with 83M parameters and 14.3G FLOPs, surpassing existing state-of-the-art models, including InceptionNeXt and Mamba-based frameworks such as MambaVision and VMamba. InceptionMamba-T attained 83.1% top-1 accuracy with 25.4M parameters and 4.0G FLOPs, outperforming both Mamba-based and CNN models such as VMamba-T (82.2%) and InceptionNeXt-T (82.3%).
Object Detection and Instance Segmentation on COCO
For object detection and instance segmentation on MS-COCO, InceptionMamba consistently outperformed competing architectures, including ConvNeXt, Swin Transformer, and Mamba-based networks. Specifically, InceptionMamba-T achieved 46.0% AP^b and 41.8% AP^m with 43M parameters and 233G FLOPs, surpassing MambaOut-T and ConvNeXt-T by significant margins. The larger InceptionMamba-B model reported the highest scores, 48.1% AP^b and 43.1% AP^m, outperforming MambaOut-B and ConvNeXt-B while reducing computational costs.
Semantic Segmentation on ADE20K
On the ADE20K dataset, InceptionMamba-B achieved 50.1% mIoU with 110M parameters and 1145G FLOPs, outperforming competing networks in both accuracy and efficiency. These results highlight the potential of the hybrid architecture in combining local-aware encoding and global context modeling.
Ablation Studies
Ablation studies evaluated the impact of different components and configurations within InceptionMamba: the branch split ratio within ConvMixer, the benefits of the bottleneck structure within GlobalMixer, and the contribution of the individual modules inside the bottleneck. The results demonstrate that the proposed ConvMixer design, the bottleneck structure in GlobalMixer, and the integration of SS2D each contribute to the overall performance and efficiency of the architecture. Figure 3 illustrates the substantial channel redundancy that motivates the bottleneck design.
Figure 3: Visualized feature maps along consecutive channels in an intermediate layer of a pretrained VMamba model illustrating substantial redundancies in channel information.
Qualitative comparisons using class activation maps (CAMs), shown in Figure 4, further demonstrate InceptionMamba's advantage in capturing object-aware regions more comprehensively and accurately than ConvNeXt and InceptionNeXt.
Figure 4: Comparison of CAMs generated from different mainstream architectures, demonstrating InceptionMamba's ability to characterize key semantic-aware regions.
Conclusion
The InceptionMamba architecture effectively combines the strengths of InceptionNeXt and Mamba, achieving state-of-the-art performance with improved efficiency across various vision tasks. The orthogonal band convolutions and bottleneck Mamba module contribute to enhanced spatial modeling and long-range dependency capture. The extensive experiments and ablation studies validate the effectiveness of the proposed hybrid architecture, offering a promising direction for designing lightweight yet powerful vision backbones.