InceptionNeXt-Tiny: Efficient CNN Design

  • The paper introduces InceptionNeXt-Tiny, which decomposes large-kernel depthwise convolutions into four specialized branches, enhancing training (+57%) and inference (+20%) speeds.
  • InceptionNeXt-Tiny is a convolutional neural network with a four-stage design that employs Inception-style depthwise convolutions to optimize channel allocation and receptive field coverage.
  • Empirical evaluations on ImageNet-1K show a competitive top-1 accuracy gain (+0.2 pt) over ConvNeXt-T, establishing an economical baseline for efficient architecture design in image classification and segmentation.

InceptionNeXt-Tiny (InceptionNeXt-T) is a convolutional neural network architecture designed for efficient image classification and dense prediction tasks, retaining the macro-structure of ConvNeXt while introducing an Inception-style depthwise convolution to achieve superior computational throughput with competitive accuracy. The central feature is the decomposition of expensive large-kernel depthwise convolutions into four specialized parallel branches with a precise channel allocation, optimizing both memory access patterns and receptive field coverage. The model sets a new baseline for economical architecture design, particularly when maximizing resource efficiency without sacrificing empirical performance (Yu et al., 2023).

1. Network Structure and Stagewise Design

InceptionNeXt-T adheres to a four-stage architecture analogous to ConvNeXt. The pipeline is as follows:

  • Stem: A $4\times4$ convolution, stride 4, embeds the input RGB image into $C_1 = 96$ channels, outputting feature maps of shape $\frac{H}{4}\times\frac{W}{4}\times 96$.
  • Stage 1: Input resolution $\frac{H}{4}\times\frac{W}{4}$, $C_1=96$ channels, 3 blocks. Each block employs an Inception depthwise convolution (IDConv) and an MLP with an expansion ratio of 4, followed by normalization layers (LayerNorm or BatchNorm).
  • Stage 2: Downsampling via $2\times2$ convolution (stride 2) to $C_2=192$ channels, resolution $\frac{H}{8}\times\frac{W}{8}$, 3 blocks (IDConv, MLP ratio 4).
  • Stage 3: Downsample to $C_3=384$ channels, resolution $\frac{H}{16}\times\frac{W}{16}$, 9 blocks (IDConv, MLP ratio 4).
  • Stage 4: Downsample to $C_4=768$ channels, resolution $\frac{H}{32}\times\frac{W}{32}$, 3 blocks (IDConv, MLP ratio 3).
  • Head: Concludes with global average pooling and a linear classifier with 1000 output classes.

This sequence preserves high spatial resolution in early stages, increases abstraction and channel width stagewise, and terminates in standard classification layers.
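
For concreteness, this stagewise layout can be summarized in a small configuration sketch (our own notation, not the authors' reference implementation):

```python
# Illustrative InceptionNeXt-T configuration (our notation, not the
# official implementation). Depths, widths, and MLP ratios follow the text.
INCEPTIONNEXT_T = {
    "stem": {"kernel_size": 4, "stride": 4, "out_channels": 96},
    # (channels, blocks, mlp_ratio) per stage; resolution halves each stage.
    "stages": [
        (96, 3, 4),    # Stage 1: H/4  x W/4
        (192, 3, 4),   # Stage 2: H/8  x W/8
        (384, 9, 4),   # Stage 3: H/16 x W/16
        (768, 3, 3),   # Stage 4: H/32 x W/32
    ],
    "num_classes": 1000,  # head: global average pooling + linear classifier
}
```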

2. Inception Depthwise Convolution (IDConv)

The principal innovation is the Inception depthwise convolution (IDConv), which replaces ConvNeXt's monolithic $k\times k$ depthwise convolutions. The process includes:

  • Branch Decomposition: Given input $X \in \mathbb{R}^{H \times W \times C}$ and branch ratio $r_g=1/8$, the split allocates $g=r_g C$ channels to each of three branches ($C_1, C_2, C_3$), with the remainder to the identity branch ($C_4 = C - 3g$).
  • Branch Operations:
  1. Square kernel: $3\times 3$ DWConv
  2. Horizontal band kernel: $1\times 11$ DWConv
  3. Vertical band kernel: $11\times 1$ DWConv
  4. Identity mapping: passes the remaining channels through unchanged
  • Concatenation: Outputs from all branches are concatenated along the channel dimension, reconstructing the feature tensor $Y \in \mathbb{R}^{H \times W \times C}$.

For InceptionNeXt-T: $C=96$, $g=12$, yielding branch allocations $C_1=C_2=C_3=12$, $C_4=60$; kernel sizes $k_s=3$, $k_b=11$.

This decomposition allows the network to approximate the large receptive field of a $7\times 7$ depthwise convolution through efficient aggregation of spatially diverse filters.
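
A minimal PyTorch sketch of this token mixer is shown below; the class name, argument names, and branch ordering are ours for illustration and may differ from the authors' released code:

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Inception depthwise convolution: an illustrative sketch of the
    four-branch decomposition described in the paper."""
    def __init__(self, channels: int, square_kernel: int = 3,
                 band_kernel: int = 11, branch_ratio: float = 1 / 8):
        super().__init__()
        g = int(channels * branch_ratio)       # channels per conv branch
        self.split_sizes = [g, g, g, channels - 3 * g]
        # Square-kernel depthwise branch (3x3 by default).
        self.dwconv_hw = nn.Conv2d(g, g, square_kernel,
                                   padding=square_kernel // 2, groups=g)
        # Horizontal band depthwise branch (1x11 by default).
        self.dwconv_w = nn.Conv2d(g, g, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=g)
        # Vertical band depthwise branch (11x1 by default).
        self.dwconv_h = nn.Conv2d(g, g, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=g)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_hw, x_w, x_h, x_id = torch.split(x, self.split_sizes, dim=1)
        # Identity branch passes through untouched; concatenate all four.
        return torch.cat((self.dwconv_hw(x_hw), self.dwconv_w(x_w),
                          self.dwconv_h(x_h), x_id), dim=1)

# For InceptionNeXt-T: C = 96 gives g = 12 per conv branch, 60 identity channels.
mixer = InceptionDWConv2d(96)
y = mixer(torch.randn(1, 96, 56, 56))  # shape preserved: (1, 96, 56, 56)
```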

3. Computational Complexity and Throughput

The theoretical and empirical resource demands are outlined below:

| Model | Params (M) | MACs (G) | Train Throughput (img/s) | Inference Throughput (img/s) |
|---|---|---|---|---|
| ConvNeXt-T (k=7) | ~28 | ~4.2 | 575 | 2413 |
| InceptionNeXt-T | ~28 | ~4.2 | 901 (+57%) | 2900 (+20%) |

Throughput was measured on an A100 GPU at full precision with batch size 128. InceptionNeXt-T demonstrates a 57% increase in training throughput and a 20% gain in inference throughput, with parameter count and FLOPs held constant relative to ConvNeXt-T (Yu et al., 2023).
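
For readers who want to sanity-check such numbers on their own hardware, a simple measurement loop along the following lines can be used (our benchmarking sketch, not the authors' script; assumes a CUDA device and an already-constructed `model`):

```python
import time
import torch

def throughput(model: torch.nn.Module, batch_size: int = 128,
               img_size: int = 224, iters: int = 50) -> float:
    """Images/second for full-precision forward passes (illustrative)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    with torch.no_grad():
        for _ in range(10):            # warm-up iterations
            model(x)
        torch.cuda.synchronize()       # wait for queued kernels before timing
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)
```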

4. Empirical Performance and Ablations

On the ImageNet-1K benchmark:

  • Top-1 Accuracy:
    • ConvNeXt-T: 82.1%
    • InceptionNeXt-T: 82.3% (+0.2 pt)

Ablation studies reveal the importance of branch composition:

  • Removing the $1\times 11$ or $11\times 1$ branch drops accuracy to 81.9%.
  • Removing the $3\times 3$ branch yields 82.0% accuracy, trading a marginal loss for a modest speedup.
  • The band-kernel size $k_b=11$ is optimal; increasing or decreasing it degrades accuracy.
  • The branch ratio $r_g=1/8$ is optimal; smaller ratios (e.g., $1/16$) result in losses of more than 2 pts, while larger ratios offer no gain.
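
In terms of the illustrative `InceptionDWConv2d` sketch from Section 2, these ablation variants amount to simple constructor changes (parameter names are ours):

```python
# Baseline InceptionNeXt-T mixer: k_s = 3, k_b = 11, r_g = 1/8.
baseline = InceptionDWConv2d(96, square_kernel=3, band_kernel=11, branch_ratio=1 / 8)

# Smaller branch ratio (1/16): only 6 channels per conv branch, 78 identity.
small_ratio = InceptionDWConv2d(96, branch_ratio=1 / 16)

# Shorter band kernels (k_b = 7): reduced receptive field along each axis.
short_band = InceptionDWConv2d(96, band_kernel=7)
```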

Further, in semantic segmentation tasks (ADE20K with UperNet or FPN), InceptionNeXt-T outperforms ConvNeXt-T by 1–2 mIoU points at slightly lower FLOPs.

5. Training Protocols and Implementation

Training on ImageNet-1K follows the DeiT methodology but omits distillation. Key hyperparameters:

  • Optimizer: AdamW ($\epsilon=1\times 10^{-8}$, $\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.05)
  • Learning Rate: $0.001 \times (\text{batch}/1024)$; e.g., batch size 4096 $\implies$ LR $= 0.004$
  • LR Schedule: Cosine decay, 20 warmup epochs
  • Epochs: 300, stochastic depth peak rate 0.1 for Tiny
  • Data Augmentation: Random-resized crop, horizontal flip, RandAugment(9, 0.5), Mixup ($\alpha=0.8$), CutMix (1.0), Random Erasing ($p=0.25$), color jitter
  • Regularization: Label smoothing 0.1, LayerScale initialization $1\times 10^{-6}$, head dropout 0.0

Fine-tuning at resolution $384 \times 384$ uses 30 epochs, batch size 1024, LR $= 5 \times 10^{-5}$, head dropout 0.5, and EMA decay 0.9999.
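
A hedged sketch of the pre-training optimizer and schedule in plain PyTorch follows (the paper's actual recipe uses a timm-style pipeline; function names and per-epoch stepping here are ours):

```python
import torch

def build_optimizer_and_schedule(model: torch.nn.Module, batch_size: int = 4096,
                                 epochs: int = 300, warmup_epochs: int = 20):
    # Linear LR scaling rule from the paper: 0.001 * (batch / 1024).
    lr = 1e-3 * batch_size / 1024  # batch 4096 -> LR 0.004
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps=1e-8,
                                  betas=(0.9, 0.999), weight_decay=0.05)
    # Linear warmup for 20 epochs, then cosine decay (real recipes typically
    # step per iteration rather than per epoch).
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs - warmup_epochs)
    schedule = torch.optim.lr_scheduler.SequentialLR(
        optimizer, [warmup, cosine], milestones=[warmup_epochs])
    return optimizer, schedule
```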

6. Practical Implications and Recommendations

Decomposing the $k\times k$ single-path depthwise convolution into four efficient branches achieves an effective receptive field and accuracy similar to ConvNeXt while providing marked improvements in speed. Although the inference speedup is less pronounced than the training speedup, additional gains may be possible through further kernel fusion or custom CUDA kernels.

The authors of InceptionNeXt suggest using this design as an "economical baseline" for future architecture development, with an explicit recommendation to reduce the GPU hours, and thereby the carbon footprint, of large-scale model research. The approach generalizes to dense prediction tasks, with demonstrated improvements in segmentation.

In summary, InceptionNeXt-T retains the ConvNeXt four-stage backbone and MLP-block configuration, substituting the $7\times7$ depthwise convolution with an Inception-style multi-branch token mixer. This yields a modest accuracy improvement, a substantial increase in training speed, and enhanced inference throughput without an increase in model footprint (Yu et al., 2023).

References

Yu, W., Zhou, P., Yan, S., & Wang, X. (2023). InceptionNeXt: When Inception Meets ConvNeXt. arXiv:2303.16900.
