InceptionNeXt-Tiny: Efficient CNN Design

  • The paper introduces InceptionNeXt-Tiny, which decomposes large-kernel depthwise convolutions into four specialized branches, enhancing training (+57%) and inference (+20%) speeds.
  • InceptionNeXt-Tiny is a convolutional neural network with a four-stage design that employs Inception-style depthwise convolutions to optimize channel allocation and receptive field coverage.
  • Empirical evaluations on ImageNet-1K show a competitive top-1 accuracy gain (+0.2 pt) over ConvNeXt-T, establishing an economical baseline for efficient architecture design in image classification and segmentation.

InceptionNeXt-Tiny (InceptionNeXt-T) is a convolutional neural network architecture designed for efficient image classification and dense prediction tasks, retaining the macro-structure of ConvNeXt while introducing an Inception-style depthwise convolution to achieve superior computational throughput with competitive accuracy. The central feature is the decomposition of expensive large-kernel depthwise convolutions into four specialized parallel branches with a precise channel allocation, optimizing both memory access patterns and receptive field coverage. The model sets a new baseline for economical architecture design, particularly when maximizing resource efficiency without sacrificing empirical performance (Yu et al., 2023).

1. Network Structure and Stagewise Design

InceptionNeXt-T adheres to a four-stage architecture analogous to ConvNeXt. The pipeline is as follows:

  • Stem: A $4\times4$ convolution, stride 4, embeds the input RGB image into $C_1 = 96$ channels, outputting feature maps of shape $\frac{H}{4}\times\frac{W}{4}\times 96$.
  • Stage 1: Input resolution $\frac{H}{4}\times\frac{W}{4}$, $C_1=96$ channels, 3 blocks. Each block employs an Inception depthwise convolution (IDConv) and an MLP with an expansion ratio of 4, followed by normalization layers (LayerNorm or BatchNorm).
  • Stage 2: Downsampling via $2\times2$ convolution (stride 2) to $C_2=192$ channels, resolution $\frac{H}{8}\times\frac{W}{8}$, 3 blocks (IDConv, MLP ratio 4).
  • Stage 3: Downsample to $C_3=384$ channels, resolution $\frac{H}{16}\times\frac{W}{16}$, 9 blocks (IDConv, MLP ratio 4).
  • Stage 4: Downsample to $C_4=768$ channels, resolution $\frac{H}{32}\times\frac{W}{32}$, 3 blocks (IDConv, MLP ratio 3).
  • Head: Concludes with global average pooling and a linear classifier with 1000 output classes.

This sequence preserves high spatial resolution in early stages, increases abstraction and channel width stagewise, and terminates in standard classification layers.
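
For concreteness, this stagewise layout can be summarized in a small configuration sketch (our own notation, not the authors' reference implementation):

```python
# Illustrative InceptionNeXt-T configuration (our notation, not the
# official implementation). Depths, widths, and MLP ratios follow the text.
INCEPTIONNEXT_T = {
    "stem": {"kernel_size": 4, "stride": 4, "out_channels": 96},
    # (channels, blocks, mlp_ratio) per stage; resolution halves each stage.
    "stages": [
        (96, 3, 4),    # Stage 1: H/4  x W/4
        (192, 3, 4),   # Stage 2: H/8  x W/8
        (384, 9, 4),   # Stage 3: H/16 x W/16
        (768, 3, 3),   # Stage 4: H/32 x W/32
    ],
    "num_classes": 1000,  # head: global average pooling + linear classifier
}
```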

2. Inception Depthwise Convolution (IDConv)

The principal innovation is the Inception depthwise convolution (IDConv), which replaces ConvNeXt's monolithic $k\times k$ depthwise convolutions. The process includes:

  • Branch Decomposition: Given input $X \in \mathbb{R}^{H \times W \times C}$ and branch ratio $r_g=1/8$, the split allocates $g=r_g C$ channels to each of three branches ($C_1, C_2, C_3$), with the remainder to the identity branch ($C_4 = C - 3g$).
  • Branch Operations:
  1. Square kernel: $3\times 3$ DWConv
  2. Horizontal band kernel: $1\times 11$ DWConv
  3. Vertical band kernel: $11\times 1$ DWConv
  4. Identity mapping: passes the remaining channels through unchanged
  • Concatenation: Outputs from all branches are concatenated along the channel dimension, reconstructing the feature tensor $Y \in \mathbb{R}^{H \times W \times C}$.

For InceptionNeXt-T: $C=96$, $g=12$, yielding branch allocations $C_1=C_2=C_3=12$, $C_4=60$; kernel sizes $k_s=3$, $k_b=11$.

This decomposition allows the network to approximate the large receptive field of a $7\times 7$ depthwise convolution through efficient aggregation of spatially diverse filters.
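
A minimal PyTorch sketch of this token mixer is shown below; the class name, argument names, and branch ordering are ours for illustration and may differ from the authors' released code:

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Inception depthwise convolution: an illustrative sketch of the
    four-branch decomposition described in the paper."""
    def __init__(self, channels: int, square_kernel: int = 3,
                 band_kernel: int = 11, branch_ratio: float = 1 / 8):
        super().__init__()
        g = int(channels * branch_ratio)       # channels per conv branch
        self.split_sizes = [g, g, g, channels - 3 * g]
        # Square-kernel depthwise branch (3x3 by default).
        self.dwconv_hw = nn.Conv2d(g, g, square_kernel,
                                   padding=square_kernel // 2, groups=g)
        # Horizontal band depthwise branch (1x11 by default).
        self.dwconv_w = nn.Conv2d(g, g, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=g)
        # Vertical band depthwise branch (11x1 by default).
        self.dwconv_h = nn.Conv2d(g, g, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=g)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_hw, x_w, x_h, x_id = torch.split(x, self.split_sizes, dim=1)
        # Identity branch passes through untouched; concatenate all four.
        return torch.cat((self.dwconv_hw(x_hw), self.dwconv_w(x_w),
                          self.dwconv_h(x_h), x_id), dim=1)

# For InceptionNeXt-T: C = 96 gives g = 12 per conv branch, 60 identity channels.
mixer = InceptionDWConv2d(96)
y = mixer(torch.randn(1, 96, 56, 56))  # shape preserved: (1, 96, 56, 56)
```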

3. Computational Complexity and Throughput

The theoretical and empirical resource demands are outlined below:

| Model | Params (M) | MACs (G) | Train Throughput (img/s) | Inference Throughput (img/s) |
|---|---|---|---|---|
| ConvNeXt-T (k=7) | ~28 | ~4.2 | 575 | 2413 |
| InceptionNeXt-T | ~28 | ~4.2 | 901 (+57%) | 2900 (+20%) |

Throughput was measured on an A100 GPU at full precision with batch size 128. InceptionNeXt-T demonstrates a 57% increase in training throughput and a 20% gain in inference throughput, with parameter count and FLOPs held constant relative to ConvNeXt-T (Yu et al., 2023).
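
For readers who want to sanity-check such numbers on their own hardware, a simple measurement loop along the following lines can be used (our benchmarking sketch, not the authors' script; assumes a CUDA device and an already-constructed `model`):

```python
import time
import torch

def throughput(model: torch.nn.Module, batch_size: int = 128,
               img_size: int = 224, iters: int = 50) -> float:
    """Images/second for full-precision forward passes (illustrative)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    with torch.no_grad():
        for _ in range(10):            # warm-up iterations
            model(x)
        torch.cuda.synchronize()       # wait for queued kernels before timing
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)
```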

4. Empirical Performance and Ablations

On the ImageNet-1K benchmark:

  • Top-1 Accuracy:
    • ConvNeXt-T: 82.1%
    • InceptionNeXt-T: 82.3% (+0.2 pt)

Ablation studies reveal the importance of branch composition:

  • Removing the $1\times 11$ or $11\times 1$ branch drops accuracy to 81.9%.
  • Removing the $3\times 3$ branch yields 82.0% accuracy, trading a marginal loss for a modest speedup.
  • The band-kernel size $k_b=11$ is optimal; increasing or decreasing it degrades accuracy.
  • The branch ratio $r_g=1/8$ is optimal; smaller ratios (e.g., $1/16$) result in losses of more than 2 pts, while larger ratios offer no gain.
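
In terms of the illustrative `InceptionDWConv2d` sketch from Section 2, these ablation variants amount to simple constructor changes (parameter names are ours):

```python
# Baseline InceptionNeXt-T mixer: k_s = 3, k_b = 11, r_g = 1/8.
baseline = InceptionDWConv2d(96, square_kernel=3, band_kernel=11, branch_ratio=1 / 8)

# Smaller branch ratio (1/16): only 6 channels per conv branch, 78 identity.
small_ratio = InceptionDWConv2d(96, branch_ratio=1 / 16)

# Shorter band kernels (k_b = 7): reduced receptive field along each axis.
short_band = InceptionDWConv2d(96, band_kernel=7)
```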

Further, in semantic segmentation tasks (ADE20K with UperNet or FPN), InceptionNeXt-T outperforms ConvNeXt-T by 1–2 mIoU points at slightly lower FLOPs.

5. Training Protocols and Implementation

Training on ImageNet-1K follows the DeiT methodology but omits distillation. Key hyperparameters:

  • Optimizer: AdamW ($\epsilon=1\times 10^{-8}$, $\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.05)
  • Learning Rate: $0.001 \times (\text{batch}/1024)$; e.g., batch size 4096 $\implies$ LR $= 0.004$
  • LR Schedule: Cosine decay, 20 warmup epochs
  • Epochs: 300, stochastic depth peak rate 0.1 for Tiny
  • Data Augmentation: Random-resized crop, horizontal flip, RandAugment(9, 0.5), Mixup ($\alpha=0.8$), CutMix (1.0), Random Erasing ($p=0.25$), color jitter
  • Regularization: Label smoothing 0.1, LayerScale initialization $1\times 10^{-6}$, head dropout 0.0

Fine-tuning at resolution $384 \times 384$ uses 30 epochs, batch size 1024, LR $= 5 \times 10^{-5}$, head dropout 0.5, and EMA decay 0.9999.
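
A hedged sketch of the pre-training optimizer and schedule in plain PyTorch follows (the paper's actual recipe uses a timm-style pipeline; function names and per-epoch stepping here are ours):

```python
import torch

def build_optimizer_and_schedule(model: torch.nn.Module, batch_size: int = 4096,
                                 epochs: int = 300, warmup_epochs: int = 20):
    # Linear LR scaling rule from the paper: 0.001 * (batch / 1024).
    lr = 1e-3 * batch_size / 1024  # batch 4096 -> LR 0.004
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps=1e-8,
                                  betas=(0.9, 0.999), weight_decay=0.05)
    # Linear warmup for 20 epochs, then cosine decay (real recipes typically
    # step per iteration rather than per epoch).
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs - warmup_epochs)
    schedule = torch.optim.lr_scheduler.SequentialLR(
        optimizer, [warmup, cosine], milestones=[warmup_epochs])
    return optimizer, schedule
```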

6. Practical Implications and Recommendations

Decomposing the $k\times k$ single-path depthwise convolution into four efficient branches achieves an effective receptive field and accuracy similar to ConvNeXt while providing marked improvements in speed. Although the inference speedup is less pronounced than the training speedup, additional gains may be possible through further kernel fusion or custom CUDA kernels.

The authors of InceptionNeXt suggest using this design as an "economical baseline" for future architecture development, with an explicit recommendation to reduce the GPU hours, and thereby the carbon footprint, of large-scale model research. The approach generalizes to dense prediction tasks, with demonstrated improvements in segmentation.

In summary, InceptionNeXt-T retains the ConvNeXt four-stage backbone and MLP-block configuration, substituting the $7\times7$ depthwise convolution with an Inception-style multi-branch token mixer. This yields a modest accuracy improvement, a substantial increase in training speed, and enhanced inference throughput without an increase in model footprint (Yu et al., 2023).

References

Yu, W., Zhou, P., Yan, S., & Wang, X. (2023). InceptionNeXt: When Inception Meets ConvNeXt. arXiv:2303.16900.
