
ConvNeXt-Tiny: Efficient CNN Architecture

Updated 30 November 2025
  • ConvNeXt-Tiny is a compact deep convolutional network that integrates transformer-inspired design with large-kernel depthwise convolutions and hierarchical multi-stage processing.
  • It employs advanced feature fusion through dual pooling and lightweight squeeze-and-excitation blocks to optimize image representation while minimizing computational cost.
  • Empirical evaluations show its effectiveness in domains like medical imaging and IoT, balancing high accuracy with reduced latency and resource demands.

ConvNeXt-Tiny is a compact deep convolutional neural network architecture belonging to the ConvNeXt family, designed to deliver high accuracy while maintaining computational and memory efficiency. Developed following the design philosophy of Swin Transformers and modern vision transformers, ConvNeXt-Tiny adapts successful architectural components from both convolutional networks and transformer models for image classification and related perception tasks. It features hierarchical multi-stage processing, large-kernel depthwise convolutions, inverted bottlenecks, and normalization strategies that collectively enlarge the effective receptive field and improve training stability and generalization. The tiny variant (ConvNeXt-Tiny) is optimized for deployment in resource-constrained environments, including edge devices and scenarios requiring low inference latency or limited training hardware.

1. Architecture and Block Design

ConvNeXt-Tiny employs a canonical four-stage layout, mirroring designs such as ResNet and Swin Transformer, but updates the block composition for enhanced efficiency and representational power (Yu et al., 2023). Each stage contains multiple residual blocks whose structure is as follows (Xia et al., 15 Aug 2025, Xiao et al., 1 Sep 2025, Roshanzadeh et al., 7 Sep 2025); a minimal code sketch follows the list:

  • Input: $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the channel count.
  • Depthwise Convolution: $7 \times 7$ kernel (DW-Conv) with padding of 3 and groups $= C$, enabling large-receptive-field context at minimal computational overhead.
  • Pre-normalization: LayerNorm applied over channels last, placed between the depthwise convolution and the MLP sub-block (no normalization after the residual sum).
  • MLP-like Sub-block:
    • First pointwise $(1 \times 1)$ convolution expands channels from $C$ to $4C$.
    • GELU non-linearity.
    • Second pointwise $(1 \times 1)$ convolution projects back from $4C$ to $C$.
  • Residual Connection: The output is added to the block input; post-activation is omitted due to pre-norm.
  • Block Depths: Standard configuration of 3, 3, 9, 3 blocks in stages 1–4, with channel widths 96, 192, 384, and 768, respectively.
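
To make the block concrete, the following is a minimal PyTorch sketch of a single block as described above; class and variable names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """One residual block: 7x7 depthwise conv -> LayerNorm -> 1x1 expand -> GELU -> 1x1 project."""
    def __init__(self, dim: int):
        super().__init__()
        # Depthwise 7x7 convolution: groups=dim gives one filter per channel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # LayerNorm over the channel dimension (channels-last layout).
        self.norm = nn.LayerNorm(dim)
        # Pointwise (1x1) convolutions expressed as Linear layers in channels-last layout.
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # expand C -> 4C
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project 4C -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return shortcut + x        # residual add; no post-activation (pre-norm design)

# Tiny variant stage configuration: depths (3, 3, 9, 3), widths (96, 192, 384, 768).
block = ConvNeXtBlock(dim=96)
y = block(torch.randn(1, 96, 56, 56))  # shape preserved: (1, 96, 56, 56)
```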

The overall model contains $\sim 28$M parameters and entails $\sim 4.5$ GFLOPs for $224 \times 224$ images (Yu et al., 2023).

2. Computational Characteristics and Efficiency

ConvNeXt-Tiny’s use of large-kernel depthwise convolutions results in low FLOPs relative to conventional convolutions but incurs substantial memory access costs on modern hardware, particularly GPUs. For a layer with kernel size $k = 7$, the FLOPs per depthwise convolution are:

$$\mathrm{FLOPs}_{\mathrm{DW}} = 2 \cdot k^2 \cdot C \cdot H \cdot W$$
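
As a worked example, assume a $224 \times 224$ input, so stage-1 feature maps are $56 \times 56$ with $C = 96$:

$$\mathrm{FLOPs}_{\mathrm{DW}} = 2 \cdot 7^2 \cdot 96 \cdot 56 \cdot 56 \approx 2.95 \times 10^7$$

That is roughly 29.5 MFLOPs per depthwise layer, a small share of the model's total $\sim 4.5$ GFLOPs, consistent with the memory-bandwidth bottleneck discussed next.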

Despite similar FLOPs to ResNet-50 (4.1 GFLOPs), ConvNeXt-Tiny operates at only about $60\%$ of ResNet-50’s training throughput on A100 GPUs (575 vs. 969 images/s) (Yu et al., 2023). This indicates throughput is bottlenecked by memory bandwidth rather than arithmetic operations, a critical consideration for model deployment and training efficiency.

Recent research proposes alternatives such as InceptionNeXt-Tiny, where the $(7 \times 7)$ depthwise convolution is decomposed into four parallel branches, namely $(3 \times 3)$, $(1 \times 11)$, and $(11 \times 1)$ convolutions plus an identity mapping, yielding a $1.6\times$ increase in training throughput and a $0.2\%$ increase in top-1 ImageNet accuracy (Yu et al., 2023); a code sketch follows the table below.

| Model | Params (M) | FLOPs (G) | Train img/s | Top-1 Acc (%) |
|---|---|---|---|---|
| ConvNeXt-Tiny | 29 | 4.5 | 575 | 82.1 |
| InceptionNeXt-T | 28 | 4.2 | 901 | 82.3 |
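
A minimal PyTorch sketch of this Inception-style decomposition follows; the channel split (each convolution branch receives a `branch_ratio` fraction of channels, with the remainder passed through as identity) and the parameter defaults are assumptions for illustration, not the exact InceptionNeXt configuration:

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Decomposes a 7x7 depthwise conv into 3x3, 1x11, 11x1, and identity branches."""
    def __init__(self, dim: int, branch_ratio: float = 0.125, band_kernel: int = 11):
        super().__init__()
        gc = int(dim * branch_ratio)  # channels routed to each convolution branch
        self.dwconv_hw = nn.Conv2d(gc, gc, 3, padding=1, groups=gc)  # 3x3 square kernel
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=gc)  # 1x11 band
        self.dwconv_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=gc)  # 11x1 band
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)  # identity branch takes the rest

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)), dim=1)

m = InceptionDWConv2d(dim=96)
y = m(torch.randn(1, 96, 56, 56))  # shape preserved: (1, 96, 56, 56)
```

The band kernels retain a large receptive field at lower memory cost, while the identity branch skips computation entirely for most channels.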

3. Feature Fusion and Attention Mechanisms

Building on the standard ConvNeXt-Tiny backbone, improved variants integrate advanced feature fusion and channel attention strategies to boost expressiveness under hardware constraints (Xia et al., 15 Aug 2025):

  • Dual Global Pooling Feature Fusion: After the final ConvNeXt-Tiny backbone layer, both Global Average Pooling (GAP) and Global Max Pooling (GMP) operations extract complementary statistics:

$$v_{\mathrm{avg}} = \mathrm{GAP}(F), \quad v_{\mathrm{max}} = \mathrm{GMP}(F)$$

These are concatenated:

$$v_{\mathrm{fused}} = [v_{\mathrm{avg}} ; v_{\mathrm{max}}] \in \mathbb{R}^{2C}$$

This fusion preserves both globally averaged and maximally salient channel-wise statistics.

  • Squeeze-and-Excitation Vector (SEVector):

SEVector adapts the classical SE block for a lighter footprint by squeezing the $2C$ fused channels to a dimensionality $D = \max(8, \lfloor 2C/16 \rfloor)$, then exciting back to $2C$, markedly reducing parameter overhead compared to a standard SE block:

$$\omega = \sigma\!\left( W_2 \cdot \mathrm{ReLU}(W_1 v_{\mathrm{fused}}) \right)$$

$$v_{\mathrm{att}} = v_{\mathrm{fused}} \odot \omega$$

For typical $C = 96$, both SEVector and standard SE require $\approx 4608$ parameters per block; however, SEVector restricts how the parameter count scales with $C$, preserving efficiency in deeper models. A combined sketch of the fusion and gating steps follows.
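
The following is a minimal PyTorch sketch combining the dual pooling fusion and the SEVector gate per the equations above. Bias-free linear layers are an assumption chosen to match the quoted $\approx 4608$ weights at $C = 96$ ($2 \cdot 192 \cdot 12$); whether the original uses biases is not specified:

```python
import torch
import torch.nn as nn

class DualPoolSEVector(nn.Module):
    """GAP + GMP feature fusion followed by a lightweight SE-style channel gate."""
    def __init__(self, channels: int):
        super().__init__()
        fused = 2 * channels            # concatenated GAP/GMP statistics
        d = max(8, fused // 16)         # squeezed dimensionality D = max(8, floor(2C/16))
        self.fc1 = nn.Linear(fused, d, bias=False)  # squeeze: 2C -> D
        self.fc2 = nn.Linear(d, fused, bias=False)  # excite: D -> 2C

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, H, W) output of the final backbone stage
        v_avg = feat.mean(dim=(2, 3))               # GAP -> (N, C)
        v_max = feat.amax(dim=(2, 3))               # GMP -> (N, C)
        v_fused = torch.cat([v_avg, v_max], dim=1)  # (N, 2C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(v_fused))))  # channel weights
        return v_fused * w                          # v_att = v_fused * omega

head = DualPoolSEVector(channels=768)   # final ConvNeXt-Tiny stage width
desc = head(torch.randn(2, 768, 7, 7))  # -> (2, 1536) attended descriptor
```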

4. Loss Function Innovations and Optimization

In addition to cross-entropy, specialized regularization terms have been introduced to improve intra-class compactness and control variance in feature representations (Xia et al., 15 Aug 2025):

  • Feature Smoothing Loss:

For each class $c$ in a mini-batch with $N_c$ samples, define the batch center:

$$\bar{f}_c = \frac{1}{N_c} \sum_{i=1}^{N_c} f_{c,i}$$

The feature smoothing loss penalizes deviation from the batch center (here $C$ denotes the number of classes):

$$L_{\mathrm{fs}} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \lVert f_{c,i} - \bar{f}_c \rVert^2$$

Total training loss is:

$$L = L_{\mathrm{CE}} + \lambda_{\mathrm{fs}} L_{\mathrm{fs}}, \quad \lambda_{\mathrm{fs}} = 0.05$$

This term dynamically shrinks intra-class feature clusters toward their batch centers, suppressing variance; a minimal sketch follows.
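
A minimal PyTorch sketch of this term, assuming batch centers are computed on the fly from the mini-batch (the function name is illustrative):

```python
import torch

def feature_smoothing_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_fs: mean squared distance of each feature to its class's batch center.

    features: (N, d) feature vectors; labels: (N,) integer class labels.
    Only classes present in the batch contribute; singleton classes add zero.
    """
    per_class = []
    for c in labels.unique():
        f_c = features[labels == c]             # (N_c, d) features of class c
        center = f_c.mean(dim=0, keepdim=True)  # batch center for class c
        per_class.append(((f_c - center) ** 2).sum(dim=1).mean())
    return torch.stack(per_class).mean()        # average over classes present

# Total loss: L = L_CE + 0.05 * L_fs
# loss = cross_entropy + 0.05 * feature_smoothing_loss(feats, labels)
```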

Other reported loss functions include hybrid combinations of focal loss and contrastive loss for tasks with class imbalance and hard negatives, specifically in mitosis detection pipelines (Xiao et al., 1 Sep 2025).

5. Application Domains and Empirical Performance

ConvNeXt-Tiny has demonstrated versatile applicability, notably in medical image analysis and network intrusion detection, often as part of hybrid or two-stage frameworks:

  • Medical Image Classification: An improved ConvNeXt-Tiny backbone augmented with dual pooling and SEVector achieved $89.09\%$ accuracy on an Alzheimer’s MRI 4-class test set, outperforming vanilla ConvNeXt-Tiny ($82.42\%$) and baseline CNNs (Xia et al., 15 Aug 2025).
  • Two-Stage Mitosis Detection: Serving as a candidate filter after YOLO11x proposal generation, ConvNeXt-Tiny delivered an F1-score improvement from $0.847$ to $0.882$, with precision increasing from $0.762$ to $0.839$ (Xiao et al., 1 Sep 2025).
  • IoT Intrusion Detection: Integration into a hybrid 1D CNN pipeline resulted in $99.63\%$ accuracy and a $0.0107$ error rate on eight-class network traffic datasets, achieving rapid inference ($\sim 5.52$ ms per batch) suitable for resource-constrained IoT nodes (Roshanzadeh et al., 7 Sep 2025).

6. Deployment Characteristics and Suitability

ConvNeXt-Tiny's micro-architectural design, parameter economy, and efficient loss formulation make it ideal for edge deployment and real-time systems:

  • Low Memory and Latency: Typical inference latencies are $\sim 0.08$ s/image on CPUs (medical imaging) (Xia et al., 15 Aug 2025) and $< 6$ ms per batch (IoT) (Roshanzadeh et al., 7 Sep 2025), supporting real-time or near-real-time operation.
  • Small Model Footprint: At $\sim 28$M parameters and $< 5$ GFLOPs, ConvNeXt-Tiny is considerably smaller than transformer-based models of comparable accuracy.
  • Adaptability: The architecture can be adapted to 1D sequential inputs (network traffic), 2D medical images, or downstream classification tasks with varied data augmentation and regularization schemes.

7. Comparative Analysis and Limitations

While ConvNeXt-Tiny achieves high accuracy and efficiency, its design tradeoffs include memory-access bottlenecks for large-kernel depthwise convolutions during GPU training (Yu et al., 2023). Attempts to mitigate this, such as Inception-style decomposition, yield substantial throughput gains and marginal accuracy improvements. This suggests that future architecture variants may exploit kernel decomposition and parallel branches to optimize for speed and energy efficiency, particularly as deployment targets shift towards green AI and low-carbon-footprint model training.

A plausible implication is that ConvNeXt-Tiny and its descendants represent an evolving subfamily of neural architectures balancing transformer-like design rules, convolutional efficiency, and practical deployment constraints. Researchers continue to augment the core blocks with attention, pooling, and loss innovations to further reduce complexity and improve real-world performance.
