MBInception: Efficient Multi-Block Inception CNN
- MBInception is a CNN architecture that employs stacked inception modules to efficiently extract multi-scale features, demonstrated on datasets like CIFAR-10 and MNIST.
- The design integrates modular inception blocks with parallel convolution branches, batch normalization, and dropout to maintain parameter efficiency while achieving competitive accuracy.
- Empirical evaluations show MBInception achieves comparable or superior accuracy and F1 scores relative to models like VGG16 and ResNet50, while using fewer parameters than ResNet50.
MBInception is a convolutional neural network (CNN) architecture for efficient image classification, introduced as a multi-block inception model that stacks inception-style modules to extract multi-scale features. It is designed to improve parameter efficiency and image processing performance, and it is compared systematically against widely used architectures such as Visual Geometry Group (VGG), Residual Network (ResNet), and MobileNet. Evaluated on four canonical datasets (CIFAR-10, CIFAR-100, MNIST, and Fashion-MNIST), MBInception demonstrates superior or competitive accuracy and F1 scores while employing fewer parameters than deeper models such as ResNet50 (Froughirad et al., 2024).
1. Architectural Design and Block Structure
MBInception's architecture is grounded in the Inception family of networks but introduces a methodical stacking of four main blocks, each built from two consecutive "Inception Modules" followed by a 3×3 convolution, with incrementally increased channel widths. The architectural flow is as follows:
- Input: 32×32×3 image tensor.
- Stem Layer:
- 7×7 2D convolution (n filters, where n is a design hyperparameter),
- Batch normalization,
- ReLU activation,
- 3×3 max pooling (stride 2, padding 1).
- Four Main Blocks: Each block (with $m$ filters per convolution, $m$ increasing from block to block) includes:
- 1. Inception Module A,
- 2. Inception Module B,
- 3. Concatenation of the Module B output with the original block input along the channel axis → ReLU,
- 4. 3×3 convolution → BatchNorm → ReLU.
- Classifier Head:
- Flatten,
- Dropout,
- Dense layer with one unit per class,
- Softmax.
The primary reference does not fully specify the branch configuration of the inception modules. It describes them as "m-filter" modules, which suggests parallel 1×1, 3×3, and possibly 5×5 convolutions as in classical GoogLeNet, possibly augmented with a max-pooling branch, with each branch producing $m$ output channels.
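The spatial sizes implied by the stem can be traced with simple shape arithmetic. The following is a minimal sketch rather than the authors' implementation: the stem convolution's stride (1) and padding (3), the illustrative width `n = 16`, and the helper names `conv_out` and `trace_stem` are assumptions, since the source gives only kernel sizes and the pooling stride and padding.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_stem(size=32, n=16):
    """Trace (height, width, channels) through the stem for a square RGB input.

    ASSUMPTION: the 7x7 stem convolution uses stride 1 and padding 3; the
    source does not state these. n = 16 is an arbitrary illustrative width.
    """
    shapes = [("input", (size, size, 3))]
    size = conv_out(size, kernel=7, stride=1, padding=3)   # assumed size-preserving conv
    shapes.append(("conv7x7+BN+ReLU", (size, size, n)))
    size = conv_out(size, kernel=3, stride=2, padding=1)   # stride and padding from the source
    shapes.append(("maxpool3x3", (size, size, n)))
    return shapes


for name, shape in trace_stem():
    print(f"{name:>16}: {shape}")
```

Under these assumptions the max-pooling layer halves the 32×32 input to 16×16 before the first main block; a different stem stride would change every downstream shape.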
2. Mathematical Formulation
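The network's computation relies primarily on multi-branch convolutions and channel-wise concatenation. For an input tensor $X \in \mathbb{R}^{H \times W \times C_{\text{in}}}$, a $k \times k$ convolution with $C_{\text{out}}$ output channels computes the $j$th output channel as:
$$Y_j = \sum_{c=1}^{C_{\text{in}}} W_{j,c} * X_c + b_j$$
for $j = 1, \dots, C_{\text{out}}$, where $*$ denotes 2D spatial convolution.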
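A single output channel of a convolution is a sum of per-input-channel 2D filter responses plus a bias. The sketch below works this through on a tiny example in plain Python. The function name `conv2d_channel`, the "valid" padding, and the stride of 1 are illustrative assumptions; as in deep learning frameworks, the kernel is slid without flipping (cross-correlation).

```python
def conv2d_channel(x, w_j, b_j=0.0):
    """Compute one output channel: sum over input channels of (kernel applied to channel) + bias.

    x   : list of C_in input channels, each an H x W nested list
    w_j : list of C_in kernels, each k x k, belonging to output channel j
    Uses "valid" padding and stride 1; the kernel is not flipped.
    """
    k = len(w_j[0])
    h, w = len(x[0]), len(x[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    y = [[b_j] * out_w for _ in range(out_h)]
    for x_c, w_c in zip(x, w_j):              # sum over input channels c
        for r in range(out_h):
            for s in range(out_w):
                y[r][s] += sum(
                    w_c[u][v] * x_c[r + u][s + v]
                    for u in range(k) for v in range(k)
                )
    return y


# Two 3x3 input channels, one 2x2 kernel per channel.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
w_j = [
    [[1, 0], [0, 1]],   # adds the top-left and bottom-right entries of each window
    [[1, 1], [1, 1]],   # sums every entry of each window
]
print(conv2d_channel(x, w_j, b_j=0.5))
```

A full layer repeats this computation once per output channel, each with its own set of kernels and bias.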
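Within an Inception module featuring $B$ parallel branches $f_1, \dots, f_B$, each yielding $c_b$ channels,
$$Y = \mathrm{Concat}\big(f_1(X), \dots, f_B(X)\big), \qquad C_{\text{out}} = \sum_{b=1}^{B} c_b.$$
If $c_b = m$ for each branch and four branches are used, the output channel count increases to $4m$ (before any optional 1×1 projections).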
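Parameter counts per layer, for $C_{\text{in}}$ input and $C_{\text{out}}$ output channels (weights only):
- 1×1 convolution: $C_{\text{in}} C_{\text{out}}$,
- 3×3 convolution: $9\,C_{\text{in}} C_{\text{out}}$,
- 5×5 convolution: $25\,C_{\text{in}} C_{\text{out}}$.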
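For a main block comprising two "m-filter" Inception modules plus one 1×1 and one 3×3 convolution, the parameter budget is:
$$P_{\text{block}} = 2\,P_{\text{Inception}}(m) + P_{1\times 1} + P_{3\times 3},$$
summed over the four main blocks and augmented by the stem convolution.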
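The per-block budget can be tallied with a short calculator. This is a sketch under stated assumptions, not the paper's accounting: the four-branch module layout (1×1, 3×3, 5×5, and pooling followed by 1×1), the channel flow through the skip concatenation, the per-block widths, and all function names are hypothetical, and biases and batch-normalization parameters are ignored. The totals it prints are illustrative and are not meant to reproduce the model's reported size.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (biases omitted)."""
    return kernel * kernel * c_in * c_out


def inception_params(c_in, m):
    """ASSUMED module: parallel 1x1, 3x3, 5x5 and pool->1x1 branches, m channels each."""
    branch_kernels = (1, 3, 5, 1)   # the pooling branch only pays for its 1x1 conv
    params = sum(conv_params(k, c_in, m) for k in branch_kernels)
    return params, len(branch_kernels) * m   # (weights, concatenated output channels)


def block_params(c_in, m):
    """Two assumed Inception modules, skip concatenation, then 1x1 and 3x3 convs."""
    p_a, c_a = inception_params(c_in, m)
    p_b, c_b = inception_params(c_a, m)
    c_cat = c_b + c_in                       # concatenate with the block input
    p_1x1 = conv_params(1, c_cat, m)
    p_3x3 = conv_params(3, m, m)
    return p_a + p_b + p_1x1 + p_3x3, m      # (weights, block output channels)


total, channels = 0, 16                      # hypothetical stem width of 16
for m in (16, 32, 64, 128):                  # hypothetical per-block widths
    p, channels = block_params(channels, m)
    total += p
    print(f"m={m:>3}: {p:,} weights")
print(f"blocks total: {total:,}")
```

Because the 5×5 branch costs 25 weights per input-output channel pair, it dominates each module in this layout, which is why Inception variants often place a 1×1 reduction in front of it.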
3. Training Protocols and Dataset Handling
MBInception has been comprehensively benchmarked on:
- CIFAR-10: 32×32×3 color images, 60,000 in total, 10 classes.
- CIFAR-100: 32×32×3 color images, 60,000 in total, 100 classes.
- MNIST: 28×28 grayscale, resized to 32×32×3 through channel stack.
- Fashion-MNIST: Same preprocessing as MNIST.
Across all datasets, preprocessing comprises resizing to 32×32, grayscale-to-RGB conversion by channel duplication (needed only for MNIST and Fashion-MNIST), and pixel normalization to the [0, 1] interval.
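The grayscale pipeline can be sketched in plain Python as follows. The resizing method is an assumption (simple index mapping in the nearest-neighbour style, since the source does not name an interpolation scheme), as is the [0, 1] target range obtained by dividing by 255; `to_rgb_32` is a hypothetical helper name.

```python
def to_rgb_32(gray, size=32):
    """Resize an H x W grayscale image (0-255 ints) to size x size x 3 floats in [0, 1].

    ASSUMPTION: nearest-neighbour style index mapping; the source does not
    name the interpolation method.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for r in range(size):
        src_r = r * h // size                 # source row for this output row
        row = []
        for c in range(size):
            p = gray[src_r][c * w // size] / 255.0   # normalise to [0, 1]
            row.append([p, p, p])             # duplicate into three channels
        out.append(row)
    return out


# Synthetic 28x28 pattern standing in for an MNIST digit.
digit = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
img = to_rgb_32(digit)
print(len(img), len(img[0]), len(img[0][0]))
```

In practice an array library would do this in a few vectorized calls; the loop form is used here only to make each step explicit.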
Optimization utilizes the NADAM (Nesterov-accelerated Adam) optimizer as formulated in equations (1)–(5) of the source, but the paper does not report specific learning rates, batch sizes, or epochs. Dropout is applied within each Inception module, and batch normalization follows every convolutional layer, though drop-rate specifics are not enumerated.
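For orientation, one commonly cited simplified form of the Nadam update (without a momentum schedule) is sketched below for a single scalar parameter. It may differ in detail from the equations in the source paper, and the hyperparameter values shown are illustrative defaults, not settings reported for MBInception.

```python
import math


def nadam_step(theta, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nadam update for a scalar parameter; returns (theta, m, v).

    Hyperparameter defaults are illustrative, not values from the paper.
    """
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # Nesterov look-ahead: blend corrected momentum with the current gradient.
    nesterov = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    theta -= lr * nesterov / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise f(theta) = theta^2 (gradient 2 * theta) starting from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = nadam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(f"theta after 500 steps: {theta:.4f}")
```

The only difference from plain Adam is the `nesterov` term, which mixes the bias-corrected momentum with the current gradient before the step is taken.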
4. Empirical Performance and Comparative Analysis
Empirical evaluation of MBInception is conducted against VGG16, ResNet50, and MobileNet, with parameter counts detailed as follows:
| Model | Parameters (Approx.) |
|---|---|
| MobileNet | 4M |
| VGG16 | 14M |
| MBInception | 16M |
| ResNet50 | 24M |
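The relative sizes in the table can be read off with a few lines of arithmetic. The counts are the approximate values from the table above (in millions), and `params_m` is a hypothetical name used only for this illustration.

```python
# Approximate parameter counts in millions, copied from the table.
params_m = {"MobileNet": 4, "VGG16": 14, "MBInception": 16, "ResNet50": 24}

ref = params_m["MBInception"]
for name, p in sorted(params_m.items(), key=lambda kv: kv[1]):
    print(f"{name:<12} {p:>3}M  ratio to MBInception: {p / ref:.2f}")
```

On these figures MBInception is about two thirds the size of ResNet50 but slightly larger than VGG16 and four times the size of MobileNet, so the parameter-efficiency claim holds relative to ResNet50 specifically.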
Performance across benchmarks:
- CIFAR-10: VGG16 attains the highest accuracy (≈66.9%), with MBInception nearly equivalent (≈66.7%) and showing a competitive F1 score (65.1%).
- CIFAR-100: MBInception reports the highest Precision (≈0.4206) and F1 (≈0.0567) among the compared models.
- MNIST: MBInception leads on all metrics (Accuracy ≈99.22%, F1 ≈94.98%), with VGG16 and ResNet50 in the 96–99% accuracy range.
- Fashion-MNIST: MBInception again achieves the highest results (Accuracy ≈91.12%, F1 ≈46.08%).
Inference speed statistics are not reported. Across all tasks, MBInception consistently matches or exceeds the performance of larger models (ResNet50), and markedly outperforms MobileNet, while using substantially fewer parameters than ResNet50 (Froughirad et al., 2024).
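Accuracy, precision, and F1 for a multi-class classifier can all be derived from a confusion matrix. The sketch below uses macro averaging, which is an assumption: the source does not state which averaging scheme was used. The function names and the toy two-class matrix are illustrative only.

```python
def macro_scores(confusion):
    """Macro-averaged precision, recall and F1 from a square confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))   # column sum
        actual = sum(confusion[k])                           # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total


cm = [[8, 2], [1, 9]]   # toy two-class example, not data from the paper
print(accuracy(cm), macro_scores(cm))
```

Accuracy and macro F1 can diverge sharply when per-class performance is uneven, so the two figures are best read together.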
5. Analytical Insights and Architectural Trade-offs
MBInception's multi-block stacking of light-weight Inception modules is effective for extracting features at multiple scales, enabling the model to balance parameter count and accuracy. The use of batch normalization and dropout after every module provides robust regularization, helping mitigate overfitting. MBInception's parameter efficiency is notable: it attains or surpasses the accuracy of larger networks (such as ResNet50) on more complex datasets while employing fewer parameters.
Potential limitations or ambiguities include the lack of reported hyperparameter values (such as learning rate, batch size, and number of epochs), impeding reproducibility. Additionally, the precise internal architecture of each custom Inception Module (branch and filter configurations) is not detailed.
A plausible implication is that further ablation studies—specifically on the internal design of Inception modules and dropout rates—could yield additional improvements or more precise parameter-accuracy trade-offs.
6. Future Directions and Applications
Areas for future work noted include:
- Detailed ablation studies of per-branch filter sizes and dropout rates within each Inception module.
- Extension and adaptation of MBInception to higher-resolution datasets (e.g., ImageNet), as well as expansion towards tasks beyond classification, such as semantic segmentation or detection.
- Automated hyperparameter search for the optimal base filter count and stack depths, enhancing portability across domains and dataset sizes.
MBInception's straightforward, scalable design suggests adaptability to a broad range of image processing tasks, with the potential for further improvements via architectural and optimization refinements.
7. Significance and Positioning in Deep Learning
MBInception represents a practical evolution in the design of parameter-efficient deep learning architectures, reinforcing the utility of inception-style modules for multi-scale feature extraction. Among modern CNN architectures targeting compactness and accuracy, MBInception strikes a balance by systematically increasing capacity through block-stacking, delivering strong empirical performance across standard benchmarks. Its design reflects a trend toward modular, configurable neural networks that facilitate both efficient deployment and competitive accuracy in structured vision tasks (Froughirad et al., 2024).