DecomposeMe: Efficient Separable ConvNet Design
- DecomposeMe is a CNN architecture that factors 2D convolutions into sequential 1D filters with an intervening ReLU, significantly reducing parameter counts.
- The approach employs filter sharing across spatial positions, leading to reduced redundancy and lower computational overhead while maintaining model expressivity.
- Empirical results on benchmarks like ImageNet and Places2 confirm that DecomposeMe enhances generalization and efficiency in diverse network configurations.
DecomposeMe is a convolutional neural network (ConvNet) architecture modification that imposes a hard separability constraint at the level of convolutional filters, directly learning representations as compositions of 1D convolutions. This method offers substantial reductions in parameter count while maintaining or improving classification accuracy. DecomposeMe employs filter sharing across spatial positions and introduces nonlinearity (ReLU) between sequential 1D convolutions, increasing network depth and expressivity with minimal computational overhead. Comprehensive experiments on large-scale recognition benchmarks such as ImageNet and Places2 demonstrate the method’s capacity for high efficiency and strong generalization, all without post-training fine-tuning or approximations (Alvarez et al., 2016).
1. Foundational Concepts and Core Methodology
DecomposeMe enforces a hard separability constraint, parametrizing every 2D convolutional kernel as the composition of two 1D filters: one vertical ($d \times 1$) and one horizontal ($1 \times d$). In contrast to low-rank approximation approaches that train full 2D filters and subsequently decompose them, DecomposeMe trains the decomposed 1D filters directly, end-to-end.
Filter sharing is enforced within each layer by reusing the same bank of 1D filters across all spatial positions, removing redundant parameters and thereby reducing model complexity. An interposed nonlinearity—specifically, a ReLU activation—between the vertical and horizontal convolutions increases the effective nonlinear depth of the model, offering additional expressivity without enlarging the parameter budget.
2. Mathematical Formulation
A standard ConvNet layer with weights $W \in \mathbb{R}^{C \times d \times d \times F}$ (where $C$ and $F$ are the input and output channel dimensions and $d$ is the spatial kernel size) learns each 2D kernel $W_f \in \mathbb{R}^{C \times d \times d}$. Low-rank approximations write $W_f \approx \sum_{k=1}^{K} \sigma_k \, u_k v_k^{\top}$, but this decomposition is post hoc and only approximate.
Instead, DecomposeMe constrains every 2D filter to be a composition of two 1D filter banks $\{\bar{v}_k\}_{k=1}^{K}$ (vertical, $d \times 1$) and $\{\bar{h}_f\}_{f=1}^{F}$ (horizontal, $1 \times d$), learned end-to-end. For input feature maps $x^c$, the $f$-th output map is given by

$$y_f = \sum_{k=1}^{K} \bar{h}_f^{\,k} * \varphi\!\left( \sum_{c=1}^{C} \bar{v}_k^{\,c} * x^c + b_k \right),$$

where $*$ denotes 1D convolution, $\varphi$ is the ReLU nonlinearity, and $K$ is the number of intermediate 1D filters.
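The formulation above can be checked numerically. The minimal single-channel sketch below (pure Python, illustrative function names, "valid" padding) applies a vertical 1D filter, a ReLU, then a horizontal 1D filter. Without the ReLU, the two 1D passes reproduce exactly the 2D convolution with the rank-1 kernel $\bar{v}\bar{h}^{\top}$, which is precisely what the hard separability constraint encodes:

```python
def conv2d_valid(x, k):
    """'Valid' 2D correlation of map x (list of rows) with 2D kernel k."""
    H, W, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(W - kw + 1)]
            for i in range(H - kh + 1)]

def conv_vertical(x, v):
    """Correlate each column with the d x 1 kernel v (shrinks height)."""
    H, W, d = len(x), len(x[0]), len(v)
    return [[sum(x[i + a][j] * v[a] for a in range(d))
             for j in range(W)]
            for i in range(H - d + 1)]

def conv_horizontal(x, h):
    """Correlate each row with the 1 x d kernel h (shrinks width)."""
    H, W, d = len(x), len(x[0]), len(h)
    return [[sum(x[i][j + b] * h[b] for b in range(d))
             for j in range(W - d + 1)]
            for i in range(H)]

def relu(x):
    """Pointwise ReLU over a 2D map."""
    return [[max(0.0, u) for u in row] for row in x]

def decomposed_response(x, v, h):
    """One DecomposeMe path (single channel, K = 1):
    vertical 1D conv -> ReLU -> horizontal 1D conv."""
    return conv_horizontal(relu(conv_vertical(x, v)), h)
```

Inserting the ReLU between the two 1D passes breaks the exact rank-1 equivalence, which is the point: it adds nonlinear depth that a plain rank-1 2D filter cannot express.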
3. Network Architecture Modifications
DecomposeMe conversion of any $C \times d \times d \times F$ convolutional layer consists of:
- A vertical 1D convolution ($d \times 1$ kernels, $C$ input channels, $K$ output channels).
- An intervening ReLU nonlinearity.
- A horizontal 1D convolution ($1 \times d$ kernels, $K$ input channels, $F$ output channels).
The number of output channels remains unchanged. Filter sharing is mandatory: the same 1D filter bank is used at every spatial location in the layer. Architectural features such as pooling, batch normalization, and dropout are retained as in the source network. For compact variants, the two large fully connected layers are removed, with the last convolution output flattened directly for final classification.
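For concreteness, the shape bookkeeping of a converted module can be sketched as follows (a hypothetical helper assuming "valid" padding; the real layers also carry biases and any retained batch normalization):

```python
def converted_layer_shapes(C, d, F, K, H, W):
    """Trace feature-map shapes through vertical conv -> ReLU -> horizontal conv.

    A C-channel H x W input passes through K vertical d x 1 filters,
    a ReLU (shape-preserving), and F horizontal 1 x d filters.
    """
    after_vertical = (K, H - d + 1, W)            # d x 1 kernels shrink height only
    after_relu = after_vertical                   # pointwise nonlinearity
    after_horizontal = (F, H - d + 1, W - d + 1)  # 1 x d kernels shrink width only
    return [("input", (C, H, W)),
            ("vertical 1D conv", after_vertical),
            ("ReLU", after_relu),
            ("horizontal 1D conv", after_horizontal)]
```

Note that the output channel count $F$ matches the original layer, so the converted module is a drop-in replacement within the surrounding architecture.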
4. Parameter Efficiency and Expressivity
The parameter count for a standard 2D convolutional layer is $C \cdot d^2 \cdot F$. For a DecomposeMe layer it is

$$d \cdot K \cdot C + d \cdot K \cdot F = dK(C + F).$$

The reduction in parameters is substantial when $K \ll dCF/(C+F)$. For example, for a VGG-style configuration ($C = F = 512$, $d = 3$, $K = 512$), DecomposeMe reduces parameters by approximately 33% compared to the original layer. The explicit percentage reduction is

$$1 - \frac{dK(C+F)}{d^2 C F} = 1 - \frac{K(C+F)}{dCF}.$$

In typical settings, one selects $K$ on the order of $F$ or smaller to balance expressivity with model compression.
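The arithmetic can be verified directly; the helper names below are illustrative:

```python
def params_conv2d(C, d, F):
    """Weights of a standard C x d x d x F convolutional layer (biases omitted)."""
    return C * d * d * F

def params_decomposeme(C, d, F, K):
    """Weights of the replacement: K vertical C x d x 1 filters
    plus F horizontal K x 1 x d filters (biases omitted)."""
    return d * K * C + d * K * F  # = d * K * (C + F)

def reduction(C, d, F, K):
    """Fractional parameter reduction of the decomposed layer."""
    return 1.0 - params_decomposeme(C, d, F, K) / params_conv2d(C, d, F)
```

For the VGG-style example ($C = F = 512$, $d = 3$, $K = 512$), the standard layer has 2,359,296 weights, the decomposed layer 1,572,864, a reduction of exactly 1/3, matching the ~33% figure above.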
5. Training Regimen and Hyperparameter Configuration
DecomposeMe networks are trained in Torch-7 from scratch (no pretraining), using stochastic gradient descent with momentum 0.9 and weight decay, starting from a learning rate of 0.01 that is decreased on plateau. Data augmentation consists of random cropping and horizontal flipping with probability 0.5. Batch sizes vary by architecture: AlexNet-style models use 96 per GPU, VGG-B variants use 24 per GPU, and compact DecomposeMe variants use batch sizes up to 256, leveraging the reduced memory footprint. Dropout is omitted in the compact variants' final classifier due to their already low parameter counts.
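The recipe above can be summarized as a configuration sketch (an illustrative Python dict, not the authors' Torch-7 code; the weight-decay constant is not specified in this text and is left unset):

```python
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": None,          # value not specified here
    "initial_lr": 0.01,
    "lr_schedule": "decrease on plateau",
    "augmentation": {"random_crop": True, "horizontal_flip_p": 0.5},
    "batch_size_per_gpu": {        # varies by architecture
        "alexnet_style": 96,
        "vgg_b_variant": 24,
        "compact_decomposeme": 256,  # upper bound reported
    },
    "pretraining": None,           # all models trained from scratch
}
```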
6. Empirical Performance on Benchmarks
DecomposeMe achieves performance competitive with, or superior to, standard architectures while dramatically reducing parameter counts. The following table summarizes selected empirical results:
| Architecture | Top-1 Accuracy | Conv+FC Params (M) | Relative Reduction |
|---|---|---|---|
| VGG-B (ImageNet full) | 62.5% | 9.4 + 123.5 | Baseline |
| DecomposeMe (full) | 57.8% | 2.4 + 123.5 | –75% conv |
| VGG-B (compact) | 61.1% | 9.4 + 25.0 | |
| DecomposeMe | 65.4% | 7.0 + 8.2 | –26% conv, –67% FC |
| DecomposeMe | 66.2% | 7.0 + 0.5 | |
| VGG-B (Places2 full) | 44.0% | 9.4 + 121 | Baseline |
| DecomposeMe | 47.4% | 7.0 + 3.2 | –92% total |
On ImageNet 2012, DecomposeMe variants outperformed or matched their baselines (best results: 61.8% Top-1 with 15% fewer convolutional parameters, and 66.2% Top-1). On Places2, DecomposeMe yielded a relative Top-1 accuracy increase of approximately +7.7% with 92% fewer parameters than VGG-B. In stereo matching on the KITTI 2012 benchmark, a DecomposeMe MC-CNN variant achieved comparable matching error rates with up to 90% parameter reduction.
In all settings experimentally explored, DecomposeMe variants met or exceeded baseline accuracy, significantly reduced model size, and frequently exhibited smaller train-validation performance gaps.
7. Application to Diverse Networks and Tasks
DecomposeMe’s procedure is broadly applicable:
- Full conversion of VGG-B (all conv layers replaced with DecomposeMe modules) allowed larger batch sizes during training and, in compact form, outperformed the original in classification accuracy.
- When applied to MC-CNN feature extractors for stereo matching, parameter count was reduced by an order of magnitude with only negligible increases in error rate.
- The architecture promotes rapid experimentation and efficient deployment, especially for memory- or computation-constrained applications.
8. Limitations and Prospects for Further Development
Principal limitations include:
- The method yields only modest speedup in the first conv layer when the number of input channels is small (e.g., RGB).
- The choice of $K$, the intermediate filter count, is a crucial but currently manual hyperparameter, trading off expressivity against compression. Automated or adaptive selection of $K$ per layer is an open challenge.
- Omitting the intermediate ReLU drastically degrades performance, confirming that increased nonlinear depth is essential.
- Application beyond classification and stereo tasks (e.g., detection, segmentation, generative models) remains unexplored and constitutes a direction for future research.
DecomposeMe establishes a paradigm for hard-separable, nonlinear convolutional architectures that balance compactness with accuracy, eliminating the need for post hoc low-rank approximations and providing a foundation for efficient ConvNet design (Alvarez et al., 2016).