LeanConvNets: Efficient CNN Architectures
- LeanConvNets are efficient CNN architectures that reduce computational costs by replacing full convolutions with sparsified spatial operators and 1x1 channel mixing.
- They enable tunable trade-offs between accuracy and efficiency, achieving 3–10× reductions in weights and FLOPs compared to standard designs.
- Empirical evaluations show LeanConvNets perform competitively on datasets such as ImageNet, often outperforming models like MobileNetV2 and ShuffleNetV2.
LeanConvNets are a family of convolutional neural architectures designed to achieve significant reductions in computational cost and parameter count while attaining accuracy competitive with state-of-the-art dense Convolutional Neural Networks (CNNs). The fundamental idea is to replace standard fully-coupled spatial convolutions with sparsified convolution operators that sum low-cost grouped spatial convolutions with full point-wise channel mixing. This framework introduces tunable architectural parameters that allow users to balance efficiency and accuracy. LeanConvNets are readily integrable into popular backbones such as ResNet, yielding models that require 3–10× fewer weights and FLOPs, and often outperform architectures like MobileNetV2 and ShuffleNetV2 under similar budget constraints (Ephrath et al., 2019).
1. Lean Convolutional Operators: Definition and Mechanics
Traditional convolutional layers in CNNs employ fully-coupled spatial kernels that jointly mix all input and output channels at each spatial location. For a $3\times 3$ kernel with $c$ input and $c$ output channels, the total parameter and FLOP count per spatial position is $9c^2$.
LeanConvNets introduce "lean convolution" operators, which decompose the convolution into two additive parts:
- A grouped or depth-wise spatial convolution with a sparsified stencil (e.g., five-point or three-point separable), applied independently within groups of channels or per-channel ("depth-wise").
- A full pointwise convolution that couples all channels at every spatial position.
Formally, for each output feature $y_j$,

$$y_j \;=\; K_j * x_{\gamma(j)} \;+\; \sum_{i=1}^{c} w_{ji}\, x_i,$$

where $K_j$ is the grouped spatial kernel acting on the channel group $\gamma(j)$ and $w_{ji}$ are the pointwise weights (Ephrath et al., 2019). For depth-wise plus pointwise with a five-point stencil (for channel $i$),

$$(K_i * x_i)(p, q) \;=\; \sum_{(\delta_p,\, \delta_q) \,\in\, \{(-1,0),\,(1,0),\,(0,-1),\,(0,1)\}} k_{i,(\delta_p,\delta_q)}\; x_i(p + \delta_p,\, q + \delta_q),$$

and pointwise parameters $w_{ji}$ encode channel mixing, including the stencil center (Ephrath et al., 2019).
This decomposition splits the modeling capacity into spatial locality (grouped/structured) and global channel fusion (pointwise), offering both interpretability and efficiency.
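As an illustration, the operator can be assembled from off-the-shelf layers: a depth-wise 3×3 convolution whose center tap is masked out (five-point stencil), summed with a 1×1 pointwise convolution. The sketch below is one possible realization under these assumptions, not the authors' reference implementation; all identifiers are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeanConvFromPrimitives(nn.Module):
    """Additive lean convolution built from standard layers: a depth-wise 3x3
    convolution with its center tap masked out (five-point stencil), summed
    with a 1x1 pointwise convolution that carries all channel mixing."""

    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                            groups=channels, bias=False)   # grouped spatial part
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # channel mixing
        # Keep only the N/W/E/S taps; the center is absorbed by the 1x1 term.
        mask = torch.tensor([[0., 1., 0.],
                             [1., 0., 1.],
                             [0., 1., 0.]])
        self.register_buffer("mask", mask.view(1, 1, 3, 3))

    def forward(self, x):
        # Mask at forward time so the excluded taps stay zero during training.
        spatial = F.conv2d(x, self.dw.weight * self.mask, padding=1,
                           groups=x.shape[1])
        return spatial + self.pw(x)
```

Note that the two branches are summed, not stacked serially as in MobileNetV2-style separable blocks.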
2. Computational Complexity and Efficiency Gains
The principal efficiency of LeanConvNets derives from two orthogonal savings:
- Sparsified spatial coupling: Under grouping with $g$ groups, the number of spatial kernel parameters and associated multiplications is reduced by a factor of $g$.
- Pointwise efficiency: The $1\times 1$ convolution retains all channel-to-channel expressivity with minimal spatial cost.
Compared to the full $3\times 3$ convolution with $c$ channels, which uses $9c^2$ weights, the lean operator with $g$ groups uses

$$c^2 + \frac{4c^2}{g}$$

weights, where the factor $4$ (rather than $5$) reflects that the five-point stencil excludes the spatial center, moved to the pointwise term. E.g., for $g = c$ (depth-wise), a five-point lean layer uses $c^2 + 4c$ weights. FLOPs per spatial position scale accordingly.
The relative parameter and FLOP reduction compared to standard convolution is

$$\frac{9c^2}{c^2\,(1 + 4/g)} \;=\; \frac{9}{1 + 4/g}.$$

For sufficiently large $g$, this approaches a $9\times$ reduction per layer, consistent with the $3$–$10\times$ savings reported for typical configurations (Ephrath et al., 2019). Empirical latency measurements confirm substantial wall-clock reductions, especially when using fused CUDA implementations.
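As a quick arithmetic check, the following standalone sketch (function names are illustrative) reproduces these ratios for a few representative settings:

```python
def full_conv_params(c, k=3):
    """Weights in a standard k x k convolution with c input and c output channels."""
    return k * k * c * c

def lean_conv_params(c, g):
    """Weights in a lean convolution: 1x1 pointwise (c^2) plus a five-point
    grouped stencil (4 off-center taps, groups of size c/g): 4 * c^2 / g."""
    return c * c + 4 * c * c // g

for c, g in [(64, 16), (256, 16), (256, 256)]:
    full, lean = full_conv_params(c), lean_conv_params(c, g)
    print(f"c={c:4d} g={g:3d}: {full:8d} -> {lean:7d} weights "
          f"({full / lean:.1f}x reduction)")
```

For $g = 16$ the per-layer reduction is $9/1.25 = 7.2\times$; the depth-wise limit ($g = c$) approaches $9\times$.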
3. Integration into Canonical Architectures
The modularity of the lean convolution operator allows straightforward integration into standard CNN backbones:
- Residual Networks (ResNet): Each convolution in the classic pre-activation block is replaced with a lean convolution, without modifying the placement of batch normalization or nonlinearity. The three-layer bottleneck variant is handled analogously: only the mid-layer is swapped, with the $1\times 1$ projections retained in full (Ephrath et al., 2019).
- Semantic Segmentation Backbones: Lean convolutions can be inserted into encoder–decoder structures (e.g., U-Net, DeepLabV3) with minimal loss in mIoU (Ephrath et al., 2019).
If the architecture requires a channel-count or stride change, the standard $1\times 1$ shortcut projections remain unaffected.
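A minimal sketch of the substitution in a two-layer pre-activation residual block follows. The `LeanConv` module is a compact depth-wise realization of the operator, inlined to keep the example self-contained; identifiers are illustrative, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeanConv(nn.Module):
    """Depth-wise five-point stencil plus 1x1 pointwise mixing (inlined sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1,
                            groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        mask = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
        self.register_buffer("mask", mask.view(1, 1, 3, 3))

    def forward(self, x):
        spatial = F.conv2d(x, self.dw.weight * self.mask, padding=1,
                           groups=x.shape[1])
        return spatial + self.pw(x)

class LeanPreActBlock(nn.Module):
    """Pre-activation residual block with lean convolutions swapped in;
    batch-norm and ReLU placement is unchanged from the standard block."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv1 = LeanConv(channels)
        self.conv2 = LeanConv(channels)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # identity shortcut; stride/channel changes keep full projections
```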
4. Benchmark Results and Empirical Performance
LeanConvNets achieve competitive or superior accuracy to compact CNN variants under matched FLOP and parameter budgets. On CIFAR-10, CIFAR-100, STL-10, and ImageNet, LeanConvNet variants match or slightly outperform MobileNetV2 and ShuffleNetV2.
Selected benchmark results are summarized below (Ephrath et al., 2019):
| Architecture | Params (M) | FLOPs (M) | CIFAR-10 (%) | CIFAR-100 (%) | ImageNet Top-1 (%) |
|---|---|---|---|---|---|
| ResNet-34 (full) | 21.8 | 3600 | — | — | 74.0 |
| LeanResNet-34 (5-pt, g=16)† | 4.1 | 36.0 | — | — | — |
| LeanResNeXt-34 (5-pt, g=16) | 3.9 | 630 | — | — | 72.1 |
| MobileNetV2 1.0× | 3.47 | 301 | — | — | 71.9 |
| LeanRes24 5-pt (DW) | 0.53 | 26 | 92.8 | 74.3 | — |

† Evaluated on Cityscapes semantic segmentation (60.2 mIoU); listed for budget comparison rather than as an ImageNet result.
Key findings:
- LeanResNet-34 reduced parameter count by 5× yet maintained 95% of semantic segmentation mIoU on Cityscapes.
- On ImageNet, LeanResNeXt-34 (5-pt, grouped) with 3.9M parameters matched the top-1 accuracy of MobileNetV2 1.0×.
- Across datasets, LeanConvNet accuracy is robust to reductions in spatial kernel richness, provided the channel mixing is maintained.
5. Operator Variants and Tuning
LeanConvNet efficiency and accuracy are tunable via:
- Group count ($g$): Controls the trade-off between spatial coupling and overall leanness. Larger $g$ yields higher savings but may degrade accuracy if spatial expressivity is insufficient.
- Stencil width: Five-point vs. three-point separable kernels offer varying degrees of memory and compute efficiency. Five-point maintains greater spatial capacity at $4/9$ the weight of a full $3\times 3$ kernel (the center tap lives in the pointwise term), while three-point enables better memory alignment on GPU/CPU at a slight accuracy cost.
- Layerwise customization: Early layers, where the channel count $c$ is small, can retain full convolutions; later layers, with large $c$, benefit maximally from lean variants.
Recommended practice: choose a moderate group count for most vision backbones (the benchmarks above use $g = 16$), adjust $g$ per block as needed, and tune learning schedules to compensate for the changed parameterization (Ephrath et al., 2019).
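These tuning knobs map naturally onto constructor arguments. The sketch below (illustrative, not reference code) exposes $g$ and the stencil choice; note the three-point mask shown is a single horizontal 1D stencil, whereas a fully separable design would pair or alternate orientations across layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stencil masks over a 3x3 neighborhood: five-point keeps the N/W/E/S taps,
# three-point keeps only W/E; the center always lives in the 1x1 term.
STENCILS = {
    "five_point":  torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]),
    "three_point": torch.tensor([[0., 0., 0.], [1., 0., 1.], [0., 0., 0.]]),
}

class TunableLeanConv2d(nn.Module):
    """Lean convolution with tunable group count g and stencil choice."""
    def __init__(self, channels, g=16, stencil="five_point"):
        super().__init__()
        assert channels % g == 0, "channel count must be divisible by g"
        self.g = g
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=g, bias=False)
        self.register_buffer("mask", STENCILS[stencil].view(1, 1, 3, 3))

    def forward(self, x):
        grouped = F.conv2d(x, self.spatial.weight * self.mask,
                           padding=1, groups=self.g)
        return self.pw(x) + grouped
```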
6. Implementation and Deployment Considerations
LeanConvNet modules are readily instantiated in major deep learning frameworks:
- PyTorch: Compose a 1×1 convolution and a depth-wise 3×3 convolution with masked weights for the spatial stencil, summing their outputs (Ephrath et al., 2019).
- TensorFlow/Keras: Implement a `Conv2D(1, 1)` alongside a masked `DepthwiseConv2D(kernel_size=3)`, summing their outputs.
A practical example for PyTorch:
```python
import torch
import torch.nn as nn

class LeanConv2d(nn.Module):
    """Lean convolution: 1x1 pointwise mixing plus a depth-wise five-point stencil."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # The depth-wise branch is added to the pointwise output, so this
        # sketch requires matching channel counts.
        assert c_in == c_out, "depth-wise branch assumes c_in == c_out"
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        # Four off-center stencil weights per channel (N, W, E, S neighbors);
        # the center tap is absorbed into the pointwise term.
        self.dw_weights = nn.Parameter(torch.zeros(c_in, 4))
        self.offsets = [(-1, 0), (0, -1), (0, 1), (1, 0)]

    def forward(self, x):
        y_pw = self.pw(x)
        y_dw = torch.zeros_like(x)
        for idx, (dp, dq) in enumerate(self.offsets):
            # torch.roll shifts the map so each pixel sees its (dp, dq) neighbor;
            # this wraps at the borders (circular rather than zero padding).
            y_dw = y_dw + self.dw_weights[:, idx].view(1, -1, 1, 1) * \
                torch.roll(x, shifts=(dp, dq), dims=(2, 3))
        return y_pw + y_dw
```
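A quick usage and shape check for the module above:

```python
x = torch.randn(2, 64, 32, 32)   # batch of 2, 64 channels, 32x32 feature maps
layer = LeanConv2d(64, 64)
y = layer(x)
print(y.shape)                   # torch.Size([2, 64, 32, 32])
# Parameter count: 64*64 pointwise + 64*4 stencil = 4352 weights,
# versus 9*64*64 = 36864 for a full 3x3 convolution.
print(sum(p.numel() for p in layer.parameters()))
```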
A major consideration is that actual hardware speedup may lag the reduction in FLOP count unless efficient fused implementations are available. On typical accelerators, a fused LeanConv kernel is observed to be 2–6× faster per layer than a serial "1×1 then depth-wise" stack at moderate channel counts (Ephrath et al., 2019).
7. Comparative Perspective and Practical Relevance
LeanConvNets provide a systematic pathway to sparsify standard CNN operators, outperforming baseline strategies such as grouped convolutions or depth-wise separable convolutions in several empirical settings. Unlike MobileNetV2 or ShuffleNetV2, which use serial depth-wise and point-wise layers, LeanConvNet fuses spatial and channel mixing additively; this increases expressive power per parameter at comparable computational cost (Ephrath et al., 2019).
Main advantages:
- Routine substitution for existing dense CNNs, without layerwise architectural redesign.
- Consistent accuracy under parameter and FLOP constraints, often exceeding other lightweight designs.
- Adaptive trade-off between spatial depth and channel coupling enabled by $g$ and stencil choice.
Limitations include modest accuracy losses on tasks requiring fine spatial–channel mixing, and a dependency on low-level implementation efficiency for realizing the theoretical speedups in latency (Ephrath et al., 2019).
In summary, LeanConvNets constitute a unifying and flexible framework for constructing low-cost CNNs by combining stenciled spatial filtering with pointwise channel fusion, backed by extensive empirical validation and practical implementations (Ephrath et al., 2019).