LeanConvNets: Efficient CNN Architectures
- LeanConvNets are efficient CNN architectures that reduce computational costs by replacing full convolutions with sparsified spatial operators and 1x1 channel mixing.
- They enable tunable trade-offs between accuracy and efficiency, achieving 3–10× reductions in weights and FLOPs compared to standard designs.
- Empirical evaluations show LeanConvNets perform competitively on datasets such as ImageNet, often outperforming models like MobileNetV2 and ShuffleNetV2.
LeanConvNets are a family of convolutional neural architectures designed to achieve significant reductions in computational cost and parameter count while attaining accuracy competitive with state-of-the-art dense Convolutional Neural Networks (CNNs). The fundamental idea is to replace standard fully-coupled spatial convolutions with sparsified convolution operators that sum low-cost grouped spatial convolutions with full point-wise channel mixing. This framework introduces tunable architectural parameters that allow users to balance efficiency and accuracy. LeanConvNets are readily integrable into popular backbones such as ResNet, yielding models that require 3–10× fewer weights and FLOPs, and often outperform architectures like MobileNetV2 and ShuffleNetV2 under similar budget constraints (Ephrath et al., 2019).
1. Lean Convolutional Operators: Definition and Mechanics
Traditional convolutional layers in CNNs employ fully-coupled spatial kernels that jointly mix all input and output channels at each spatial location. For a $3\times 3$ kernel with $c$ input and $c$ output channels, the total parameter and FLOP count per spatial position is $9c^2$.
LeanConvNets introduce "lean convolution" operators, which decompose the convolution into two additive parts:
- A grouped or depth-wise spatial convolution with a sparsified stencil (e.g., five-point or three-point separable), applied independently within groups of channels or per-channel ("depth-wise").
- A full pointwise convolution that couples all channels at every spatial position.
Formally, for each output feature $y_j$,

$$y_j \;=\; K_j * x_{\gamma(j)} \;+\; \sum_{i=1}^{c} w_{ji}\, x_i,$$

where $K_j$ is the grouped spatial kernel acting on the channel group $\gamma(j)$ and $w_{ji}$ are the pointwise weights (Ephrath et al., 2019). For depth-wise plus pointwise with a five-point stencil (for channel $i$),

$$(K_i * x_i)(p, q) \;=\; \sum_{(\delta_p,\, \delta_q) \,\in\, \{(-1,0),\,(1,0),\,(0,-1),\,(0,1)\}} k_{i,(\delta_p,\delta_q)}\; x_i(p + \delta_p,\, q + \delta_q),$$

and pointwise parameters $w_{ji}$ encode channel mixing, including the stencil center (Ephrath et al., 2019).
This decomposition splits the modeling capacity into spatial locality (grouped/structured) and global channel fusion (pointwise), offering both interpretability and efficiency.
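As an illustration, the operator can be assembled from off-the-shelf layers: a depth-wise 3×3 convolution whose center tap is masked out (five-point stencil), summed with a 1×1 pointwise convolution. The sketch below is one possible realization under these assumptions, not the authors' reference implementation; all identifiers are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeanConvFromPrimitives(nn.Module):
    """Additive lean convolution built from standard layers: a depth-wise 3x3
    convolution with its center tap masked out (five-point stencil), summed
    with a 1x1 pointwise convolution that carries all channel mixing."""

    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                            groups=channels, bias=False)   # grouped spatial part
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # channel mixing
        # Keep only the N/W/E/S taps; the center is absorbed by the 1x1 term.
        mask = torch.tensor([[0., 1., 0.],
                             [1., 0., 1.],
                             [0., 1., 0.]])
        self.register_buffer("mask", mask.view(1, 1, 3, 3))

    def forward(self, x):
        # Mask at forward time so the excluded taps stay zero during training.
        spatial = F.conv2d(x, self.dw.weight * self.mask, padding=1,
                           groups=x.shape[1])
        return spatial + self.pw(x)
```

Note that the two branches are summed, not stacked serially as in MobileNetV2-style separable blocks.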
2. Computational Complexity and Efficiency Gains
The principal efficiency of LeanConvNets derives from two orthogonal savings:
- Sparsified spatial coupling: Under grouping with $g$ groups, the number of spatial kernel parameters and associated multiplications is reduced by a factor of $g$.
- Pointwise efficiency: The $1\times 1$ convolution retains all channel-to-channel expressivity with minimal spatial cost.
Compared to the full $3\times 3$ convolution with $c$ channels, which uses $9c^2$ weights, the lean operator with $g$ groups uses

$$c^2 + \frac{4c^2}{g}$$

weights, where the factor $4$ (rather than $5$) reflects that the five-point stencil excludes the spatial center, moved to the pointwise term. E.g., for $g = c$ (depth-wise), a five-point lean layer uses $c^2 + 4c$ weights. FLOPs per spatial position scale accordingly.
The relative parameter and FLOP reduction compared to standard convolution is

$$\frac{9c^2}{c^2\,(1 + 4/g)} \;=\; \frac{9}{1 + 4/g}.$$

For sufficiently large $g$, this approaches a $9\times$ reduction per layer, consistent with the $3$–$10\times$ savings reported for typical configurations (Ephrath et al., 2019). Empirical latency measurements confirm substantial wall-clock reductions, especially when using fused CUDA implementations.
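As a quick arithmetic check, the following standalone sketch (function names are illustrative) reproduces these ratios for a few representative settings:

```python
def full_conv_params(c, k=3):
    """Weights in a standard k x k convolution with c input and c output channels."""
    return k * k * c * c

def lean_conv_params(c, g):
    """Weights in a lean convolution: 1x1 pointwise (c^2) plus a five-point
    grouped stencil (4 off-center taps, groups of size c/g): 4 * c^2 / g."""
    return c * c + 4 * c * c // g

for c, g in [(64, 16), (256, 16), (256, 256)]:
    full, lean = full_conv_params(c), lean_conv_params(c, g)
    print(f"c={c:4d} g={g:3d}: {full:8d} -> {lean:7d} weights "
          f"({full / lean:.1f}x reduction)")
```

For $g = 16$ the per-layer reduction is $9/1.25 = 7.2\times$; the depth-wise limit ($g = c$) approaches $9\times$.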
3. Integration into Canonical Architectures
The modularity of the lean convolution operator allows straightforward integration into standard CNN backbones:
- Residual Networks (ResNet): Each convolution in the classic pre-activation block is replaced with a lean convolution, without modifying the placement of batch normalization or nonlinearity. The three-layer bottleneck variant is handled analogously: only the mid-layer is swapped, with the $1\times 1$ projections retained in full (Ephrath et al., 2019).
- Semantic Segmentation Backbones: Lean convolutions can be inserted into encoder–decoder structures (e.g., U-Net, DeepLabV3) with minimal loss in mIoU (Ephrath et al., 2019).
If the architecture requires a channel-count or stride change, the standard $1\times 1$ shortcut projections remain unaffected.
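A minimal sketch of the substitution in a two-layer pre-activation residual block follows. The `LeanConv` module is a compact depth-wise realization of the operator, inlined to keep the example self-contained; identifiers are illustrative, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeanConv(nn.Module):
    """Depth-wise five-point stencil plus 1x1 pointwise mixing (inlined sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1,
                            groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        mask = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
        self.register_buffer("mask", mask.view(1, 1, 3, 3))

    def forward(self, x):
        spatial = F.conv2d(x, self.dw.weight * self.mask, padding=1,
                           groups=x.shape[1])
        return spatial + self.pw(x)

class LeanPreActBlock(nn.Module):
    """Pre-activation residual block with lean convolutions swapped in;
    batch-norm and ReLU placement is unchanged from the standard block."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv1 = LeanConv(channels)
        self.conv2 = LeanConv(channels)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # identity shortcut; stride/channel changes keep full projections
```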
4. Benchmark Results and Empirical Performance
LeanConvNets achieve competitive or superior accuracy to compact CNN variants under matched FLOP and parameter budgets. On CIFAR-10, CIFAR-100, STL-10, and ImageNet, LeanConvNet variants match or slightly outperform MobileNetV2 and ShuffleNetV2.
Selected benchmark results are summarized below (Ephrath et al., 2019):
| Architecture | Params (M) | FLOPs (M) | CIFAR-10 (%) | CIFAR-100 (%) | ImageNet Top-1 (%) |
|---|---|---|---|---|---|
| ResNet-34 (full) | 21.8 | 3600 | — | — | 74.0 |
| LeanResNet-34 (5-pt, g=16)† | 4.1 | 36.0 | — | — | — |
| LeanResNeXt-34 (5-pt, g=16) | 3.9 | 630 | — | — | 72.1 |
| MobileNetV2 1.0× | 3.47 | 301 | — | — | 71.9 |
| LeanRes24 5-pt (DW) | 0.53 | 26 | 92.8 | 74.3 | — |

† Evaluated on Cityscapes semantic segmentation (60.2 mIoU); listed for budget comparison rather than as an ImageNet result.
Key findings:
- LeanResNet-34 reduced parameter count by 5× yet maintained 95% of semantic segmentation mIoU on Cityscapes.
- On ImageNet, LeanResNeXt-34 (5-pt, grouped) with 3.9M parameters matched the top-1 accuracy of MobileNetV2 1.0×.
- Across datasets, LeanConvNet accuracy is robust to reductions in spatial kernel richness, provided the channel mixing is maintained.
5. Operator Variants and Tuning
LeanConvNet efficiency and accuracy are tunable via:
- Group count ($g$): Controls the trade-off between spatial coupling and overall leanness. Larger $g$ yields higher savings but may degrade accuracy if spatial expressivity is insufficient.
- Stencil width: Five-point vs. three-point separable kernels offer varying degrees of memory and compute efficiency. Five-point maintains greater spatial capacity at $4/9$ the weight of a full $3\times 3$ kernel (the center tap lives in the pointwise term), while three-point enables better memory alignment on GPU/CPU at a slight accuracy cost.
- Layerwise customization: Early layers, where the channel count $c$ is small, can retain full convolutions; later layers, with large $c$, benefit maximally from lean variants.
Recommended practice: choose a moderate group count for most vision backbones (the benchmarks above use $g = 16$), adjust $g$ per block as needed, and tune learning schedules to compensate for the changed parameterization (Ephrath et al., 2019).
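These tuning knobs map naturally onto constructor arguments. The sketch below (illustrative, not reference code) exposes $g$ and the stencil choice; note the three-point mask shown is a single horizontal 1D stencil, whereas a fully separable design would pair or alternate orientations across layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stencil masks over a 3x3 neighborhood: five-point keeps the N/W/E/S taps,
# three-point keeps only W/E; the center always lives in the 1x1 term.
STENCILS = {
    "five_point":  torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]),
    "three_point": torch.tensor([[0., 0., 0.], [1., 0., 1.], [0., 0., 0.]]),
}

class TunableLeanConv2d(nn.Module):
    """Lean convolution with tunable group count g and stencil choice."""
    def __init__(self, channels, g=16, stencil="five_point"):
        super().__init__()
        assert channels % g == 0, "channel count must be divisible by g"
        self.g = g
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=g, bias=False)
        self.register_buffer("mask", STENCILS[stencil].view(1, 1, 3, 3))

    def forward(self, x):
        grouped = F.conv2d(x, self.spatial.weight * self.mask,
                           padding=1, groups=self.g)
        return self.pw(x) + grouped
```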
6. Implementation and Deployment Considerations
LeanConvNet modules are readily instantiated in major deep learning frameworks:
- PyTorch: Compose a 1×1 convolution and a depth-wise 3×3 convolution with masked weights for the spatial stencil, summing their outputs (Ephrath et al., 2019).
- TensorFlow/Keras: Implement a `Conv2D(1, 1)` alongside a masked `DepthwiseConv2D(kernel_size=3)`, summing their outputs.
A practical example for PyTorch:
```python
import torch
import torch.nn as nn

class LeanConv2d(nn.Module):
    """Lean convolution: 1x1 pointwise mixing plus a depth-wise five-point stencil."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # The depth-wise branch is added to the pointwise output, so this
        # sketch requires matching channel counts.
        assert c_in == c_out, "depth-wise branch assumes c_in == c_out"
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        # Four off-center stencil weights per channel (N, W, E, S neighbors);
        # the center tap is absorbed into the pointwise term.
        self.dw_weights = nn.Parameter(torch.zeros(c_in, 4))
        self.offsets = [(-1, 0), (0, -1), (0, 1), (1, 0)]

    def forward(self, x):
        y_pw = self.pw(x)
        y_dw = torch.zeros_like(x)
        for idx, (dp, dq) in enumerate(self.offsets):
            # torch.roll shifts the map so each pixel sees its (dp, dq) neighbor;
            # this wraps at the borders (circular rather than zero padding).
            y_dw = y_dw + self.dw_weights[:, idx].view(1, -1, 1, 1) * \
                torch.roll(x, shifts=(dp, dq), dims=(2, 3))
        return y_pw + y_dw
```
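A quick usage and shape check for the module above:

```python
x = torch.randn(2, 64, 32, 32)   # batch of 2, 64 channels, 32x32 feature maps
layer = LeanConv2d(64, 64)
y = layer(x)
print(y.shape)                   # torch.Size([2, 64, 32, 32])
# Parameter count: 64*64 pointwise + 64*4 stencil = 4352 weights,
# versus 9*64*64 = 36864 for a full 3x3 convolution.
print(sum(p.numel() for p in layer.parameters()))
```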
A major consideration is that actual hardware speedup may lag the reduction in FLOP count unless efficient fused implementations are available. On typical accelerators, a fused LeanConv kernel is observed to be 2–6× faster per layer than a serial "1×1 then depth-wise" stack at moderate channel counts (Ephrath et al., 2019).
7. Comparative Perspective and Practical Relevance
LeanConvNets provide a systematic pathway to sparsify standard CNN operators, outperforming baseline strategies such as grouped convolutions or depth-wise separable convolutions in several empirical settings. Unlike MobileNetV2 or ShuffleNetV2, which use serial depth-wise and point-wise layers, LeanConvNet fuses spatial and channel mixing additively; this increases expressive power per parameter at comparable computational cost (Ephrath et al., 2019).
Main advantages:
- Routine substitution for existing dense CNNs, without layerwise architectural redesign.
- Consistent accuracy under parameter and FLOP constraints, often exceeding other lightweight designs.
- Adaptive trade-off between spatial depth and channel coupling enabled by $g$ and stencil choice.
Limitations include modest accuracy losses on tasks requiring fine spatial–channel mixing, and a dependency on low-level implementation efficiency for realizing the theoretical speedups in latency (Ephrath et al., 2019).
In summary, LeanConvNets constitute a unifying and flexible framework for constructing low-cost CNNs by combining stenciled spatial filtering with pointwise channel fusion, backed by extensive empirical validation and practical implementations (Ephrath et al., 2019).