Stripe-Wise Pruning (SWP)
- Stripe-Wise Pruning (SWP) is a neural network compression method that decomposes convolutional filters into spatial stripes and prunes them independently for enhanced granularity.
- It employs a learnable Filter Skeleton with ℓ1 regularization to systematically identify and remove insignificant stripes while preserving structured dense computation.
- Experimental results on CIFAR-10 and ImageNet demonstrate that SWP achieves substantial parameter and FLOPs reductions with minimal accuracy drop compared to traditional pruning techniques.
Stripe-Wise Pruning (SWP) is a neural network compression technique that achieves a fine-grained, hardware-friendly reduction in convolutional model size by pruning individual spatial stripes within filters, as opposed to removing entire filters or unstructured individual weights. SWP introduces a learnable "Filter Skeleton" for each filter's spatial grid, enabling systematic identification and removal of stripes with minimal impact on accuracy. This technique offers a compression-accuracy trade-off superior to traditional filter pruning and preserves structured computation patterns compatible with standard hardware (Meng et al., 2020).
1. Motivation: Pruning Granularity and Hardware Compatibility
Traditional neural network pruning techniques fall into two principal categories: Weight Pruning (WP) and Filter Pruning (FP). WP removes individual weight elements based on importance criteria (e.g., magnitude), resulting in high parameter sparsity. However, the induced irregular patterns require specialized sparse libraries or hardware for efficient inference. In contrast, FP removes whole filters or input channels from a layer, thus maintaining dense kernel structures that general-purpose hardware exploits but at the cost of limited pruning granularity. Coarse units—entire filters—restrict maximum achievable compression before accuracy decay.
Stripe-Wise Pruning (SWP) bridges these paradigms by decomposing each convolutional filter into spatial stripes ($1 \times 1$ sub-filters spanning all input channels, one per kernel position) and pruning these stripes independently. For $K \times K$ kernels this yields $K^2$ times more candidate prune units than FP, enabling finer granularity while retaining the structured decomposition necessary for efficient dense inference.
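To make the granularity argument concrete, the following sketch counts the candidate prune units of a single convolutional layer under each scheme. The function name and the example layer size (64 filters, 64 input channels, 3×3 kernels) are illustrative assumptions, not values taken from the paper.

```python
def prune_unit_counts(num_filters: int, in_channels: int, kernel_size: int) -> dict:
    """Count candidate prune units in one conv layer for each granularity."""
    stripes_per_filter = kernel_size * kernel_size
    return {
        # FP: one unit per filter
        "filter_pruning": num_filters,
        # SWP: one unit per (filter, kernel row, kernel column)
        "stripe_pruning": num_filters * stripes_per_filter,
        # WP: one unit per scalar weight
        "weight_pruning": num_filters * in_channels * stripes_per_filter,
    }


counts = prune_unit_counts(num_filters=64, in_channels=64, kernel_size=3)
print(counts)
print("SWP / FP unit ratio:", counts["stripe_pruning"] // counts["filter_pruning"])
```

For a 3×3 layer the ratio is 9, which is the $K^2$ factor; weight pruning is finer still by a further factor of the input channel count, but gives up the dense structure.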
2. Mathematical Formulation: Filter Skeleton and Optimization
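For the $l$-th convolutional layer, with a weight tensor $W^l \in \mathbb{R}^{N \times C \times K \times K}$, SWP introduces a Filter Skeleton $I^l \in \mathbb{R}^{N \times K \times K}$ that holds one learnable scaling factor for each stripe of each filter. The masked weights are given by
$$\widetilde{W}^l_{n,c,i,j} = I^l_{n,i,j}\, W^l_{n,c,i,j},$$
where $n$ indexes filters, $c$ input channels, and $(i, j)$ spatial positions.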
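The training objective combines the standard supervised loss with an $\ell_1$ penalty on the Filter Skeletons:
$$\mathcal{L} = \sum_{(x,y)} \ell\big(f(x;\, W \odot I),\, y\big) + \lambda \sum_{l} \lVert I^l \rVert_1,$$
where $W \odot I$ denotes the Skeleton-masked weights (each $I^l$ broadcast over input channels) and $\lambda$ tunes the trade-off between model accuracy and stripe sparsity. The penalty term explicitly encourages minimal stripe usage by driving many Skeleton values toward zero. Both model weights and Skeletons are trained jointly via standard backpropagation.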
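The formulation above can be sketched in PyTorch as a convolution whose weights are scaled stripe-wise by a learnable tensor, plus an $\ell_1$ term added to the task loss. This is a minimal illustration under stated assumptions: the class name `SkeletonConv2d`, the initialization scale, the toy input, and the regularization value `lam` are all hypothetical and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkeletonConv2d(nn.Module):
    """Conv layer whose weights are scaled stripe-wise by a learnable Filter Skeleton.

    Hypothetical illustration; name and defaults are assumptions.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.padding = kernel_size // 2
        # W: (N, C, K, K), small Gaussian initialization (assumed)
        self.weight = nn.Parameter(
            0.1 * torch.randn(out_channels, in_channels, kernel_size, kernel_size)
        )
        # I: (N, 1, K, K); the singleton axis broadcasts one factor over all C channels
        self.skeleton = nn.Parameter(
            torch.ones(out_channels, 1, kernel_size, kernel_size)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Convolve with the masked weights W ⊙ I
        return F.conv2d(x, self.weight * self.skeleton, padding=self.padding)

    def l1_penalty(self) -> torch.Tensor:
        # Sum of absolute Skeleton values for this layer
        return self.skeleton.abs().sum()


torch.manual_seed(0)
layer = SkeletonConv2d(in_channels=3, out_channels=8)
x = torch.randn(4, 3, 16, 16)            # toy batch
target = torch.randint(0, 8, (4,))       # toy labels
lam = 1e-4                               # illustrative value only

logits = layer(x).mean(dim=(2, 3))       # global average pooling -> (4, 8)
loss = F.cross_entropy(logits, target) + lam * layer.l1_penalty()
loss.backward()                          # gradients reach both W and I
```

Storing the Skeleton with a singleton channel axis lets ordinary broadcasting apply one factor to all $C$ weights of a stripe, so no custom kernel is needed during training.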
3. SWP Training and Pruning Workflow
SWP proceeds in two main phases:
Phase A: Joint Training with Skeleton
- Initialize model weights (e.g., Gaussian).
- Initialize Filter Skeletons to unity.
- Train using the combined loss above; the masked weights $W \odot I$ are used in all convolutions.
- At convergence, the magnitude $|I^l_{n,i,j}|$ quantifies the relative importance of each stripe.
Phase B: Stripe Thresholding and Physical Pruning
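- For a given threshold $\delta$, prune any stripe $(n, i, j)$ of layer $l$ where $|I^l_{n,i,j}| < \delta$.
- The convolution layer is re-assembled to sum only over surviving stripes (written here for stride 1):
$$Y^l_{n,h,w} = \sum_{(i,j) \in S^l_n} \sum_{c=1}^{C} I^l_{n,i,j}\, W^l_{n,c,i,j}\, X^l_{c,\,h+i,\,w+j},$$
where $S^l_n$ is the set of retained stripes for filter $n$ and $X^l$ is the padded input of layer $l$.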
- Optionally, fine-tune the pruned network.
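This pipeline can be sketched as follows for a single layer; the toy shapes, data, and hyperparameter values are illustrative only:

```python
import torch
import torch.nn.functional as F

lam, delta, T = 1e-4, 0.05, 5                     # illustrative values
W = torch.randn(8, 3, 3, 3, requires_grad=True)   # weights (N, C, K, K), Gaussian init
I = torch.ones(8, 1, 3, 3, requires_grad=True)    # Filter Skeleton, one factor per stripe
opt = torch.optim.SGD([W, I], lr=0.01)
x, y = torch.randn(4, 3, 16, 16), torch.randint(0, 8, (4,))  # toy batch

for epoch in range(T):
    # Forward with W ⊙ I; L = CE + λ·sum(|I|)
    logits = F.conv2d(x, W * I, padding=1).mean(dim=(2, 3))
    loss = F.cross_entropy(logits, y) + lam * I.abs().sum()
    opt.zero_grad()
    loss.backward()                               # standard backprop for W and I
    opt.step()

keep = I.detach().abs() >= delta                  # stripes with |I| < δ are removed
W_pruned = (W * I).detach() * keep                # zeroed here; a real rebuild drops them
```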
4. Structured Inference and Compression Metrics
Unlike WP, which necessitates custom computation kernels for irregularly sparse weight matrices, SWP retains a structured convolutional computation: partial sums over each stripe are computed via dense kernels and then aggregated. Only the set of spatial positions over which each filter's convolution operates changes; no sparse matrix libraries are required. The index overhead (which stripes survive in each layer) remains negligible: $N \times K \times K$ binary flags per layer, versus $N \times C \times K \times K$ for WP. For typical channel counts $C$, this reduces metadata storage dramatically (<1% of model size).
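The claim that stripe-wise inference needs only dense kernels can be checked with a small sketch: each kernel position $(i, j)$ is evaluated as a dense $1 \times 1$ convolution on a shifted view of the padded input, restricted to the filters that kept that stripe, and the partial sums are accumulated. The function name `stripe_conv2d` and the random test tensors are assumptions made for illustration; this is not the authors' optimized layer, and it covers only stride 1 with "same" padding.

```python
import torch
import torch.nn.functional as F


def stripe_conv2d(x: torch.Tensor, weight: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Stride-1 'same' convolution evaluated stripe by stripe.

    x:      (B, C, H, W) input
    weight: (N, C, K, K) weights with the skeleton already folded in
    keep:   (N, K, K) boolean mask of surviving stripes
    """
    n_filters, _, k, _ = weight.shape
    batch, _, height, width = x.shape
    pad = k // 2
    x_pad = F.pad(x, (pad, pad, pad, pad))
    out = x.new_zeros(batch, n_filters, height, width)
    for i in range(k):
        for j in range(k):
            # Indices of filters that kept stripe (i, j)
            alive = keep[:, i, j].nonzero(as_tuple=True)[0]
            if alive.numel() == 0:
                continue
            # Input shifted by the kernel offset (i, j)
            window = x_pad[:, :, i:i + height, j:j + width]
            # Dense 1x1 kernels of shape (M, C, 1, 1) for the surviving filters
            stripe_w = weight[:, :, i, j][alive][:, :, None, None]
            out[:, alive] += F.conv2d(window, stripe_w)
    return out


torch.manual_seed(0)
x = torch.randn(2, 4, 8, 8)
weight = torch.randn(6, 4, 3, 3)
keep = torch.rand(6, 3, 3) > 0.5

# Reference: ordinary convolution with pruned stripes set to zero
reference = F.conv2d(x, weight * keep[:, None].to(weight.dtype), padding=1)
result = stripe_conv2d(x, weight, keep)
print(torch.allclose(result, reference, atol=1e-4))
```

The only bookkeeping the loop needs is the boolean `keep` tensor, which corresponds to the $N \times K \times K$ flags per layer mentioned above.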
Empirical results on CIFAR-10 and ImageNet show substantial parameter and FLOPs reductions:
| Model / Dataset | Params reduction | FLOPs reduction | Top-1 accuracy drop |
|---|---|---|---|
| VGG-16 / CIFAR-10 | 92.66% | 71.16% | 0.40% |
| ResNet-56 / CIFAR-10 | 77.7% | 75.6% | 0.12% |
| ResNet-18 / ImageNet | - | 54.58% | 0.17% (top-5: 0.04%) |
SWP thus enables compression ratios close to WP with hardware efficiency comparable to FP.
5. Experimental Evaluation and Comparative Analysis
Across multiple benchmarks, SWP achieves state-of-the-art compression at minimal accuracy cost. On CIFAR-10, pruning VGG-16 lowers accuracy from a 93.25% baseline to 92.85% while removing 92.66% of parameters and 71.16% of FLOPs; for ResNet-56, the 93.1% baseline drops to 92.98% with 77.7% parameter and 75.6% FLOPs savings. On ImageNet (ResNet-18), SWP reaches a 54.58% FLOPs reduction with a top-1 accuracy drop of only 0.17%.
Comparisons with prior FP and group-wise pruning approaches (L1, ThiNet, SFP, GAL, HRank, GBN) demonstrate that, for equivalent model size, SWP either retains higher accuracy or achieves greater compression.
6. Ablation Studies and Insights
Experiments isolating the impact of the Skeleton (optimizing only the Skeleton $I$ while keeping the weights $W$ fixed at their random initialization) show that filter shape alone carries a significant inductive bias (e.g., 79.83% accuracy for VGG-16 and 83.82% for ResNet-56 on CIFAR-10). Sensitivity studies varying $\lambda$ (regularization strength) and $\delta$ (pruning threshold) demonstrate stable network performance across a wide range; the table below reports ResNet-56 on CIFAR-10 for varying $\delta$ at a fixed regularization strength:
| δ | 0.01 | 0.03 | 0.05 | 0.07 | 0.09 |
|---|---|---|---|---|---|
| Params(M) | 0.45 | 0.34 | 0.21 | 0.16 | 0.12 |
| FLOPs(M) | 111.68 | 74.83 | 56.10 | 41.59 | 29.72 |
| Acc(%) | 93.25 | 92.82 | 92.98 | 92.43 | 91.83 |
SWP consistently outperforms group-wise (e.g., lasso-grouped) pruning, maintaining higher accuracy at high sparsity.
Visualization of filter shapes post-pruning reveals a trend: middle layers often reduce to sparse skeletons with few active stripes, while shallower layers retain more stripes—indicative of greater feature diversity.
7. Implementation, Deployment, and Open Problems
PyTorch implementations of SWP insert the Skeleton as a multiplicative mask on each filter's spatial grid. Following thresholding, custom “StripeConv” layers accumulate surviving stripes via grouped convolution, and the final pruned model is exportable as standard dense convolutions without recourse to specialized kernels. Skeleton-related parameter overhead is negligible compared to total model size.
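As a rough sketch of the thresholding step and of why the bookkeeping overhead stays small, the snippet below folds a Skeleton into the weights, derives the per-stripe keep flags, and counts what remains. The function names, the toy tensor sizes, and the threshold value are assumptions for illustration and do not reproduce the authors' code.

```python
import torch


def fold_and_threshold(weight: torch.Tensor, skeleton: torch.Tensor, delta: float):
    """Fold the skeleton into the weights and flag stripes with |I| >= delta as kept.

    weight: (N, C, K, K); skeleton: (N, K, K). Names are illustrative.
    """
    keep = skeleton.abs() >= delta                    # (N, K, K) boolean flags
    folded = weight * (skeleton * keep)[:, None]      # pruned stripes become exact zeros
    return folded, keep


def layer_stats(weight: torch.Tensor, keep: torch.Tensor) -> dict:
    """Count dense weights, surviving weights, and index flags for one layer."""
    n, c, k, _ = weight.shape
    return {
        "dense_params": n * c * k * k,
        "kept_params": int(keep.sum()) * c,           # each stripe holds C weights
        "index_bits": n * k * k,                      # one flag per stripe
    }


torch.manual_seed(0)
weight = torch.randn(4, 8, 3, 3)
skeleton = torch.ones(4, 3, 3)
skeleton[0] = 0.01          # every stripe of filter 0 falls below the threshold
skeleton[1, 0, :] = 0.0     # top row of filter 1 is pruned

folded, keep = fold_and_threshold(weight, skeleton, delta=0.05)
stats = layer_stats(weight, keep)
print(stats)
```

Because the Skeleton values are multiplied into the weights before export, the pruned layer carries no extra learnable parameters at inference time; only the boolean flags remain as metadata.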
A fixed, global pruning threshold is currently used, but adaptive, per-layer thresholds may enhance utility. The static Skeleton only captures the fixed filter shape; dynamic, input-dependent Skeletons, or integration with quantization or low-rank decomposition, represent open research directions (Meng et al., 2020).