N2N-SCIP: Sparse DNN Pruning with Skip Connections
- The paper introduces N2N-SCIP, a framework combining single-shot network pruning with learnable neuron-to-neuron skip connections to preserve gradient flow in extremely sparse models.
- It enforces a fixed global sparsity budget by partitioning nonzero parameters equally between sequential weights and skip connections, ensuring controlled compression.
- Empirical results on CIFAR and ImageNet benchmarks demonstrate improved connectivity and top-1 accuracy, consistently outperforming pruning-only baselines, with the largest gains at the most extreme sparsity levels.
N2N-SCIP denotes a pruning-and-skip connection framework for learning highly sparse deep neural networks by combining single-shot network pruning at initialization with the integration of sparse, learnable neuron-to-neuron skip (N2NSkip) connections, while strictly maintaining a fixed global sparsity budget. Developed in the context of enhancing the connectivity and performance of extremely sparse pruned models, N2N-SCIP offers a rigorous algorithmic scheme for sampling, training, and analyzing such networks, supported by graph-theoretic connectivity metrics and large-scale empirical evaluation on standard benchmarks (Subramaniam et al., 2022).
1. Foundational Formulation and Pruning Regime
N2N-SCIP begins from a standard $L$-layer feedforward architecture parameterized by weight tensors $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ for $l = 1, \dots, L$, where $n_l$ is the neuron (or channel) count at layer $l$. A layer-wise density $d_l \in (0, 1]$ specifies the fraction of nonzero weights retained post-pruning. Pruning proceeds at initialization, directly imposing binary masks $M^{(l)} \in \{0, 1\}^{n_l \times n_{l-1}}$ such that

$$\tilde{W}^{(l)} = M^{(l)} \odot W^{(l)},$$

with $\odot$ denoting the elementwise product. The aggregate number of active (sequential, i.e., backbone) weights is

$$B_{\mathrm{seq}} = \sum_{l=1}^{L} \big\lVert M^{(l)} \big\rVert_0 = \sum_{l=1}^{L} d_l\, n_l\, n_{l-1}.$$
Pruning criteria may be random or based on connection sensitivity (e.g., SNIP), but N2N-SCIP requires only an initial mask—no iterative prune-retrain cycles are needed (Subramaniam et al., 2022).
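A minimal PyTorch sketch of this single-shot mask construction (the function name `prune_at_init` and the caller-supplied `scores` argument are illustrative assumptions; the paper prescribes only that the mask come from a random or connection-sensitivity criterion):

```python
import torch

def prune_at_init(weight: torch.Tensor, density: float, scores: torch.Tensor = None):
    """Return the masked weight M ⊙ W and the binary mask M retaining a
    `density` fraction of entries.

    With scores=None the mask is sampled uniformly at random (RP); passing
    SNIP-style connection sensitivities keeps the top-scoring entries (CSP).
    """
    k = max(1, int(density * weight.numel()))          # number of weights to keep
    flat = (torch.rand_like(weight) if scores is None else scores).flatten()
    keep = torch.topk(flat, k).indices                 # indices of retained weights
    mask = torch.zeros(weight.numel(), device=weight.device)
    mask[keep] = 1.0
    mask = mask.view_as(weight)                        # M^(l), same shape as W^(l)
    return weight * mask, mask
```

No prune-retrain cycles follow this step; the mask is computed once at initialization and held fixed.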
2. Neuron-to-Neuron Skip Connection Model
Beyond pruned sequential weights, N2N-SCIP introduces learnable skip weights $w^{\mathrm{skip}}_{ij}$, connecting any neuron $i$ in layer $l$ to any neuron $j$ in a deeper layer $m > l+1$. These are collected into a sparse tensor $W_{\mathrm{skip}}$, with sparsity enforced using binary skip masks $M_{\mathrm{skip}}$:

$$\tilde{W}_{\mathrm{skip}} = M_{\mathrm{skip}} \odot W_{\mathrm{skip}}.$$

In forward propagation, the pre-activation at layer $m$ generalizes to

$$z^{(m)} = \tilde{W}^{(m)} a^{(m-1)} + \sum_{l < m-1} \tilde{W}^{(l \to m)}_{\mathrm{skip}}\, a^{(l)},$$

where $a^{(l)} = \sigma\!\left(z^{(l)}\right)$ for nonlinearity $\sigma$.
This structure augments gradient pathways, addressing limitations of extreme pruning on information and gradient flow, especially at high compression ratios (Subramaniam et al., 2022).
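A compact PyTorch sketch of this generalized forward pass (the MLP form, class name, and the `(l, m) -> mask` dictionary layout are illustrative assumptions; the paper applies the scheme to VGG-19 and ResNet-50 backbones):

```python
import torch
import torch.nn as nn

class N2NSkipMLP(nn.Module):
    """Illustrative MLP whose pre-activation at layer m adds sparse
    neuron-to-neuron skip terms from earlier activations a^(l), l < m-1."""

    def __init__(self, sizes, skip_masks):
        super().__init__()
        # Backbone: sequential weights W^(m) (backbone pruning omitted for brevity).
        self.backbone = nn.ModuleList(
            [nn.Linear(sizes[m - 1], sizes[m], bias=False) for m in range(1, len(sizes))]
        )
        # skip_masks[(l, m)]: float {0,1} tensor of shape (sizes[m], sizes[l]).
        self.skip_masks = skip_masks
        self.skip_weights = nn.ParameterDict({
            f"{l}->{m}": nn.Parameter(0.01 * torch.randn(mask.shape))
            for (l, m), mask in skip_masks.items()
        })

    def forward(self, x):
        acts = [x]                                     # a^(0) = network input
        for m, layer in enumerate(self.backbone, start=1):
            z = layer(acts[m - 1])                     # sequential term W~^(m) a^(m-1)
            for (l, dst), mask in self.skip_masks.items():
                if dst == m:                           # add skip term from layer l to m
                    w_skip = self.skip_weights[f"{l}->{dst}"] * mask
                    z = z + acts[l] @ w_skip.t()
            acts.append(torch.relu(z) if m < len(self.backbone) else z)
        return acts[-1]
```

Because only the masked product $M_{\mathrm{skip}} \odot W_{\mathrm{skip}}$ enters the forward pass, pruned skip entries contribute no signal and receive zero gradient, so the skip tensor stays exactly as sparse as sampled.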
3. Sparsity Budgeting and Skip Sampling
A fixed total parameter budget $B$ is enforced. This is split between sequential and skip connections:

$$B = B_{\mathrm{seq}} + B_{\mathrm{skip}}.$$

Typically, $B_{\mathrm{seq}} = B_{\mathrm{skip}} = B/2$. Both sequential and skip masks are sampled (randomly or by importance) up-front to respect $B$, blocking all further growth of nonzeros after initialization.
Sampling ensures that skip connections do not inflate parameter count while enabling denser inter-layer connectivity, especially across distant layers, mitigating typical layerwise bottlenecks induced by pruning (Subramaniam et al., 2022).
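A sketch of the up-front sampling that respects the fixed budget $B$ (the helper names, the hypothetical layer widths, and the proportional allocation across layer pairs are assumptions, not prescriptions from the paper):

```python
import torch

def split_skip_budget(pair_sizes: dict, b_skip: int) -> dict:
    """Divide the skip budget across candidate (l, m) layer pairs in
    proportion to the number of possible edges n_l * n_m."""
    total = sum(pair_sizes.values())
    return {pair: int(b_skip * size / total) for pair, size in pair_sizes.items()}

def sample_skip_mask(n_src: int, n_dst: int, budget: int) -> torch.Tensor:
    """Randomly activate `budget` neuron-to-neuron skip edges from a layer
    with n_src neurons to a deeper layer with n_dst neurons."""
    mask = torch.zeros(n_dst, n_src)
    idx = torch.randperm(n_dst * n_src)[:budget]
    mask.view(-1)[idx] = 1.0
    return mask

# Equal split of a fixed global budget B between backbone and skip weights.
sizes = [3072, 512, 512, 256, 10]                     # hypothetical layer widths
B = 200_000
B_seq, B_skip = B // 2, B - B // 2                    # B = B_seq + B_skip
pairs = {(l, m): sizes[l] * sizes[m]
         for l in range(len(sizes)) for m in range(l + 2, len(sizes))}
skip_masks = {(l, m): sample_skip_mask(sizes[l], sizes[m], b)
              for (l, m), b in split_skip_budget(pairs, B_skip).items()}
```

The resulting `skip_masks` dictionary matches the layout assumed by the forward-pass sketch in Section 2; once sampled, no further nonzeros are added.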
4. Algorithmic Procedure and Training
N2N-SCIP operates in three phases:
- Phase I: Initialize the network, prune backbone weights using the selected criterion (random or SNIP), and compute the resulting masks $M^{(l)}$ and sequential budget $B_{\mathrm{seq}}$.
- Phase II: Sample $B_{\mathrm{skip}}$ skip edges among all admissible neuron pairs (source layer $l$, destination layer $m > l+1$), set the corresponding entries of $M_{\mathrm{skip}}$, and initialize the skip weights $W_{\mathrm{skip}}$.
- Phase III: Jointly train all remaining weights (sequential + skip) via SGD with momentum (default $0.9$), decaying the learning rate as standard. Nonzero entries of the weight and skip tensors are updated; masked entries remain stationary (see the sketch at the end of this section).
No dynamic rewiring is performed by default; masks remain fixed throughout training. Optionally, rewiring could be integrated as a periodic update scheme, but the vanilla regime keeps the allocation static for reproducibility and simplicity (Subramaniam et al., 2022).
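A minimal sketch of the Phase III loop, assuming the fixed masks are enforced by re-applying them after each optimizer step (the cosine decay and the `masks` dictionary keyed by parameter name are illustrative choices; the paper specifies only SGD with momentum and a standard decaying learning rate):

```python
import torch

def train_sparse(model, masks, loader, epochs=300, lr=0.05, momentum=0.9):
    """Jointly train backbone and skip weights while keeping pruned
    (masked-out) entries at exactly zero.

    masks: parameter name -> float {0,1} mask; parameters without an entry
    are treated as dense.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            with torch.no_grad():                      # masked entries remain stationary
                for name, p in model.named_parameters():
                    if name in masks:
                        p.mul_(masks[name])
        sched.step()
    return model
```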
5. Connectivity Analysis via Heat Diffusion
To objectively measure restoration of network connectivity, the pruned and skip-augmented network is modeled as a weighted undirected graph $G = (V, E)$ over all neurons, with adjacency matrix $A$ indexed such that

$$A_{ij} = \begin{cases} \lvert w_{ij} \rvert & \text{if neurons } i, j \text{ share an active sequential or skip weight } w_{ij}, \\ 0 & \text{otherwise}. \end{cases}$$

The graph Laplacian $\mathcal{L} = D - A$ is formed with degree matrix $D = \operatorname{diag}\!\big(\textstyle\sum_j A_{ij}\big)$. The solution to the continuous-time heat equation,

$$\frac{\partial u(t)}{\partial t} = -\mathcal{L}\, u(t),$$

is the heat kernel $H(t) = e^{-t\mathcal{L}} = \Phi\, e^{-t\Lambda}\, \Phi^{\top}$, where $\Phi$ diagonalizes $\mathcal{L} = \Phi \Lambda \Phi^{\top}$.

Using the initial layer as the heat source, the vector $h(t) = H(t)\, u_0$ (with $u_0$ the input indicator vector) yields a heat-diffusion signature. Connectivity deviation from the reference dense network is quantified by

$$\Delta = \big\lVert h_{\mathrm{sparse}}(t) - h_{\mathrm{dense}}(t) \big\rVert,$$

with smaller $\Delta$ indicating closer structural resemblance to the original graph. N2N-SCIP yields heat-diffusion deviations $1$–$3$ orders of magnitude smaller than pruning alone, quantitatively supporting restoration of backbone-like pathways (Subramaniam et al., 2022).
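A small NumPy sketch of this diagnostic, assuming the deviation is a norm of the difference between the two heat-diffusion signatures (dense adjacency matrices and the function names are illustrative; the analysis is offline-only):

```python
import numpy as np

def heat_signature(adj: np.ndarray, src: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Heat-diffusion signature H(t) u0 of a weighted undirected graph,
    with the input-layer indicator vector `src` as the heat source."""
    lap = np.diag(adj.sum(axis=1)) - adj               # Laplacian: degree minus adjacency
    evals, evecs = np.linalg.eigh(lap)                 # symmetric eigendecomposition, O(n^3)
    kernel = evecs @ np.diag(np.exp(-t * evals)) @ evecs.T   # heat kernel e^{-t L}
    return kernel @ src

def connectivity_deviation(adj_sparse, adj_dense, src, t=1.0) -> float:
    """Deviation between the pruned/skip-augmented graph and its dense reference."""
    return float(np.linalg.norm(heat_signature(adj_sparse, src, t)
                                - heat_signature(adj_dense, src, t)))
```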
6. Experimental Validation
N2N-SCIP, implemented in PyTorch, is evaluated on CIFAR-10, CIFAR-100, and ImageNet (ILSVRC’12) with VGG-19 and ResNet-50 architectures. Two pruning baselines are used: RP (random pruning at initialization) and CSP (connection-sensitivity pruning via SNIP at initialization). All models are trained for $300$ epochs using SGD with a learning rate of $0.05$, weight decay, and a batch size of $128$.
Performance is consistently superior with N2N skip connections (accuracy in %, at the indicated weight density):
| Method | CIFAR-10 (10%) | CIFAR-10 (5%) | CIFAR-10 (2%) | CIFAR-100 (10%) | CIFAR-100 (5%) | CIFAR-100 (2%) |
|---|---|---|---|---|---|---|
| RP | 92.08 | 89.43 | 86.52 | 71.23 | 69.82 | 55.43 |
| N2NSkip-RP | 92.92 | 92.65 | 91.12 | 72.67 | 72.13 | 61.21 |
| CSP | 92.79 | 92.14 | 90.35 | 72.83 | 71.92 | 59.92 |
| N2NSkip-CSP | 93.02 | 92.86 | 92.12 | 73.72 | 73.05 | 65.45 |
ImageNet (ResNet-50, top-1 accuracy in %, at 50%, 30%, and 20% density):
| Method | 50% | 30% | 20% |
|---|---|---|---|
| CSP | 73.42 | 70.42 | 68.67 |
| + N2NSkip-CSP | 74.59 | 72.89 | 72.09 |
| RP | 72.46 | 68.65 | 65.32 |
| + N2NSkip-RP | 74.12 | 71.19 | 70.03 |
Heat-diffusion connectivity deviations are correspondingly reduced, confirming that N2N-SCIP recovers functional and structural capacities lost to pruning (Subramaniam et al., 2022).
7. Practical Recommendations and Limitations
- Splitting the global sparsity budget equally ($B_{\mathrm{seq}} = B_{\mathrm{skip}} = B/2$) performs robustly across large sparsity regimes (5–50× compression).
- Skip weights should be initialized in Phase II and trained with the same learning schedule as the backbone weights; SGD with momentum $0.9$ is effective.
- Inference cost is not increased, since the skip tensor is as sparse as the backbone weights; however, an up-front cost for sampling masks is incurred. Heat-diffusion analyses require an eigendecomposition of the graph Laplacian (cubic in the number of neurons), but are offline-only.
- No prune-retrain cycles are mandated: N2N-SCIP is a single-shot initialization plus standard training regime.
- Rewiring skip masks dynamically is not the default but can be incorporated if saliency-guided adaptation is desired.
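For convenience, the settings reported in the experimental setup can be collected into an illustrative configuration (the dictionary itself is not part of the paper):

```python
# Training and budgeting defaults as reported in the experimental setup.
n2n_scip_defaults = {
    "epochs": 300,
    "optimizer": "SGD",
    "learning_rate": 0.05,
    "momentum": 0.9,
    "batch_size": 128,
    "budget_split": {"sequential": 0.5, "skip": 0.5},  # equal split of budget B
    "dynamic_rewiring": False,                         # masks stay fixed during training
}
```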
N2N-SCIP supplies a practical, reproducible method for restoring gradient and information pathways in extremely sparse networks by judicious allocation of neuron-to-neuron skip connections within a fixed sparsity constraint, yielding significant gains in both connectivity and predictive performance over baseline pruning (Subramaniam et al., 2022).