
N2N-SCIP: Sparse DNN Pruning with Skip Connections

Updated 1 December 2025
  • The paper introduces N2N-SCIP, a framework combining single-shot network pruning with learnable neuron-to-neuron skip connections to preserve gradient flow in extremely sparse models.
  • It enforces a fixed global sparsity budget by partitioning nonzero parameters equally between sequential weights and skip connections, ensuring controlled compression.
  • Empirical results on CIFAR and ImageNet benchmarks demonstrate improved connectivity and top-1 accuracy, significantly outperforming standard pruning methods.

N2N-SCIP denotes a pruning-and-skip-connection framework for learning highly sparse deep neural networks: it combines single-shot network pruning at initialization with sparse, learnable neuron-to-neuron skip (N2NSkip) connections while strictly maintaining a fixed global sparsity budget. Developed to enhance the connectivity and performance of extremely sparse pruned models, N2N-SCIP offers a rigorous algorithmic scheme for sampling, training, and analyzing such networks, supported by graph-theoretic connectivity metrics and large-scale empirical evaluation on standard benchmarks (Subramaniam et al., 2022).

1. Foundational Formulation and Pruning Regime

N2N-SCIP begins from a standard $L$-layer feedforward architecture parameterized by weight tensors $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ for $i = 1, 2, \dots, L$, where $n_i$ is the neuron (or channel) count at layer $i$. A layer-wise density $\rho_i \in (0,1)$ specifies the fraction of nonzero weights retained post-pruning. Pruning proceeds at initialization, directly imposing binary masks $M_i \in \{0,1\}^{n_i \times n_{i-1}}$ such that

$$\|W_i \odot M_i\|_0 = \rho_i\,\|W_i\|_0,$$

with $\odot$ denoting the elementwise product. The aggregate number of active (sequential, i.e., backbone) weights is

$$S_{\text{seq}} = \sum_{i=1}^{L} \|W_i \odot M_i\|_0.$$

Pruning criteria may be random or based on connection sensitivity (e.g., SNIP), but N2N-SCIP requires only an initial mask—no iterative prune-retrain cycles are needed (Subramaniam et al., 2022).
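
As a concrete illustration of this single-shot masking step, the sketch below builds a binary mask for one layer at a chosen density. It is a minimal example, not the paper's implementation: the function name is invented, and the magnitude criterion is a crude stand-in for a SNIP-style connection-sensitivity score (which would normally be computed from gradients on a mini-batch).

```python
import torch

def prune_at_init(weight: torch.Tensor, density: float, criterion: str = "random") -> torch.Tensor:
    """Return a binary mask M with ||W ⊙ M||_0 ≈ density * numel(W).

    criterion="random" mimics RP; criterion="magnitude" keeps the largest |w|
    entries as a simple stand-in for a sensitivity-based score such as SNIP.
    """
    n_keep = max(1, int(round(density * weight.numel())))
    scores = torch.rand_like(weight) if criterion == "random" else weight.abs()
    keep_idx = torch.topk(scores.flatten(), n_keep).indices
    mask = torch.zeros(weight.numel(), device=weight.device)
    mask[keep_idx] = 1.0
    return mask.view_as(weight)

# Example: one 256x512 layer pruned to 10% density at initialization.
W = torch.randn(256, 512)
M = prune_at_init(W, density=0.10)
W_sparse = W * M                # W ⊙ M
layer_nonzeros = int(M.sum())   # this layer's contribution to S_seq
```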

2. Neuron-to-Neuron Skip Connection Model

Beyond pruned sequential weights, N2N-SCIP introduces learnable skip weights $\omega_{u \to v}$, connecting any neuron $u$ in layer $i$ to any neuron $v$ in a deeper layer $j > i$. These are collected into a sparse tensor

$$\Omega = \{\omega_{u \to v}\} \quad \text{for } (u, i) \to (v, j),\ i < j,$$

with sparsity enforced using binary skip masks $M^{\text{skip}}_{i \to j} \in \{0,1\}^{n_j \times n_i}$: $\omega_{u \to v}$ is active iff $M^{\text{skip}}_{i \to j}[v,u] = 1$. In forward propagation, the pre-activation at layer $j$ generalizes to

$$z^j = W_j(a^{j-1}) + \sum_{i<j} \left(\Omega_{i \to j} \odot M^{\text{skip}}_{i \to j}\right)(a^i),$$

where $a^j = g(z^j)$ for nonlinearity $g$.

This structure augments gradient pathways, addressing limitations of extreme pruning on information and gradient flow, especially at high compression ratios (Subramaniam et al., 2022).
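
The forward rule above can be made concrete with a small masked MLP in PyTorch (the framework used in the paper's experiments). This is a sketch under stated assumptions: the class name `N2NSkipMLP`, the dense-storage-plus-mask representation, and the single skip block in the usage example are illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class N2NSkipMLP(nn.Module):
    """Masked MLP with learnable neuron-to-neuron skip weights from earlier activations,
    implementing z^j = W_j(a^{j-1}) + sum_{i<j} (Omega_{i->j} ⊙ M_{i->j})(a^i).
    """

    def __init__(self, sizes, seq_masks, skip_masks):
        super().__init__()
        # sizes = [n_0, ..., n_L]; seq_masks[j-1] masks W_j; skip_masks[(i, j)] has shape (n_j, n_i).
        self.layers = nn.ModuleList(
            [nn.Linear(sizes[i], sizes[i + 1], bias=False) for i in range(len(sizes) - 1)]
        )
        self.seq_masks = seq_masks
        self.skip_masks = skip_masks
        self.skip_weights = nn.ParameterDict({
            f"skip_{i}_{j}": nn.Parameter(0.01 * torch.randn_like(m))  # N(0, 0.01^2) init
            for (i, j), m in skip_masks.items()
        })

    def forward(self, x):
        acts = [x]  # a^0 is the network input
        for j, layer in enumerate(self.layers, start=1):
            z = F.linear(acts[j - 1], layer.weight * self.seq_masks[j - 1])  # W_j ⊙ M_j
            for (i, jj), m in self.skip_masks.items():
                if jj == j:  # skip edges terminating at layer j
                    z = z + F.linear(acts[i], self.skip_weights[f"skip_{i}_{j}"] * m)
            acts.append(torch.relu(z) if j < len(self.layers) else z)  # a^j = g(z^j)
        return acts[-1]

# Example: a 784-300-100-10 MLP with one sparse skip block from layer 1 to layer 3.
sizes = [784, 300, 100, 10]
seq_masks = [(torch.rand(sizes[k + 1], sizes[k]) < 0.05).float() for k in range(len(sizes) - 1)]
skip_masks = {(1, 3): (torch.rand(10, 300) < 0.05).float()}
model = N2NSkipMLP(sizes, seq_masks, skip_masks)
out = model(torch.randn(32, 784))  # shape (32, 10)
```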

3. Sparsity Budgeting and Skip Sampling

A fixed total parameter budget $S_{\text{total}} = \rho_{\text{global}} \sum_i \|W_i\|_0$ is enforced. This budget is split between sequential and skip connections:

$$S_{\text{seq}} + S_{\text{skip}} = S_{\text{total}}, \qquad S_{\text{skip}} = \|\Omega \odot M^{\text{skip}}\|_0 = \sum_{i<j} \|M^{\text{skip}}_{i \to j}\|_0.$$

Typically, $\rho_{\text{seq}} = \rho_{\text{skip}} = \tfrac{1}{2}\rho_{\text{global}}$. Both sequential and skip masks are sampled (randomly or by importance) up front to respect $S_{\text{total}}$, blocking all further growth of nonzeros after initialization.

Sampling ensures that skip connections do not inflate parameter count while enabling denser inter-layer connectivity, especially across distant layers, mitigating typical layerwise bottlenecks induced by pruning (Subramaniam et al., 2022).
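
A minimal sketch of the budgeting and sampling step follows, assuming an equal split and uniform random skip sampling. The helper names (`split_budget`, `sample_skip_masks`) and the capacity-proportional allocation across layer pairs are illustrative assumptions; the paper only requires that the sampled masks respect $S_{\text{total}}$.

```python
import torch

def split_budget(layer_shapes, rho_global):
    """Return (S_total, S_seq, S_skip) under the equal split rho_seq = rho_skip = rho_global / 2."""
    total_params = sum(out_dim * in_dim for (out_dim, in_dim) in layer_shapes)
    S_total = int(rho_global * total_params)
    S_seq = S_total // 2
    S_skip = S_total - S_seq
    return S_total, S_seq, S_skip

def sample_skip_masks(layer_sizes, S_skip):
    """Randomly sample S_skip neuron-to-neuron skip edges over all (i < j) layer pairs.

    layer_sizes = [n_0, ..., n_L]; returns {(i, j): binary mask of shape (n_j, n_i)}.
    Rounding may leave the total off by a few edges; a full implementation would rebalance.
    """
    pairs = [(i, j) for i in range(len(layer_sizes)) for j in range(i + 1, len(layer_sizes))]
    capacities = torch.tensor([layer_sizes[j] * layer_sizes[i] for (i, j) in pairs], dtype=torch.float)
    # Allocate the skip budget across layer pairs proportionally to their capacity.
    alloc = (capacities / capacities.sum() * S_skip).round().long()
    masks = {}
    for (i, j), k in zip(pairs, alloc.tolist()):
        m = torch.zeros(layer_sizes[j], layer_sizes[i])
        idx = torch.randperm(m.numel())[:k]
        m.view(-1)[idx] = 1.0
        masks[(i, j)] = m
    return masks

# Example: rho_global = 10% over a 784-300-100-10 MLP.
shapes = [(300, 784), (100, 300), (10, 100)]
S_total, S_seq, S_skip = split_budget(shapes, rho_global=0.10)
skip_masks = sample_skip_masks([784, 300, 100, 10], S_skip)
```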

4. Algorithmic Procedure and Training

N2N-SCIP operates in three phases:

  • Phase I: Initialize the network, prune backbone weights using the selected criterion, and compute $S_{\text{seq}}$.
  • Phase II: Sample $S_{\text{skip}}$ skip edges among all possible $(u, v, i, j)$ pairs with $i < j$, set the corresponding mask entries, and initialize skip weights $\omega_{u \to v} \sim \mathcal{N}(0, \sigma^2)$.
  • Phase III: Jointly train all remaining weights (sequential + skip) via SGD with momentum (default $0.9$), decaying the learning rate as standard. Only nonzeros in the weight and skip tensors are updated; masked entries stay at zero, as sketched below.

No dynamic rewiring is performed by default; masks remain fixed throughout training. Optionally, rewiring could be integrated as a periodic update scheme, but the vanilla regime keeps the allocation static for reproducibility and simplicity (Subramaniam et al., 2022).
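
A minimal sketch of the Phase III update loop, assuming a module shaped like the `N2NSkipMLP` sketch above. Re-applying the fixed masks after every optimizer step keeps pruned entries at zero, realizing the static (no-rewiring) regime; the cosine schedule is an arbitrary stand-in for the unspecified learning-rate decay.

```python
import torch

def train_masked(model, seq_masks, skip_masks, loader, epochs=300, lr=0.05,
                 momentum=0.9, weight_decay=5e-4):
    """Phase III sketch: joint SGD over sequential and skip weights with masks held fixed."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum,
                          weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)  # stand-in decay schedule
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            with torch.no_grad():  # keep the sparsity pattern static: re-zero masked entries
                for m, layer in zip(seq_masks, model.layers):
                    layer.weight.mul_(m)
                for (i, j), m in skip_masks.items():
                    model.skip_weights[f"skip_{i}_{j}"].mul_(m)
        sched.step()
```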

5. Connectivity Analysis via Heat Diffusion

To objectively measure restoration of network connectivity, the pruned and skip-augmented network is modeled as a weighted undirected graph $G = (V, E)$, with adjacency matrix $A$ indexed such that

$$A_{uv} = \begin{cases} |W_i[u,v]| & \text{if } u \to v \text{ is a sequential connection} \\ |\omega_{u \to v}| & \text{if } u \to v \text{ is an N2N skip connection} \\ 0 & \text{otherwise.} \end{cases}$$

The graph Laplacian $L = D - A$ is formed with $D_{uu} = \sum_v A_{uv}$. The solution to the continuous-time heat equation,

$$\frac{dH(t)}{dt} = -L\,H(t), \qquad H(0) = I_{n \times n},$$

is the heat kernel $H(t) = \exp(-Lt) = U \exp(-\Lambda t)\,U^\top$, where $U \Lambda U^\top$ diagonalizes $L$.

Using the initial layer as the heat source, the vector $s(t) = H(t)\,a$ (with $a$ the input indicator vector) yields a heat-diffusion signature. Connectivity deviation from the reference dense network is quantified by

$$F = \|s_{\text{ref}}(t) - s_p(t)\|_2,$$

with smaller $F$ indicating closer structural resemblance to the original graph. N2N-SCIP yields heat-diffusion deviations $1$–$3$ orders of magnitude smaller than pruning alone, quantitatively supporting restoration of backbone-like pathways (Subramaniam et al., 2022).
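
The connectivity metric can be reproduced offline with a few lines of NumPy. The sketch below assumes the weighted adjacency matrix $A$ has already been assembled from the $|W_i[u,v]|$ and $|\omega_{u \to v}|$ entries above, and treats the diffusion time $t$ as a free parameter.

```python
import numpy as np

def heat_diffusion_signature(A: np.ndarray, source: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Heat-kernel response s(t) = H(t) a for a weighted undirected adjacency matrix A.

    H(t) = exp(-Lt) is computed by eigendecomposition of the graph Laplacian L = D - A
    (symmetric, so np.linalg.eigh applies); `source` is the indicator vector a of the
    input-layer nodes. O(n^3), offline-only, as noted in the text.
    """
    L = np.diag(A.sum(axis=1)) - A
    eigvals, U = np.linalg.eigh(L)
    H = U @ np.diag(np.exp(-eigvals * t)) @ U.T
    return H @ source

def connectivity_deviation(A_ref, A_pruned, source, t: float = 1.0) -> float:
    """F = ||s_ref(t) - s_p(t)||_2 between the dense reference and a pruned/skip-augmented graph."""
    s_ref = heat_diffusion_signature(A_ref, source, t)
    s_p = heat_diffusion_signature(A_pruned, source, t)
    return float(np.linalg.norm(s_ref - s_p))
```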

6. Experimental Validation

N2N-SCIP, implemented in PyTorch, is evaluated on CIFAR-10, CIFAR-100, and ImageNet (ILSVRC’12) with VGG-19 and ResNet-50 architectures. Two pruning baselines are used: RP (Random Pruning at initialization) and CSP (SNIP pruning at initialization). All models are trained for $300$ epochs using SGD with learning rate $0.05$, weight decay $5 \times 10^{-4}$, and batch size $128$.

Performance is consistently superior with N2N skip connections:

| Method | CIFAR-10 (10%) | CIFAR-10 (5%) | CIFAR-10 (2%) | CIFAR-100 (10%) | CIFAR-100 (5%) | CIFAR-100 (2%) |
|---|---|---|---|---|---|---|
| RP | 92.08 | 89.43 | 86.52 | 71.23 | 69.82 | 55.43 |
| RP + N2NSkip-RP | 92.92 | 92.65 | 91.12 | 72.67 | 72.13 | 61.21 |
| CSP | 92.79 | 92.14 | 90.35 | 72.83 | 71.92 | 59.92 |
| CSP + N2NSkip-CSP | 93.02 | 92.86 | 92.12 | 73.72 | 73.05 | 65.45 |

ImageNet (ResNet-50, top-1 accuracy) at 50%, 30%, and 20% density:

| Method | 50% | 30% | 20% |
|---|---|---|---|
| CSP | 73.42 | 70.42 | 68.67 |
| CSP + N2NSkip-CSP | 74.59 | 72.89 | 72.09 |
| RP | 72.46 | 68.65 | 65.32 |
| RP + N2NSkip-RP | 74.12 | 71.19 | 70.03 |

Heat-diffusion connectivity deviations $F$ are correspondingly reduced, confirming that N2N-SCIP recovers functional and structural capacities lost to pruning (Subramaniam et al., 2022).

7. Practical Recommendations and Limitations

  • Splitting global sparsity equally ($\rho_{\text{seq}} = \rho_{\text{skip}} = \frac{1}{2}\rho$) performs robustly across large sparsity regimes (5–50× compression); a worked example follows this list.
  • Skip weights should be initialized as $\omega_{u \to v} \sim \mathcal{N}(0, 0.01^2)$ and trained with the same learning schedule as backbone weights; SGD with momentum $0.9$ is effective.
  • Inference cost is not increased, since the skip matrix is as sparse as the backbone weights; however, an up-front cost for sampling masks is incurred. Heat-diffusion analyses require $O(n^3)$ computation but are offline-only.
  • No prune-retrain cycles are mandated: N2N-SCIP is a single-shot initialization plus standard training regime.
  • Rewiring skip masks dynamically is not the default but can be incorporated if saliency-guided adaptation is desired.
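
As a worked instance of the equal-split recommendation (with the dense parameter count chosen purely for illustration): for a network with $10^7$ dense weights compressed $20\times$, $\rho_{\text{global}} = 0.05$ and

$$S_{\text{total}} = 0.05 \times 10^7 = 5 \times 10^5, \qquad S_{\text{seq}} = S_{\text{skip}} = \tfrac{1}{2} S_{\text{total}} = 2.5 \times 10^5,$$

so the backbone and the skip tensor each retain $2.5\%$ of the original parameter count.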

N2N-SCIP supplies a practical, reproducible method for restoring gradient and information pathways in extremely sparse networks by judicious allocation of neuron-to-neuron skip connections within a fixed sparsity constraint, yielding significant gains in both connectivity and predictive performance over baseline pruning (Subramaniam et al., 2022).
