PWMPR: Sparse-to-Dense Growth in Neural Nets
- PWMPR is a sparse-to-dense training paradigm that grows subnetworks from a sparse seed using NTK-inspired, path-based edge scoring.
- It employs an L1-norm surrogate score and randomized sampling to add promising edges while avoiding bottlenecks and preserving connectivity.
- The method achieves competitive accuracy on vision benchmarks at a lower cumulative training cost than traditional pruning approaches.
Path Weight Magnitude Product-biased Random growth (PWMPR) is a constructive sparse-to-dense training paradigm for neural networks that systematically grows subnetworks from an initial sparse seed using an NTK-inspired, path-based edge selection criterion. PWMPR is designed to automatically discover optimal operating densities for sparse neural networks, in contrast to prevailing pruning-based approaches that assume a fixed target density and often incur higher computational costs. By leveraging a path weight magnitude product score and randomized sampling, PWMPR achieves efficient density discovery and competitive accuracy on standard vision benchmarks at a lower cumulative training cost.
1. Mathematical Formulation and Edge Selection Criterion
PWMPR operates on a feed-forward neural network represented by scalar weights $\theta_{ij}$ assigned to edges $(i, j)$. A path $p$ is defined as a sequence of edges connecting an input node to an output node. The path’s weight-product is expressed as
$\pi_p(\theta) = \prod_{(i, j) \in p} \theta_{ij}.$
Gebhart et al. demonstrated that the Neural Tangent Kernel (NTK) of such a network can be decomposed into a “path kernel,” with its trace given by
$\Tr(\Pi_\theta) = \sum_p \sum_{(i, j)\in p} \left(\frac{\pi_p(\theta)}{\theta_{ij}}\right)^2.$
The incremental contribution to the trace by adding a new (zero-initialized) edge is
$\Delta\Tr(\Pi_\theta)_{(i, j)} = \sum_{p \ni (i, j)} \left(\prod_{(u, v) \in p \setminus \{(i, j)\}} \theta_{uv}\right)^2,$
but exact computation is intractable.
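To make the incremental-trace formula concrete, the following minimal Python sketch evaluates it by brute force on a tiny layered network. The nested-list weight layout, the function name `delta_trace`, and the example values are illustrative assumptions rather than part of PWMPR; absent edges are simply stored as zero weights. Because the loop visits every input-to-output node sequence, its cost grows with the product of the layer widths, which is why this direct approach does not scale.

```python
from itertools import product

def delta_trace(weights, layer, i, j):
    """Brute-force sum, over all input-to-output paths through the candidate
    edge (layer, i, j), of the squared product of the other edge weights.

    weights[l][a][b] is the weight from node a in layer l to node b in
    layer l + 1 (0.0 for an absent edge). Hypothetical layout for illustration.
    """
    sizes = [len(weights[0])] + [len(w[0]) for w in weights]
    total = 0.0
    # Enumerate one node per layer; each tuple describes one potential path.
    for nodes in product(*(range(n) for n in sizes)):
        if nodes[layer] != i or nodes[layer + 1] != j:
            continue  # path does not use the candidate edge
        prod = 1.0
        for l in range(len(weights)):
            if l != layer:  # skip the candidate edge itself
                prod *= weights[l][nodes[l]][nodes[l + 1]]
        total += prod ** 2
    return total

# Toy network: 2 inputs -> 2 hidden -> 1 output, with two absent first-layer edges.
weights = [
    [[0.5, 0.0],
     [0.0, -2.0]],
    [[1.0],
     [3.0]],
]
gain_a = delta_trace(weights, 0, 0, 1)  # only path continues over weight 3.0 -> 9.0
gain_b = delta_trace(weights, 0, 1, 0)  # only path continues over weight 1.0 -> 1.0
```

In this toy case the candidate edge that feeds the hidden node with the larger outgoing weight receives the larger gain.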
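PWMPR therefore introduces an $\ell_1$-norm surrogate, the Path Weight Magnitude Product (PWMP) score:
$\mathrm{PWMP}(i, j) = \sum_{p \ni (i, j)} \prod_{(u, v) \in p \setminus \{(i, j)\}} |\theta_{uv}|.$
Operationally, two node-level scores, “complexity” (forward) and “generality” (backward), are computed in one forward and one backward pass using the network’s weight magnitudes and an all-ones input; the score of a candidate edge $(i, j)$ is then the product of the complexity of $i$ and the generality of $j$. Candidate edges not in the existing mask $M$ are assigned sampling probabilities
$P(i, j) = \frac{\mathrm{PWMP}(i, j)}{\sum_{(u, v) \notin M} \mathrm{PWMP}(u, v)},$
and grown by random sampling proportional to $\mathrm{PWMP}(i, j)$, balancing the focus on promising paths with architectural diversity.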
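The two-pass computation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors’ implementation: the network is a layered MLP stored as nested lists, the names `pwmp_scores` and `sampling_probabilities` are hypothetical, and the edge score is taken to be the forward “complexity” of the source node times the backward “generality” of the target node, both computed on weight magnitudes with an all-ones input.

```python
def pwmp_scores(weights, masks):
    """Return {(layer, i, j): score} for every edge absent from the mask.

    weights[l][i][j] and masks[l][i][j] describe the edge from node i in
    layer l to node j in layer l + 1 (mask entry 1 = present, 0 = absent).
    """
    n_layers = len(weights)
    # Forward pass on |weights| with an all-ones input: node "complexity".
    complexity = [[1.0] * len(weights[0])]
    for l in range(n_layers):
        prev = complexity[-1]
        complexity.append([
            sum(prev[i] * abs(weights[l][i][j]) * masks[l][i][j]
                for i in range(len(prev)))
            for j in range(len(weights[l][0]))
        ])
    # Backward pass from all-ones outputs: node "generality".
    generality = [None] * (n_layers + 1)
    generality[n_layers] = [1.0] * len(weights[-1][0])
    for l in reversed(range(n_layers)):
        nxt = generality[l + 1]
        generality[l] = [
            sum(abs(weights[l][i][j]) * masks[l][i][j] * nxt[j]
                for j in range(len(nxt)))
            for i in range(len(weights[l]))
        ]
    # Score each missing edge by (paths into i) x (paths out of j).
    return {
        (l, i, j): complexity[l][i] * generality[l + 1][j]
        for l in range(n_layers)
        for i in range(len(weights[l]))
        for j in range(len(weights[l][0]))
        if not masks[l][i][j]
    }

def sampling_probabilities(scores):
    """Normalize scores to probabilities; fall back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        return {e: 1.0 / len(scores) for e in scores}
    return {e: s / total for e, s in scores.items()}

# Toy network: 2 inputs -> 2 hidden -> 1 output, two missing first-layer edges.
weights = [[[0.5, 0.0], [0.0, -2.0]], [[1.0], [3.0]]]
masks = [[[1, 0], [0, 1]], [[1], [1]]]
scores = pwmp_scores(weights, masks)    # {(0, 0, 1): 3.0, (0, 1, 0): 1.0}
probs = sampling_probabilities(scores)  # {(0, 0, 1): 0.75, (0, 1, 0): 0.25}
```

The factorization into a forward and a backward quantity is what keeps the cost at two passes regardless of how many paths cross a candidate edge.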
2. Constructive Sparse-to-Dense Growth: Operating Principles
The PWMPR workflow follows a strictly constructive paradigm:
A. Initialization: Begin from a sparse seed mask produced by PHEW with initial density chosen to ensure connectivity (i.e., avoiding isolated nodes).
B. Iterative Growth Schedule: At each iteration $t$, starting from mask $M_t$ with density $d_t$, train for a short “rough” phase covering a fixed fraction of the full training budget, then add new edges equal to a fixed fraction of the currently active edges (an exponential density schedule), initializing the new weights to zero.
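C. Stopping Criterion: Maintain tuples $(d_t, a_t)$ recording the density and validation accuracy at each iteration $t$. Accuracies are regressed against density with a logistic function, and growth is halted at the earliest iteration (equivalently, the smallest density) for which the fitted curve meets the stopping threshold. This criterion signals the effective saturation of accuracy with respect to network density.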
At termination, a final “extensive” training phase is conducted either from scratch or continuing from the stopped mask.
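One possible reading of the stopping rule is sketched below in plain Python. Several details are assumptions made only for illustration: the logistic is parameterized as `amp / (1 + exp(-k * (d - d0)))`, it is fitted by a coarse grid search over `k` and `d0` with the amplitude solved in closed form, and growth stops at the smallest recorded density whose fitted accuracy reaches 95% of the fitted plateau. The source does not specify the parameterization or the fitting procedure, and the function names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(densities, accuracies, k_grid, d0_grid):
    """Least-squares fit of amp * sigmoid(k * (d - d0)) by grid search.

    For each (k, d0) the best amplitude has a closed form, so only two
    parameters need to be searched. Returns (amp, k, d0).
    """
    best = None
    for k in k_grid:
        for d0 in d0_grid:
            s = [sigmoid(k * (d - d0)) for d in densities]
            amp = sum(a * v for a, v in zip(accuracies, s)) / sum(v * v for v in s)
            err = sum((a - amp * v) ** 2 for a, v in zip(accuracies, s))
            if best is None or err < best[0]:
                best = (err, amp, k, d0)
    return best[1:]

def stop_density(densities, accuracies, k_grid, d0_grid, threshold=0.95):
    """Smallest recorded density whose fitted accuracy is at least
    `threshold` times the fitted plateau, or None if none qualifies."""
    amp, k, d0 = fit_logistic(densities, accuracies, k_grid, d0_grid)
    for d in sorted(densities):
        if amp * sigmoid(k * (d - d0)) >= threshold * amp:
            return d
    return None

# Synthetic (density, accuracy) history generated from a known logistic curve.
densities = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
accuracies = [0.9 * sigmoid(20 * (d - 0.1)) for d in densities]
k_grid = [5, 10, 20, 40]
d0_grid = [0.0, 0.05, 0.1, 0.15, 0.2]
amp, k, d0 = fit_logistic(densities, accuracies, k_grid, d0_grid)
stop = stop_density(densities, accuracies, k_grid, d0_grid)
```

In practice a continuous optimizer would replace the grid search; the grid is used here only to keep the sketch dependency-free and deterministic.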
3. Theoretical Motivation: NTK and Bottleneck Avoidance
PWMPR is theoretically motivated by the functional form of the NTK in overparameterized networks. The trace $\Tr(\Pi_\theta)$ measures the global sum of squared path-derivatives and thus encodes average curvature in parameter space. Because the trace equals the sum of the kernel’s eigenvalues, adding edges with high $\Delta\Tr$ raises the NTK spectrum in aggregate, which is expected to accelerate convergence during training.
Since explicit computation of $\Delta\Tr$ is cubic in path count, PWMPR leverages the tractable $\ell_1$-norm surrogate. Crucially, naïvely optimizing the score (e.g., greedily adding top-scoring edges) leads to bottleneck structures that concentrate edges on a few nodes, which can degrade generalization. Random sampling proportional to the PWMP score (rather than deterministic selection) empirically mitigates bottlenecks, maintaining higher average weighted $k$-core connectivity during growth and improving path coverage and network robustness.
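The contrast between greedy and score-proportional selection can be illustrated with a small Python sketch. The edge names and scores below are invented for the example, and sampling without replacement is implemented as sequential draws with `random.choices`, which is one simple option and not necessarily what the authors use.

```python
import random

def greedy_edges(scores, k):
    """Deterministic baseline: take the k highest-scoring edges."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def sample_edges(scores, k, rng):
    """Draw k distinct edges, each draw proportional to the remaining scores."""
    remaining = dict(scores)
    chosen = []
    for _ in range(min(k, len(remaining))):
        edges = list(remaining)
        weights = [remaining[e] for e in edges]
        if sum(weights) > 0:
            pick = rng.choices(edges, weights=weights, k=1)[0]
        else:
            pick = rng.choice(edges)  # all scores zero: fall back to uniform
        chosen.append(pick)
        del remaining[pick]
    return chosen

# Hypothetical candidate edges (source, target); scores favor target "h0".
scores = {
    ("a", "h0"): 9.0, ("b", "h0"): 8.0, ("c", "h0"): 7.0,
    ("a", "h1"): 3.0, ("b", "h1"): 2.0, ("c", "h2"): 1.0,
}
greedy = greedy_edges(scores, 3)   # all three edges land on "h0"
rng = random.Random(0)
sampled = sample_edges(scores, 3, rng)
```

Greedy selection places every new edge on the same target node in this toy case, whereas repeated sampling still favors high-scoring edges but regularly reaches the other nodes.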
4. Algorithmic Description
The PWMPR algorithm proceeds as follows:
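- Initialize with a sparse seed mask $M_0$, ensuring that $M_0$ contains no isolated nodes.
- Repeat for $t = 0, 1, 2, \dots$ until the stopping threshold is met:
  - Train on the current mask for the rough-phase number of steps.
  - Compute the node-level complexity and generality scores via two sparse passes.
  - For all missing edges, assign a sampling probability proportional to the PWMP score.
  - Randomly sample new edges (without replacement) according to these probabilities.
  - Add the sampled edges to the mask and initialize their weights to zero.
  - Record the density and validation accuracy, refit the logistic model, and check whether the stopping threshold is met.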
- Final Phase: Retrain comprehensively on the stopped mask.
Key hyperparameters are the growth ratio, the rough-phase fraction, and a 95% logistic-fit threshold.
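The control flow of the growth loop can be sketched as a small Python driver. Everything about the interface is a hypothetical choice made for illustration: the mask is represented implicitly as the keys of a `weights` dictionary, and training, scoring, evaluation, and the stopping test are passed in as callables. The sketch also assumes that accuracy is recorded right after the rough training phase and that the loop stops before growing further; the source does not pin down this ordering.

```python
import random

def pwmpr_growth(weights, all_edges, train_fn, score_fn, eval_fn,
                 should_stop_fn, growth_ratio, rng, max_iters=100):
    """Grow `weights` (dict: edge -> value) in place.

    Returns the recorded history of (density, validation accuracy) pairs.
    """
    history = []
    for _ in range(max_iters):
        train_fn(weights)  # rough training phase on the current mask
        history.append((len(weights) / len(all_edges), eval_fn(weights)))
        missing = [e for e in all_edges if e not in weights]
        if should_stop_fn(history) or not missing:
            break
        scores = score_fn(weights, missing)  # path-based scores for missing edges
        # Add a fixed fraction of the currently active edges (at least one).
        n_new = min(len(missing), max(1, int(growth_ratio * len(weights))))
        for _ in range(n_new):
            pool = [e for e in missing if e not in weights]
            pool_scores = [scores[e] for e in pool]
            if sum(pool_scores) > 0:
                pick = rng.choices(pool, weights=pool_scores, k=1)[0]
            else:
                pick = rng.choice(pool)
            weights[pick] = 0.0  # new edges start at zero
    return history

# Toy run with stand-in callables (no real training happens here).
all_edges = [(i, j) for i in range(4) for j in range(5)]  # 20 possible edges
weights = {e: 1.0 for e in all_edges[:4]}                 # sparse seed, density 0.2
history = pwmpr_growth(
    weights, all_edges,
    train_fn=lambda w: None,
    score_fn=lambda w, missing: {e: 1.0 for e in missing},
    eval_fn=lambda w: len(w) / len(all_edges),   # fake accuracy = density
    should_stop_fn=lambda h: h[-1][1] >= 0.5,    # stand-in for the logistic rule
    growth_ratio=0.5,
    rng=random.Random(0),
)
```

With a growth ratio of 0.5 the toy mask grows from 4 to 6, 9, and then 13 edges before the stand-in stopping test fires.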
5. Computational Cost and Empirical Performance
Cumulative training cost is measured as total sparse-FLOPs across all growth iterations, normalized to the cost of full dense training. Empirical evaluation demonstrates:
| Dataset/Model | PWMPR Cost (× dense) | IMP-C Cost (× dense) | Density at Stop | Accuracy |
|---|---|---|---|---|
| CIFAR-10/ResNet-32 | ≈1.5 | ≈3.2 | ≈40% | ≈93.5% |
| CIFAR-100/ResNet-56 | ≈1.5 | ≈3.5 | ≈30% | ≈70.2% |
| TinyImageNet/ResNet-18 | ≈2.0 | ≈4.5 | ≈40% | ≈66.0% |
| TinyImageNet/ViT | ≈1.8 | ≈3.2 | ≈50% | ≈62.1% |
On ImageNet/ResNet-50, PWMPR achieves 71.0 ± 0.04% at 10% density and 73.2 ± 0.13% at 20% density, trailing RigL and SparseMomentum by 1–2%. Across the settings in the table above, the method locates “lottery ticket” subnetworks at roughly half the cumulative cost of iterative magnitude pruning (IMP-C), with PWMPR-to-IMP-C cost ratios between about 0.43 and 0.56.
PWMPR’s randomized growth outperforms deterministic Path Weight Magnitude Product (PWMP), as well as simple random growth (RG) and global greedy (GG) heuristic baselines across all evaluated settings.
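The cost accounting described at the start of this section can be approximated with a few lines of Python. This is a rough sketch under simplifying assumptions that are not taken from the source: sparse FLOPs are assumed to scale linearly with density, every rough phase uses the same fraction of a full training run, and the final phase costs one full run at the stopping density. The hyperparameter values in the example are placeholders, so the result is not expected to reproduce the table above.

```python
def density_schedule(initial_density, growth_ratio, stop_density):
    """Exponential schedule: each step multiplies density by (1 + growth_ratio),
    capped at the stopping density."""
    if initial_density <= 0 or growth_ratio <= 0:
        raise ValueError("initial_density and growth_ratio must be positive")
    densities = [initial_density]
    while densities[-1] < stop_density:
        densities.append(min(stop_density, densities[-1] * (1.0 + growth_ratio)))
    return densities

def cumulative_cost(densities, rough_fraction, final_fraction=1.0):
    """Total cost in units of one dense training run.

    Each growth iteration costs density * rough_fraction; the final phase
    costs final_density * final_fraction (linear-in-density assumption).
    """
    rough = sum(d * rough_fraction for d in densities)
    final = densities[-1] * final_fraction
    return rough + final

# Placeholder hyperparameters, chosen only to exercise the functions.
schedule = density_schedule(initial_density=0.05, growth_ratio=0.5, stop_density=0.4)
cost = cumulative_cost(schedule, rough_fraction=0.2)
```

Under these assumptions the total is dominated by the final phase and the last few growth iterations, since early iterations run at very low density.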
6. Advantages, Limitations, and Scope for Extension
Advantages
- Density Discovery: PWMPR eliminates the need for a pre-specified target density, autonomously identifying the lowest sufficient density via its logistic-fit stopping rule.
- Efficiency: Cumulative compute is 1.5–2× dense training, compared with roughly 3.2–4.5× for IMP-C in the reported experiments.
- Generalization: Stochastic sampling mitigates bottleneck formation, preserving connectivity and broad path coverage.
- Implementation Simplicity: Requires two sparse passes per growth phase for path-based scoring.
Limitations and Open Directions
- Immutability of Existing Edges: Inability to prune already-added edges drives final densities higher than those produced by optimal pruning (IMP-C).
- Attention Mechanisms: The path-kernel surrogate underlying PWMPR does not apply to query/key projections in attention, as their magnitudes are decoupled by the softmax; attention-specific extension is an open problem.
- Stopping Heuristic: The logistic-fit rule is a simple heuristic; more principled stopping criteria could further optimize density discovery.
- Domain Coverage: Experiments are limited to vision tasks; extension to language modeling and sequential domains is unexplored.
- Potential Hybridization: Hybrid grow-prune regimes may combine constructive and destructive sparsification for even better “winning ticket” discovery at sparser densities.
7. Significance and Paradigm Shift
PWMPR reframes the network sparsification problem from destructive pruning to constructive growth, grounded in NTK-theoretic justification and path-based combinatorics. Its balance of principled score-driven edge addition, randomized exploration, and automatic density selection establishes growth-based density discovery as a complementary and competitive alternative to iterative pruning and dynamic sparsification. PWMPR’s observed cost advantage across standard visual benchmarks supports its adoption in scenarios where training efficiency and subnet structure discovery are both critical, signaling a renewed focus on constructive paradigms in sparse neural network research (Yao et al., 30 Sep 2025).