Edge-Popup: Efficient Subnetwork Discovery
- Edge-Popup algorithm identifies highly effective subnetworks in fixed, randomly initialized deep networks by optimizing binary edge selection masks.
- It employs continuous popup scores with per-layer top-k gating (trained via a straight-through estimator) to select a subset of connections without updating the underlying weights.
- Empirical benchmarks on CIFAR-10 and ImageNet reveal that these untrained subnetworks can match or exceed the accuracy of fully trained dense models.
The Edge-Popup algorithm identifies performant subnetworks within randomly initialized neural networks by learning only a binary edge selection mask, leaving the underlying weight values strictly fixed at their initial random settings. The central contribution is to demonstrate that, in sufficiently wide and deep architectures, specific sparsified subnetworks—identified solely via edge selection at initialization—can match or exceed the accuracy of fully trained models. The algorithm operates by assigning a continuous “popup score” to each edge, using per-layer top-$k$ gating to select the most promising connections, and optimizing only these scores with standard gradient-based methods, passing gradients through the non-differentiable gate via a straight-through estimator. As a result, Edge-Popup reveals highly capable “untrained subnetworks,” showing that performance comparable to standard dense training can be achieved by selecting which weights to keep, not by learning their values (Ramanujan et al., 2019).
1. Problem Statement and Motivation
Edge-Popup concerns the subnetwork selection problem: given a deep network with weights drawn randomly from a fixed initialization (e.g., Kaiming normal), determine whether there exists—a priori—a small subnetwork capable of high performance on complex tasks by merely selecting a subset of edges per layer. The motivating question is whether training the weights themselves is necessary, or if the combination of initialization and selective masking suffices to yield near state-of-the-art results. This investigation reveals that large neural architectures typically hide performant subnetworks even before training commences, and that systematic subnetwork discovery is computationally feasible at substantial scale.
2. Mathematical Framework for Subnetwork Discovery
Let $L$ denote the network depth, with every weight $w_{uv}$ fixed at its initialization value. The subnetwork selection task is formalized as a constrained optimization:
- Associate each edge $(u, v)$ with a parameter $s_{uv}$ representing its popup score.
- Define a gating function $h(s_{uv}) \in \{0, 1\}$, which is 1 if $s_{uv}$ ranks within the top-$k\%$ of popup scores for its layer and 0 otherwise.
- The forward computation at unit $v$ becomes $\mathcal{I}_v = \sum_{u} h(s_{uv})\, w_{uv}\, \mathcal{Z}_u$, where $\mathcal{Z}_u$ is the output of unit $u$ in the previous layer.
- The empirical loss $\mathcal{L}$ is minimized over the scores $s$, subject to the per-layer cardinality constraint induced by the top-$k\%$ selection.
The only trainable parameters are the popup scores $s_{uv}$; all weights $w_{uv}$ are immutable.
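Stated compactly in this notation (a sketch: the objective is the per-layer constrained problem described above, and the score update is written as it follows from treating the gate as the identity on the backward pass, with $\alpha$ the learning rate and $|E_\ell|$ the number of edges in layer $\ell$):

$$
\min_{s}\; \mathcal{L}\Big( f\big(x;\, \{\, h(s^{(\ell)}) \odot w^{(\ell)} \,\}_{\ell=1}^{L} \big) \Big)
\qquad \text{s.t.} \qquad \big\lVert h(s^{(\ell)}) \big\rVert_0 = \big\lfloor k \cdot |E_\ell| \big\rfloor \ \ \text{for each layer } \ell,
$$

$$
\frac{\partial \mathcal{L}}{\partial s_{uv}} \;\approx\; \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v}\, w_{uv}\, \mathcal{Z}_u,
\qquad
s_{uv} \;\leftarrow\; s_{uv} - \alpha\, \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v}\, w_{uv}\, \mathcal{Z}_u .
$$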
3. Popup Scores, Top-$k$ Masking, and Algorithmic Realization
Edge-Popup parameterizes each edge with a popup score $s_{uv}$ and imposes per-layer sparsity via top-$k\%$ selection on those scores. The binary mask is determined by selecting the fraction $k$ of edges with the highest scores (absolute values, matching the code below) in each layer. During optimization, the algorithm employs a custom autograd function implementing the top-$k$ selection on the forward pass, while the backward pass propagates gradients through the gate unmodified, acting as a straight-through estimator. The effective weight on the forward pass is $\tilde{w}_{uv} = h(s_{uv})\, w_{uv}$.
Core Algorithm (PyTorch Pseudocode Excerpt)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GetSubnet(torch.autograd.Function):
    """Forward: keep the top-k fraction of scores per layer; backward: straight-through."""

    @staticmethod
    def forward(ctx, scores, k):
        flat = scores.view(-1)
        _, idx = torch.sort(flat)                  # ascending order of scores
        cutoff = int((1 - k) * flat.numel())       # number of edges to drop
        mask = torch.ones_like(flat)
        mask[idx[:cutoff]] = 0                     # zero out the lowest-scoring edges
        return mask.view_as(scores)

    @staticmethod
    def backward(ctx, grad_mask):
        # Straight-through estimator: pass the gradient to the scores unchanged.
        return grad_mask, None


class EdgePopupConv(nn.Conv2d):
    """Convolution with frozen random weights; only the popup scores are learned."""

    def __init__(self, *args, k=0.3, **kwargs):
        super().__init__(*args, **kwargs)
        self.weight.requires_grad = False          # weights stay at initialization
        self.popup_scores = nn.Parameter(torch.empty_like(self.weight))
        nn.init.kaiming_uniform_(self.popup_scores)
        self.k = k

    def forward(self, x):
        mask = GetSubnet.apply(self.popup_scores.abs(), self.k)
        w_eff = self.weight * mask                 # effective weight = frozen weight * binary mask
        return F.conv2d(x, w_eff, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```
Only the popup scores are optimized—weights and batch normalization parameters remain frozen.
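A minimal usage sketch of the layer above. The layer widths, $k$ values, and SGD hyperparameters are illustrative assumptions, and a faithful reproduction would replace the final nn.Linear with a masked Edge-Popup variant as well:

```python
import torch
import torch.nn as nn

# Small CIFAR-style stack built from the EdgePopupConv layer defined above.
model = nn.Sequential(
    EdgePopupConv(3, 64, 3, padding=1, bias=False, k=0.5),
    nn.BatchNorm2d(64, affine=False),   # no learnable scale/bias, matching the frozen-BN convention
    nn.ReLU(),
    EdgePopupConv(64, 128, 3, padding=1, bias=False, k=0.5),
    nn.BatchNorm2d(128, affine=False),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 10),                 # illustrative head; its weights are never updated below
)

# Only the popup scores are handed to the optimizer; every weight stays frozen.
score_params = [p for name, p in model.named_parameters() if "popup_scores" in name]
optimizer = torch.optim.SGD(score_params, lr=0.1, momentum=0.9, weight_decay=1e-4)
```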
4. Theoretical Insights: Loss Decrease via Edge Swapping
Whenever a gradient step on the scores causes an edge $(i, \rho)$ to enter and an edge $(j, \rho)$ to leave the active subnetwork, the mini-batch loss decreases, provided the loss $\mathcal{L}$ is sufficiently smooth and the learning rate is small. The one-dimensional intuition is that, for a local score update producing a swap in the top-$k\%$ ordering, the resulting first-order change in loss is negative, formalizing that such greedy exchanges improve the empirical loss during score optimization.
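A sketch of the argument in the notation of Section 2 (the algebra below is a reconstruction under the stated small-step and smoothness assumptions, with $\alpha$ the learning rate). Suppose $(j,\rho)$ was selected and $(i,\rho)$ was not before the step ($s_{i\rho} < s_{j\rho}$), but the order reverses afterwards ($\tilde{s}_{i\rho} > \tilde{s}_{j\rho}$). Substituting the score update $\tilde{s}_{u\rho} = s_{u\rho} - \alpha\, \frac{\partial \mathcal{L}}{\partial \mathcal{I}_\rho} w_{u\rho} \mathcal{Z}_u$ gives

$$
-\alpha\, \frac{\partial \mathcal{L}}{\partial \mathcal{I}_\rho}\big( w_{i\rho}\mathcal{Z}_i - w_{j\rho}\mathcal{Z}_j \big) \;>\; s_{j\rho} - s_{i\rho} \;>\; 0
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial \mathcal{I}_\rho}\big( w_{i\rho}\mathcal{Z}_i - w_{j\rho}\mathcal{Z}_j \big) \;<\; 0 .
$$

Since swapping $(j,\rho)$ for $(i,\rho)$ changes the input to node $\rho$ by exactly $\Delta\mathcal{I}_\rho = w_{i\rho}\mathcal{Z}_i - w_{j\rho}\mathcal{Z}_j$, the first-order change in loss is $\Delta\mathcal{L} \approx \frac{\partial \mathcal{L}}{\partial \mathcal{I}_\rho}\,\Delta\mathcal{I}_\rho < 0$.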
5. Empirical Findings and Benchmarks
CIFAR-10 and VGG-Like Architectures
- As depth and width increase, subnetworks retaining 30–70% of edges match the accuracy of fully trained dense nets.
- Wide Conv6 with $k$ in this range achieves performance comparable to end-to-end training of the same architecture.
Comparison to Supermask
- Supermask, which employs stochastic Bernoulli masking, generally yields lower accuracy (∼65% on CIFAR-10) and requires careful hyperparameter tuning.
- Edge-Popup consistently outperforms Supermask by 10+ points on CIFAR-10 and scales more readily to ImageNet (see the illustrative contrast below).
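For contrast, a minimal sketch of the stochastic masking that Supermask-style approaches use (an assumption about the general recipe, not a reproduction of that work's code); Edge-Popup instead applies the deterministic per-layer top-$k$ gate shown earlier:

```python
import torch

def sample_stochastic_mask(scores: torch.Tensor) -> torch.Tensor:
    """Bernoulli mask from sigmoid-squashed scores (Supermask-style, illustrative only).

    The mask is resampled on every forward pass, so its sparsity is controlled only
    indirectly through the score magnitudes, and gradients must again be passed
    through with a straight-through estimator.
    """
    keep_probs = torch.sigmoid(scores)
    return torch.bernoulli(keep_probs)
```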
ImageNet
| Architecture | Total → Retained Weights ($k=0.3$) | Top-1 Accuracy (Random Init) | Top-1 Accuracy (Signed Kaiming Constant) |
|---|---|---|---|
| ResNet-50 | 25M → 7.6M | ~61.7% | — |
| ResNet-101 | 44.5M → 13M | ~66.2% | — |
| Wide ResNet-50 | 69M → 20.6M | ~67.95% | up to ~73.3% |
- Wide ResNet-50 subnetworks approach or surpass the top-1 accuracy of a fully trained ResNet-34 ($21.8$M weights).
- Using signed Kaiming constant initialization (±σ_Kaiming) further improves top-1 accuracy up to ∼73.3%.
This suggests that high-performing subnetworks exist and can be efficiently identified in large fixed-weight architectures.
6. Implementation Practices
- Weight Initialization: Kaiming normal (σ = √(2/fan_in)) is standard; signed Kaiming constant (±σ_Kaiming) achieves marginally better results. Variance scaling is advised: for CNNs, scale σ by 1/√k to maintain forward-pass variance, since only a fraction k of each layer's edges remains active (see the initialization sketch after this list).
- BatchNorm: All scale (γ) and bias (β) parameters remain fixed at 1 and 0; they are not trained.
- Hyperparameters:
  - Choosing k: retaining 30–70% of edges per layer yields optimal or near-optimal performance; a much lower k leaves too few edges and underfits, while a much higher k approaches the full random network, which performs poorly without training the weights.
- Extension to ConvNets: Each (in_channel, out_channel, kernel position) weight receives its own popup score; top-k% gating is always applied per layer.
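A minimal sketch of the scaled signed-constant initializer described in the Weight Initialization bullet above. The helper name and exact recipe are assumptions; the key points are the ±σ magnitude and the 1/√k variance correction:

```python
import math
import torch

def signed_kaiming_constant_(weight: torch.Tensor, k: float = 1.0) -> None:
    """Initialize `weight` to +/- sigma with random signs (illustrative helper).

    sigma = sqrt(2 / (k * fan_in)): the usual Kaiming scale, divided by sqrt(k)
    so that forward-pass variance is preserved when only a fraction k of the
    layer's edges remains active.
    """
    fan_in = weight[0].numel()              # in_channels * kernel_h * kernel_w for conv weights
    sigma = math.sqrt(2.0 / (k * fan_in))
    with torch.no_grad():
        # Random +/-1 signs times the constant magnitude sigma.
        weight.copy_(sigma * torch.sign(torch.randn_like(weight)))
```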
7. Significance and Limitations
Edge-Popup reveals that the conventional understanding of training deep networks—learning over all parameter values—is not strictly required for strong performance in overparameterized settings. Instead, edge selection at random initialization suffices to uncover highly performant subnetworks, provided architectures have sufficient width and depth. A plausible implication is that untrained networks contain a rich structure of functional subnetworks which are discoverable without updating base weights. Further, the deterministic top-$k$ masking and straight-through estimator enable efficient and scalable optimization for large models, in contrast to stochastic masking alternatives.
The approach depends on high-dimensional architectures; for smaller models or under-aggressive sparsification (k close to 1, i.e., retaining nearly all edges), performance can degrade or converge slowly. Edge-Popup also requires the computational graph to support masking, and practical deployment assumes sufficient hardware memory to instantiate the full initial network. Despite these constraints, the method substantially recasts the role of initialization and connectivity in deep learning optimization (Ramanujan et al., 2019).