
Edge-Popup: Efficient Subnetwork Discovery

Updated 25 November 2025
  • Edge-Popup algorithm identifies highly effective subnetworks in fixed, randomly initialized deep networks by optimizing binary edge selection masks.
  • It employs continuous popup scores and differentiable top-k gating to select a subset of connections without updating underlying weights.
  • Empirical benchmarks on CIFAR-10 and ImageNet reveal that these untrained subnetworks can match or exceed the accuracy of fully trained dense models.

The Edge-Popup algorithm identifies performant subnetworks within randomly initialized neural networks by learning only a binary edge selection mask, leaving the underlying weight values strictly fixed at their initial random settings. The central contribution is to demonstrate that, in sufficiently wide and deep architectures, specific sparsified subnetworks—identified solely via edge selection at initialization—can match or exceed the accuracy of fully trained models. The algorithm operates by assigning a continuous “popup score” to each edge, using differentiable top-$k$ gating to select the most promising connections on a per-layer basis, and optimizing only these scores using standard gradient-based methods. As a result, Edge-Popup reveals highly capable “untrained subnetworks,” showing that performance comparable to standard dense training can be achieved by selecting which weights to keep, not by learning their values (Ramanujan et al., 2019).

1. Problem Statement and Motivation

Edge-Popup concerns the subnetwork selection problem: given a deep network with weights $w_e$ drawn randomly from a fixed initialization (e.g., Kaiming normal), determine whether there exists—a priori—a small subnetwork capable of high performance on complex tasks by merely selecting a subset of edges per layer. The motivating question is whether training the weights themselves is necessary, or if the combination of initialization and selective masking suffices to yield near state-of-the-art results. This investigation reveals that large neural architectures typically contain performant subnetworks even before training commences, and that systematic subnetwork discovery is computationally feasible at substantial scale.

2. Mathematical Framework for Subnetwork Discovery

Let $L$ denote network depth, with each weight $w_e$ fixed from initialization. The subnetwork selection task is formalized as a constrained optimization:

  • Associate each edge $e$ with a parameter $s_e \in \mathbb{R}^+$ representing its popup score.
  • Define a gating function $h(s_e) \in \{0,1\}$, which is 1 if $s_e$ ranks within the top $k \cdot 100\%$ of popup scores for its layer, and 0 otherwise.
  • The forward computation at unit $v$ becomes $I_v = \sum_{u \rightarrow v} w_{u,v} \cdot z_u \cdot h(s_{u,v})$.
  • The empirical loss $L(x; w \cdot h(s))$ is minimized over $s$, subject to the per-layer cardinality constraint induced by $k$.

The only trainable parameters are $\{s_e\}$ (popup scores); all $w_e$ are immutable.
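
For concreteness, the toy sketch below applies the per-layer top-$k$ gate and evaluates the masked pre-activation $I_v$ for a single unit; the layer size, score values, and choice of $k$ are invented purely for illustration.

import torch

# Toy example: one unit with 6 incoming edges, fixed random weights, k = 0.5.
torch.manual_seed(0)
w = torch.randn(6)                                  # fixed random weights w_e (never trained)
s = torch.tensor([0.9, 0.1, 0.4, 0.7, 0.2, 0.8])    # popup scores s_e
z = torch.randn(6)                                  # activations z_u feeding unit v
k = 0.5                                             # keep the top 50% of edges in this layer

# h(s_e): 1 for the top-k fraction of score magnitudes in the layer, 0 otherwise
num_keep = int(k * s.numel())
threshold = torch.topk(s.abs(), num_keep).values.min()
h = (s.abs() >= threshold).float()

# Masked forward computation: I_v = sum_e w_e * z_e * h(s_e)
I_v = (w * z * h).sum()
print("mask h(s):", h.tolist())                     # here: [1, 0, 0, 1, 0, 1]
print("I_v:", I_v.item())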

Edge-Popup parameterizes each edge with a popup score and imposes per-layer sparsity via top-$k$ selection on those scores. The binary mask $h(s_e)$ is determined by selecting the $k$ fraction of edges with the highest $|s_e|$ in each layer. During optimization, the algorithm employs a custom autograd function implementing the forward top-$k$ operation, while in the backward pass it propagates gradients unmodified as a straight-through estimator. The effective weight on the forward pass is $w_e \cdot h(s_e)$.

3. Core Algorithm (PyTorch Implementation Sketch)

import torch
import torch.nn as nn
import torch.nn.functional as F


class GetSubnet(torch.autograd.Function):
    """Per-layer top-k gating; the backward pass is a straight-through estimator."""

    @staticmethod
    def forward(ctx, scores, k):
        # Keep the top k fraction of scores: zero out the lowest (1 - k) fraction.
        flat = scores.view(-1)
        _, idx = torch.sort(flat)
        cutoff = int((1 - k) * flat.numel())
        mask = torch.ones_like(flat)
        mask[idx[:cutoff]] = 0
        return mask.view_as(scores)

    @staticmethod
    def backward(ctx, grad_mask):
        # Straight-through estimator: gradients pass to the scores unchanged;
        # k receives no gradient.
        return grad_mask, None


class EdgePopupConv(nn.Conv2d):
    def __init__(self, *args, k=0.3, **kwargs):
        super().__init__(*args, **kwargs)
        self.weight.requires_grad = False          # weights stay at their random init
        self.popup_scores = nn.Parameter(torch.empty_like(self.weight))
        nn.init.kaiming_uniform_(self.popup_scores)
        self.k = k

    def forward(self, x):
        # Select the subnetwork by score magnitude, then convolve with the
        # masked (but otherwise untouched) random weights.
        mask = GetSubnet.apply(self.popup_scores.abs(), self.k)
        w_eff = self.weight * mask
        return F.conv2d(
            x, w_eff, self.bias, self.stride,
            self.padding, self.dilation, self.groups)

Only the popup scores are optimized—weights and batch normalization parameters remain frozen.
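
A minimal usage sketch shows how only the popup scores enter the optimizer while the convolution weights stay frozen; it assumes the EdgePopupConv class defined above is in scope, and the layer sizes, learning rate, and data are placeholders rather than settings from the paper.

import torch
import torch.nn as nn

layer = EdgePopupConv(3, 16, kernel_size=3, padding=1, bias=False, k=0.3)

# Only parameters with requires_grad=True (the popup scores) are optimized.
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9, weight_decay=1e-4)

x = torch.randn(8, 3, 32, 32)            # placeholder batch
target = torch.randn(8, 16, 32, 32)      # placeholder regression target
loss = nn.functional.mse_loss(layer(x), target)

optimizer.zero_grad()
loss.backward()
optimizer.step()                          # updates popup_scores only; weights are untouched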

4. Theoretical Insights: Loss Decrease via Edge Swapping

Whenever the ranking of scores changes so that an edge $e$ enters and an edge $e'$ leaves the active subnetwork, the mini-batch loss decreases, provided the loss $L$ is sufficiently smooth and the learning rate is small. The one-dimensional intuition is that a local score update produces a swap in the top-$k$ order only when the first-order change in the loss, $\frac{\partial L}{\partial I_v}\,(w_e z_u - w_{e'} z_{u'})$, is negative; such greedy exchanges therefore improve the empirical loss during score optimization.
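
Written out with the straight-through gradient (a first-order sketch in the notation of Section 2, with learning rate $\alpha$), the score update and the resulting swap condition are:

$$
s_{u,v} \;\leftarrow\; s_{u,v} \;-\; \alpha\,\frac{\partial L}{\partial I_v}\,w_{u,v}\,z_u,
\qquad
\Delta L \;\approx\; \frac{\partial L}{\partial I_v}\bigl(w_{e}\,z_{u} - w_{e'}\,z_{u'}\bigr) \;<\; 0
\quad\text{when $e$ replaces $e'$ in the top-$k$ set.}
$$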

5. Empirical Findings and Benchmarks

CIFAR-10 and VGG-Like Architectures

  • As depth and width increase, subnetworks retaining $k \approx$ 30–70% of edges match the accuracy of fully trained dense nets.
  • Wide Conv6 with $k=0.5$ achieves comparable performance to end-to-end training.

Comparison to Supermask

  • Supermask, which employs stochastic Bernoulli masking, generally yields lower accuracy (∼65% on CIFAR-10) and requires careful hyperparameter tuning (see the sketch after this list).
  • Edge-Popup consistently outperforms Supermask by 10+ points on CIFAR-10 and scales more readily to ImageNet.
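
For contrast, a minimal sketch of the two gating rules; the Bernoulli-over-sigmoid form below is a common paraphrase of Supermask-style stochastic masking, and the score values are illustrative only.

import torch

scores = torch.randn(1000)    # illustrative per-edge scores
k = 0.5

# Edge-Popup: deterministic top-k gate on score magnitude (per layer)
keep = int(k * scores.numel())
threshold = torch.topk(scores.abs(), keep).values.min()
mask_edge_popup = (scores.abs() >= threshold).float()

# Supermask-style gate: each edge sampled from Bernoulli(sigmoid(score)),
# so both the kept set and the effective sparsity are stochastic
mask_supermask = torch.bernoulli(torch.sigmoid(scores))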

ImageNet

| Architecture | Nonzero Edges ($k=0.3$) | Top-1 Accuracy (Random) | Top-1 Accuracy (Signed Init) |
|---|---|---|---|
| ResNet-50 (25M → 7.6M) | 7.6M | ~61.7% | |
| ResNet-101 (44.5M → 13M) | 13M | ~66.2% | |
| Wide ResNet-50 (69M → 20.6M) | 20.6M | ~67.95% | up to ~73.3% |
  • Wide ResNet-50 subnetworks approach or surpass the accuracy of a trained ResNet-34 (21.8M weights, 73.3% top-1).
  • Using signed Kaiming constant initialization (±σ_Kaiming) further improves top-1 accuracy up to ∼73.3%.

This suggests that high-performing subnetworks exist and can be efficiently identified in large fixed-weight architectures.

6. Implementation Practices

  • Weight Initialization: Kaiming normal (σ = √(2/fan_in)) is standard; signed Kaiming constant (±σ_Kaiming) achieves consistently better results. Variance scaling is advised: for CNNs, scale σ by $1/\sqrt{k}$ to maintain forward-pass variance (see the initialization sketch after this list).
  • BatchNorm: All scale (γ) and bias (β) parameters remain fixed at 1 and 0, respectively; they are not trained.
  • Hyperparameters:
    • CIFAR-10: SGD with lr=0.1, momentum=0.9, weight decay=1e-4, batch=128, cosine lr schedule, 100 epochs; Adam (lr≈3e-4) as ablation.
    • ImageNet: Follows the canonical PyTorch ResNet setup (SGD, lr schedule 0.1→0.001 via cosine, batch=256, 90 epochs, weight decay=1e-4, momentum=0.9).
  • Choosing k: Retaining 30–70% of edges per layer yields optimal or near-optimal performance; lower $k$ leaves too few connections and underfits, while $k$ near 1 approaches the full untrained random network, which performs poorly without weight training.
  • Extension to ConvNets: Each spatial (in_channel, out_channel, kernel) connection receives its own popup score; top-$k$ gating is always applied per layer.
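
A minimal sketch of the variance-scaled signed-constant initialization described in the list above; the helper name and the fan-in computation are our own, introduced only for illustration.

import math
import torch

def signed_kaiming_constant_(weight: torch.Tensor, k: float) -> torch.Tensor:
    # Hypothetical helper: every weight is set to +/- sigma with a random sign,
    # where sigma = sqrt(2 / fan_in) is the Kaiming scale, rescaled by 1/sqrt(k)
    # to preserve forward-pass variance after the top-k gate removes edges.
    fan_in = weight[0].numel()                    # in_channels * kH * kW for a conv weight
    sigma = math.sqrt(2.0 / fan_in) / math.sqrt(k)
    with torch.no_grad():
        signs = torch.randint(0, 2, weight.shape).to(weight) * 2 - 1
        weight.copy_(signs * sigma)
    return weight

# Example (using the EdgePopupConv layer defined earlier):
#   conv = EdgePopupConv(3, 16, 3, padding=1, bias=False, k=0.3)
#   signed_kaiming_constant_(conv.weight, k=conv.k)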

7. Significance and Limitations

Edge-Popup reveals that the conventional understanding of training deep networks—learning over all parameter values—is not strictly required for strong performance in overparameterized settings. Instead, edge selection at random initialization suffices to uncover highly performant subnetworks, provided architectures have sufficient width and depth. A plausible implication is that untrained networks contain a rich structure of functional subnetworks which are discoverable without updating base weights. Further, the deterministic top-$k$ masking and straight-through estimator enable efficient and scalable optimization for large models, in contrast to stochastic masking alternatives.

The approach depends on sufficiently large, overparameterized architectures; for smaller models, or with too little sparsification ($k \rightarrow 1$), performance can degrade or converge slowly. Edge-Popup requires all computational graphs to support masking, and practical deployment assumes sufficient hardware memory to instantiate the initial large network. Despite these constraints, the method substantially recasts the role of initialization and connectivity in deep learning optimization (Ramanujan et al., 2019).

References

  • Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., & Rastegari, M. (2019). What's Hidden in a Randomly Weighted Neural Network? arXiv:1911.13299.
