Randomly Weighted Subnetworks
- Randomly weighted subnetworks are sparse masked subnetworks that, when isolated within randomly initialized networks, achieve nontrivial and sometimes competitive performance without any weight training.
- They are discovered using diverse methods such as score-based gradient masking, stochastic optimization, and evolutionary search to efficiently navigate the combinatorial mask space.
- Empirical and theoretical studies highlight their role in model compression, robustness against adversarial attacks, and rapid architecture prototyping under overparameterization.
A randomly weighted subnetwork is a subset of the parameters of a (potentially untrained) randomly initialized neural network that, when isolated by masking the rest, achieves nontrivial or even competitive performance on a given predictive or generative task. The systematic study of these objects yields rich architectural, theoretical, and practical insights, transforming the perspective on expressivity, redundancy, overparameterization, model compression, and the mechanisms by which neural networks achieve high accuracy. Recent results have established not only the existence, but—remarkably—the abundance of such subnetworks, sometimes matching the accuracy of fully trained dense models while using only fixed, untrained weights, and even supporting quantization to binary weights and activations.
1. Mathematical Foundations and Theoretical Guarantees
The canonical setting considers a neural network $f(x; \theta)$, where the weights $\theta \in \mathbb{R}^d$ are initialized from a random distribution (e.g., i.i.d. Gaussian, signed constant, uniform). A randomly weighted subnetwork is specified via a binary mask $m \in \{0,1\}^d$, so that the masked weights are $m \odot \theta$. The resulting subnetwork is $f(x; m \odot \theta)$. The mask is typically constrained to select no more than $k$ weights, or to satisfy some fixed layerwise sparsity.
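As a concrete illustration of this setup, the following is a minimal sketch (assuming PyTorch; the class name MaskedLinear and the initialization scale are illustrative) of a layer whose random weights $\theta$ stay fixed while a binary mask $m$ selects the active subnetwork:

```python
# Minimal sketch of the masked-subnetwork definition f(x; m ⊙ θ), assuming PyTorch.
# The weights stay at their random initialization; only the mask decides which
# of them participate in the forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, keep_fraction=0.5):
        super().__init__()
        # θ: fixed random weights, never trained (illustrative init scale).
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.1,
                                  requires_grad=False)
        # m: binary mask selecting roughly `keep_fraction` of the weights.
        self.register_buffer(
            "mask", (torch.rand(out_features, in_features) < keep_fraction).float())

    def forward(self, x):
        return F.linear(x, self.theta * self.mask)  # f(x; m ⊙ θ)
```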
A central theoretical result (Sreenivasan et al., 2021, Diffenderfer et al., 2021) asserts that, for fully connected ReLU networks of depth $\ell$ and maximal width $d$, any target ReLU network with bounded weights can be $\varepsilon$-approximated (in uniform norm on the unit ball) by a pruned subnetwork carved out of a sufficiently wide and deep random network, with overwhelming probability. Specifically, for random networks in which all non-final weights are i.i.d. random signs, poly-logarithmic overparameterization suffices:
- For target error $\varepsilon$, target width $d$, and target depth $\ell$, consider an overparameterized random network $g$ whose width and depth exceed $d$ and $\ell$ by at most factors polylogarithmic in $d$, $\ell$, and $1/\varepsilon$, with all entries in $\{-1,+1\}$ (except for a scaling of the final layer). Then, with high probability over the draw of $g$, every such target network is $\varepsilon$-approximated by some masked subnetwork of $g$.
This establishes a Strong Lottery Ticket Hypothesis in the binary setting: the magnitudes of the random weights are inessential, and purely random sign assignments suffice, provided sufficient (but still mild) overparameterization.
Furthermore, in the Multi-Prize Lottery Ticket Hypothesis (Diffenderfer et al., 2021), it is shown that not only do such subnetworks exist for a given random initialization, but that many distinct high-performing masks exist in large random networks (see also mask diversity observations below).
2. Algorithms for Subnetwork Discovery
The discovery of high-performing subnetworks in a randomly weighted network is a combinatorial optimization problem. Methods range from gradient-based surrogate optimization with straight-through estimators, to bilevel robust optimization, to discrete search heuristics and prospective quantum approaches.
Score-based Gradient Masking (Edge-Popup, Biprop, Supermask)
- Edge-Popup (Ramanujan et al., 2019): For each weight, maintain a trainable score $s$; at each iteration, build the binary mask by retaining the top-$k\%$ of scores in each layer. Only the scores are updated via SGD using a straight-through estimator, while the actual weights remain fixed. The mask is thus adapted to maximize task performance under the fixed random weights (see the sketch after this list).
- Biprop (Diffenderfer et al., 2021): For binary-weighted networks, iteratively interleaves masking, quantization, and rescaling. It maintains binary weights and a binary mask, updates a per-weight mask score via gradient descent with a straight-through estimator, prunes the least salient connections, and rescales the active weights for quantization robustness.
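The following is a minimal sketch of Edge-Popup-style score-based masking, assuming PyTorch; the class names (TopKMask, ScoredLinear) and hyperparameters are illustrative rather than the reference implementation. A per-layer top-$k$ mask is built from the scores in the forward pass, and the straight-through backward pass routes gradients to the scores while the random weights stay frozen:

```python
# Score-based gradient masking sketch (Edge-Popup style), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """Binary top-k mask with a straight-through gradient for the scores."""

    @staticmethod
    def forward(ctx, scores, sparsity):
        k = int((1.0 - sparsity) * scores.numel())            # weights to keep
        threshold = scores.flatten().kthvalue(scores.numel() - k + 1).values
        return (scores >= threshold).float()                  # keep the top-k scores

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                              # straight-through estimator


class ScoredLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity=0.5):
        super().__init__()
        # Fixed random weights (illustrative init scale), never updated.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1,
                                   requires_grad=False)
        self.scores = nn.Parameter(torch.rand(out_features, in_features))
        self.sparsity = sparsity

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.sparsity)
        return F.linear(x, self.weight * mask)                # only scores receive gradients
```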
Stochastic Mask Optimization
- Gumbel-Softmax Subnetworks (Dupont et al., 2022): Attaches a Gumbel-Softmax–parameterized Bernoulli to each weight (or group of weights), learns mask probabilities (and layerwise rescale factors) via SGD and the straight-through Gumbel-Softmax estimator, and extracts a performant sparse subnetwork after convergence.
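A minimal sketch of this stochastic mask optimization, assuming PyTorch; the class name GumbelMaskedLinear, the temperature value, and the 0.5 extraction threshold are illustrative assumptions rather than the authors' implementation:

```python
# Stochastic mask learning with a Bernoulli relaxation via Gumbel-Softmax (PyTorch sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelMaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, temperature=1.0):
        super().__init__()
        # Fixed random weights (illustrative init scale), never updated.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1,
                                   requires_grad=False)
        # Per-weight logits for the "keep" probability of a Bernoulli mask.
        self.logits = nn.Parameter(torch.zeros(out_features, in_features))
        self.temperature = temperature

    def forward(self, x):
        # Two-class Gumbel-Softmax over {drop, keep}; hard=True gives a binary
        # sample forward and a straight-through gradient backward.
        logits2 = torch.stack([torch.zeros_like(self.logits), self.logits], dim=-1)
        sample = F.gumbel_softmax(logits2, tau=self.temperature, hard=True)
        mask = sample[..., 1]                                 # the "keep" channel
        return F.linear(x, self.weight * mask)

    def extract_mask(self):
        # After convergence, keep weights whose keep-probability exceeds 1/2.
        return (torch.sigmoid(self.logits) > 0.5).float()
```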
Greedy Forward Selection
- Greedy Subnetwork Selection (Ye et al., 2020): Starts from an empty subnetwork and iteratively, greedily adds the single neuron or unit that most reduces the loss (estimated on a minibatch), continuing until sparsity or loss targets are met. Convergence of the loss is guaranteed as the subnetwork grows, at a rate that improves further under interior-point (overparameterization) conditions.
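A minimal sketch of greedy forward selection over a layer's output neurons, assuming PyTorch-style tensors; the helper evaluate_loss and the candidate-set handling are illustrative assumptions, not the authors' code:

```python
# Greedy forward selection sketch: grow a neuron-level mask one unit at a time.
import torch


def greedy_select(model, layer_mask, candidates, val_batch, evaluate_loss, target_size):
    """Greedily add the unit that most reduces loss until `target_size` units are selected.

    layer_mask: 1-D {0,1} tensor over the layer's output neurons (starts all zero).
    candidates: set of neuron indices not yet selected.
    evaluate_loss(model, mask, batch): returns validation loss with `mask` applied.
    """
    selected = []
    while len(selected) < target_size and candidates:
        best_idx, best_loss = None, float("inf")
        for idx in candidates:
            trial = layer_mask.clone()
            trial[idx] = 1.0
            loss = evaluate_loss(model, trial, val_batch)     # one-minibatch estimate
            if loss < best_loss:
                best_idx, best_loss = idx, loss
        layer_mask[best_idx] = 1.0                            # commit the best unit
        candidates.remove(best_idx)
        selected.append(best_idx)
    return layer_mask, selected
```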
Evolutionary and Discrete Methods
- Evolutionary Search (Shen et al., 2021): Mask search is performed via an evolutionary algorithm with population selection, mutation, bitwise crossover, and optional hill-climbing, optimizing a non-differentiable metric (e.g. BLEU for translation tasks).
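A minimal sketch of evolutionary mask search with selection, bitwise crossover, and mutation, assuming NumPy; the population size, mutation rate, and fitness interface are illustrative, and the fitness can be any non-differentiable metric such as BLEU:

```python
# Evolutionary mask search sketch, assuming NumPy; fitness(mask) -> float to maximize.
import numpy as np


def evolve_masks(fitness, num_weights, pop_size=32, generations=100,
                 mutation_rate=0.01, keep_ratio=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    pop = (rng.random((pop_size, num_weights)) < keep_ratio).astype(np.uint8)
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(-scores)                           # best individuals first
        parents = pop[order[: pop_size // 2]]                 # selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, num_weights)                # single-point bitwise crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(num_weights) < mutation_rate    # bit-flip mutation
            child = np.where(flip, 1 - child, child)
            children.append(child.astype(np.uint8))
        pop = np.concatenate([parents, np.stack(children)], axis=0)
    return pop[np.argmax([fitness(m) for m in pop])]
```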
Quantum Algorithms (Prospective)
Quantum algorithms could potentially speed up subnetwork selection by leveraging amplitude amplification (Grover search), quantum annealing for energy minimization over mask combinatorics, or hybrid variational approaches (QAOA, VQE). These approaches are currently prospective, with complexity reductions realized for local (layerwise) K-bit subproblems (Whitaker, 2023).
3. Empirical Findings and Performance Characteristics
Extensive empirical evidence across modalities, architectures, and tasks underlines the power and ubiquity of randomly weighted subnetworks.
- On ImageNet: In a WideResNet-50 initialized with random (Signed Kaiming Constant) weights, the Edge-Popup algorithm can identify a subnetwork that matches the accuracy of a fully trained ResNet-34, and approaches that of a trained ResNet-50 (Ramanujan et al., 2019).
- Binary MPTs: On CIFAR-10, a binary subnetwork obtained by pruning and 1/32 quantization achieves 94.8% Top-1 accuracy (surpassing the baseline) (Diffenderfer et al., 2021). On ImageNet, binary MPT-1/32 networks achieve 74.03% Top-1 accuracy with 13.7M parameters.
- Robustness: Robust Scratch Tickets (RSTs) exist in randomly initialized networks—adversarial optimization over masks produces subnetworks with robust accuracy matching (or exceeding) adversarially trained dense nets under both PGD and AutoAttack, at equivalent or higher sparsities (Fu et al., 2021).
- Generative tasks: In U-Net architectures for audio source separation, fixing either the encoder or decoder to random weights (and training the other half) yields performance that strongly correlates with the fully trained model, enabling rapid architectural search (Chen et al., 2019).
- NLP and Transformers: Subnetworks discovered in randomly weighted one-layer Transformers recover a large fraction of the BLEU score of a trained small Transformer on IWSLT14, and of a trained Transformer on WMT14 (Shen et al., 2021).
- Mask Diversity: Multiple distinct high-accuracy masks can be found in the same randomly weighted model, with low Jaccard similarity even under fixed architecture and optimization schedule, demonstrating the abundance of "winning tickets" (Gorbett et al., 2023).
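As a concrete measure of mask diversity, the Jaccard similarity between two discovered masks can be computed as below (a small NumPy sketch; the function name is illustrative). Low values indicate largely disjoint winning tickets:

```python
# Jaccard similarity between two binary masks, assuming NumPy.
import numpy as np


def jaccard_similarity(mask_a, mask_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of the sets of retained weights."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0                       # two empty masks count as identical
    return np.logical_and(a, b).sum() / union
```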
4. Overparameterization, Expressivity, and Mask Geometry
The expressivity of randomly weighted subnetworks relies heavily on overparameterization. Classical dense random networks implicitly embed almost all target functions as subnetworks, with the required width and depth overhead being:
- Binary weights: Only polylogarithmic overparameterization is needed for universal approximation of depth-$\ell$, width-$d$ ReLU networks (Sreenivasan et al., 2021).
- Continuous weights: Earlier results required larger overhead, polynomial in the relevant parameters [Malach et al.], later improved to logarithmic factors [Pensia et al.].
- Pruning vs. fixed sign assignment: The expressive power of pruned binary subnetworks strictly exceeds that of unpruned architectures of the same width (Sreenivasan et al., 2021, Prop. 2).
The location of non-zero weights ("mask geometry") is the key to the existence of high-performing subnetworks; the actual values (signs/magnitudes) are, up to scaling, less important provided sufficient random diversity and network size. For adversarial robustness, mask geometry can encode robust behaviors absent from the dense parent (Fu et al., 2021), and poor inter-subnetwork adversarial transfer is observed when switching between different masks.
5. Practical Implications, Limitations, and Applications
Model Compression and Efficient Inference
Randomly weighted subnetworks enable model compression without retraining the original weights: storing the mask, plus possibly a few layerwise scaling factors for quantization, suffices. After discovery (see the storage sketch following the list below):
- Only mask bits and (if needed) binary weights (XNOR-friendly) are required for inference (Diffenderfer et al., 2021).
- Hardware inference can exploit static random weight structure for efficient lookups (Ramanujan et al., 2019).
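A minimal sketch of this storage argument, assuming NumPy and PyTorch: only the packed mask bits and the seed of the random initialization are stored, and the dense random weights are regenerated at load time. The function names and the .npz format are illustrative assumptions:

```python
# Store only the packed mask bits plus the initialization seed; regenerate weights on load.
import numpy as np
import torch


def save_subnetwork(mask, seed, path):
    # `path` should end in ".npz"; the mask costs 1 bit per weight after packing.
    packed = np.packbits(np.asarray(mask, dtype=np.uint8).ravel())
    np.savez(path, packed=packed, shape=np.asarray(mask).shape, seed=seed)


def load_subnetwork(path):
    data = np.load(path)
    shape = tuple(int(s) for s in data["shape"])
    bits = np.unpackbits(data["packed"])[: int(np.prod(shape))]
    mask = torch.from_numpy(bits.reshape(shape).astype(np.float32))
    gen = torch.Generator().manual_seed(int(data["seed"]))
    weights = torch.randn(shape, generator=gen)               # regenerate random weights
    return weights * mask                                      # effective sparse weights
```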
Robustness and Defense
Robust Scratch Tickets and the Random RST Switch (R2S) ensemble can outperform classical adversarial defenses, exploiting poor attack transferability across masks (Fu et al., 2021). Storage cost is minimal (pool of masks).
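A minimal sketch of a random-switch ensemble over a pool of masks, assuming PyTorch; the class name RandomSwitchLinear and the pool handling are illustrative. For each forward call, one mask is drawn at random, so an attack crafted against one subnetwork transfers poorly to the one actually used:

```python
# Random-switch ensemble sketch over a pool of masks sharing fixed random weights.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomSwitchLinear(nn.Module):
    def __init__(self, weight, mask_pool):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)   # fixed random weights
        # Pool of precomputed binary masks, each selecting a different subnetwork.
        self.mask_pool = [m.to(weight.dtype) for m in mask_pool]

    def forward(self, x):
        mask = random.choice(self.mask_pool)                      # switch subnetwork per call
        return F.linear(x, self.weight * mask)
```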
Architecture Search
Because the performance of a randomly weighted subnetwork correlates strongly with that of the corresponding fully trained architecture, randomly weighted subnetworks enable efficient architecture prototyping before committing to full-scale training (Chen et al., 2019).
Limitations
- The theoretical results are fundamentally nonconstructive: efficient discovery of the exact optimal mask is a combinatorial problem, and current algorithms are only heuristically effective—exponential time may be required in the worst case (Sreenivasan et al., 2021, Ramanujan et al., 2019).
- Practical performance degrades at extreme sparsity, for narrow/deep or very shallow architectures, or when careful finite-precision scaling is neglected (Sreenivasan et al., 2021, Ramanujan et al., 2019).
- Adversarially robust subnetworks require solving bilevel min-max optimization, which is computationally intensive (Fu et al., 2021).
- The need for overparameterization (logarithmic or higher) limits ultra-low footprint applications unless iterative recycling, mask diversity, or other memory-reuse techniques are deployed (Gorbett et al., 2023).
6. Extensions, Open Questions, and Future Directions
Key future directions include:
- Combinatorial Counting and Landscape: Quantifying the number of distinct winning tickets and their landscape structure for given architectures and sparsities; precise combinatorial bounds remain open (Gorbett et al., 2023).
- Advanced Algorithms: Approaches leveraging quantum optimization (Whitaker, 2023), advanced mask optimization (mask relaxation, stochasticity), and iterative recycling schemes for eliminating storage/reinitialization bottlenecks (Gorbett et al., 2023).
- Generalization Domains: Extending mask-based approaches to NLP transformers, RL, unsupervised/generative models, and new architectures.
- Training-free NAS: Fully train-free neural architecture search by probing the structure of random subnetworks as performance proxies, before committing resources to training (Chen et al., 2019).
- Mask Geometry and Theoretical Analysis: Deepening understanding of the interaction between mask geometry, network depth/width, and expressivity, as well as the role of random initialization distributions.
Randomly weighted subnetworks represent a paradigm shift in the understanding of deep network redundancy, expressivity, and the sources of trainable capacity. Model design, compression, and defense methodologies can leverage their combinatorial richness for both practical gains and new theoretical insights into the nature of overparameterized function classes.