Optimal Lottery Tickets via SubsetSum
- The paper establishes that sufficient over-parameterization in random neural networks guarantees the existence of sparse subnetworks ('lottery tickets') that approximate any target network to within ε-accuracy via subset-sum reductions.
- The methodology reduces the pruning problem to high-dimensional SubsetSum challenges, achieving logarithmic over-parameterization bounds and supporting both unstructured and structured sparsity results.
- Experimental validations confirm that SubsetSum-inspired pruning not only matches but can outperform naïve wider network designs, paving the way for more efficient CNN and fully-connected network approximations.
Optimal Lottery Tickets via SubsetSum refers to a line of rigorous results establishing that, given a sufficiently over-parameterized random neural network, there exist sparse subnetworks ("lottery tickets")—obtainable solely via pruning—capable of closely approximating any target neural network in a given family. The key analytical technique is a reduction of the network pruning problem to random instances of the SubsetSum problem, which enables the derivation of tight over-parameterization requirements and informs both unstructured and structured sparsity results.
1. Theoretical Framework and Problem Formalization
The strong Lottery Ticket Hypothesis (LTH) posits that any target network $f$ from, e.g., the family $\mathcal{F}$ of depth-$l$, width-$d$, norm-bounded ReLU networks
can, with high probability, be $\epsilon$-approximated on the unit ball (in the $\ell_2$ norm, unless stated otherwise) by a subnetwork of a sufficiently wide, randomly initialized network $g$, after pruning. The random network is constructed as follows: \begin{align*} g(x) &= M_{2l}\,\sigma(M_{2l-1}\cdots \sigma(M_1 x)), \\ M_j &\ \text{entries i.i.d. from a standard distribution (e.g., }\mathrm{Unif}[-1,1]\text{ or }N(0,1)). \end{align*} A pruned subnetwork $\tilde g$ is specified by binary masks $S_j$ applied entry-wise to each $M_j$: \begin{align*} \tilde g(x) &= (S_{2l}\odot M_{2l})\,\sigma\big((S_{2l-1}\odot M_{2l-1})\cdots \sigma((S_1\odot M_1)\,x)\big). \end{align*}
The goal is to establish tight bounds on the required width (over-parameterization) as a function of the target depth $l$, width $d$, and the desired accuracy $\epsilon$, and to analyze the structured constraints under which such tickets can exist.
2. Reduction to the SubsetSum Problem
A critical observation underlying the formal theory is that the core difficulty in approximating a target network by sparse selection from a wider random network (via pruning) can be formulated as a sequence of (potentially high-dimensional) instances of the Random Subset-Sum Problem (RSSP).
The main technical challenge involves showing that, with high probability, for any fixed target parameter $w \in [-1,1]$ (or target vector $w \in [-1,1]^d$), it is possible to select a binary subset (mask) $S \subseteq [n]$ of random variables (weights) $X_1,\dots,X_n$ such that
$$\Big|\, w - \sum_{i \in S} X_i \,\Big| \le \epsilon,$$
where the $X_i$ are i.i.d. from an appropriate initialization distribution. For scalar targets, Lueker's SubsetSum theorem guarantees the existence of such subsets with $n = O(\log(1/\epsilon))$ samples. This result extends to higher dimensions and entire network layers through block/grouped constructions.
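To make the scalar reduction concrete, the following minimal sketch (NumPy only; the sample-size constant and the exhaustive enumeration are illustrative choices, not the paper's procedure) checks empirically that on the order of $\log(1/\epsilon)$ uniform samples suffice for every target in $[-1,1]$ to lie within $\epsilon$ of some subset sum:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01
n = int(3 * np.log2(1 / eps))          # ~C log(1/eps) samples; C = 3 is an illustrative constant
samples = rng.uniform(-1.0, 1.0, size=n)

# Enumerate all 2^n subset sums once (feasible for n ~ 20).
sums = np.zeros(1)
for x in samples:
    sums = np.concatenate([sums, sums + x])
sums.sort()

# Every target in [-1, 1] should be within eps of some subset sum, w.h.p.
targets = np.linspace(-1.0, 1.0, 401)
idx = np.clip(np.searchsorted(sums, targets), 1, len(sums) - 1)
errors = np.minimum(np.abs(sums[idx] - targets), np.abs(sums[idx - 1] - targets))
print(f"n = {n}, worst-case approximation error: {errors.max():.4f} (target eps = {eps})")
```

The exhaustive enumeration is exponential in $n$ and only serves to illustrate the existence statement; it is not a practical pruning routine.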
For convolutional and deep architectures, additional structure emerges. Specifically, the multidimensional random Subset-Sum (MRSS) variant for "normally-scaled normal" (NSN) vectors—vectors of the form $X_i = \lambda_i Z_i$ for independent $\lambda_i \sim N(0,1)$ and $Z_i \sim N(0, I_d)$, i.i.d. across $i$—captures the dependencies induced by parameter sharing and structured sparsity constraints in CNNs.
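A companion sketch for the multidimensional variant, assuming the NSN form $X_i = \lambda_i Z_i$ stated above (the dimension, sample count, and exhaustive enumeration are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 20                     # illustrative dimension and sample count

# NSN vectors: an independent standard normal scalar times a standard normal vector.
lam = rng.standard_normal(n)
Z = rng.standard_normal((n, d))
X = lam[:, None] * Z

# Enumerate all 2^n vector-valued subset sums via the doubling trick.
sums = np.zeros((1, d))
for x in X:
    sums = np.vstack([sums, sums + x])

target = np.array([0.3, -0.7])   # any target vector of moderate norm
err = np.linalg.norm(sums - target, axis=1).min()
print(f"best subset-sum distance to target: {err:.4f}")
```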
3. Main Results: Existence and Optimality of Lottery Tickets
The primary theorems show that a random (twice as deep) ReLU network with logarithmic (in width, depth, and precision) over-parameterization suffices for universal $\epsilon$-approximation of all target networks (Pensia et al., 2020), and that this is close to optimal for constant depth.
Precise Statement (Fully-Connected Networks)
Let $l$ be the target network depth, $d_0, d_1, \dots, d_l$ the layer widths (with $d = \max_i d_i$), and $\epsilon, \delta \in (0,1)$. Construct $g$ as above with depth $2l$, choosing for each $i \in [l]$:
- intermediate width $d'_i = C\, d_{i-1}\log\!\big(\tfrac{d\,l}{\min(\epsilon,\delta)}\big)$ for the layer pair $(M_{2i-1}, M_{2i})$, with $C$ a universal constant. Then with probability at least $1-\delta$, for every $f \in \mathcal{F}$ there exist binary masks such that the pruned network $\tilde g$ satisfies
$$\sup_{\|x\| \le 1}\ \|f(x) - \tilde g(x)\| \le \epsilon.$$
This construction—based on repeated reduction to (multivariate) subset-sum—yields an exponential improvement over earlier polynomial over-parameterization bounds. For constant-depth networks, the over-parameterization factor must be $\Omega(\log(1/\epsilon))$, so the construction is essentially optimal (Pensia et al., 2020).
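For a rough sense of scale, the helper below tabulates the intermediate widths implied by the bound (the constant $C$ and the exact argument of the logarithm are placeholder assumptions of this sketch, not values from the paper):

```python
import math

def random_net_widths(target_widths, depth, eps, delta, C=1.0):
    """Illustrative widths d'_i = C * d_{i-1} * log(d*l / min(eps, delta)) for the
    intermediate layers of the twice-as-deep random network (C is a placeholder)."""
    d_max = max(target_widths)
    log_factor = math.log(d_max * depth / min(eps, delta))
    return [math.ceil(C * d_prev * log_factor) for d_prev in target_widths[:-1]]

# Example: a depth-2 target with widths 784 -> 500 -> 10 and eps = delta = 0.01.
print(random_net_widths([784, 500, 10], depth=2, eps=0.01, delta=0.01))
```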
Structured Strong Lottery Tickets in CNNs
For convolutional networks, the results leverage the MRSS for NSN vectors (Cunha et al., 2023). Given a target CNN $f$ of depth $l$, kernel sizes $k_1,\dots,k_l$, and channel dimensions $c_0,\dots,c_l$, and a random CNN $g$ containing a pair of over-parameterized convolutional layers per target layer, applying structured (channel-blocked and filter-removal) pruning yields the following guarantee:
If the channel widths of $g$ exceed those of the target by a factor polynomial in the $c_i$, $k_i$, $l$, and $1/\epsilon$, then, with high probability, there exists a pruned subnetwork $\tilde g$ such that
$$\sup_{\|x\| \le 1}\ \|f(x) - \tilde g(x)\| \le \epsilon,$$
where $\tilde g$ is obtained via a channel-blocked mask at layer $2i-1$ and filter pruning at layer $2i$, for each $i \in [l]$. Thus, a polynomially-sized random CNN admits structured lottery tickets capable of uniformly approximating any smaller network.
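For intuition on what the structured constraint means operationally, the sketch below (NumPy only; shapes and keep-probabilities are arbitrary) contrasts unstructured entry-wise masking with channel-blocked masking and filter removal on a convolutional weight tensor of shape (out_channels, in_channels, k, k):

```python
import numpy as np

rng = np.random.default_rng(2)
out_c, in_c, k = 8, 6, 3
W = rng.standard_normal((out_c, in_c, k, k))   # random conv kernel stack

# Unstructured pruning: independent Bernoulli mask on every scalar weight.
unstructured = (rng.random(W.shape) < 0.5).astype(W.dtype)

# Channel-blocked pruning: keep or drop whole (out_channel, in_channel) kernel blocks.
block_mask = (rng.random((out_c, in_c)) < 0.5).astype(W.dtype)
channel_blocked = block_mask[:, :, None, None] * np.ones((1, 1, k, k))

# Filter removal: drop entire output filters (whole rows of the tensor).
filter_mask = (rng.random(out_c) < 0.5).astype(W.dtype)
filter_removed = filter_mask[:, None, None, None] * np.ones((1, in_c, k, k))

for name, m in [("unstructured", unstructured),
                ("channel-blocked", channel_blocked),
                ("filter removal", filter_removed)]:
    print(f"{name:16s} keeps {int(m.sum())}/{W.size} weights")
```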
4. Proof Outlines and Key Lemmas
Stepwise Approximation is achieved by:
- Single-weight Approximation: For each scalar target weight $w$, a budget of $O(\log(1/\epsilon))$ random variables suffices to reach precision $\epsilon$ using subset masks (see the sketch after this list).
- Single-neuron Approximation: Construction of block-diagonal weight matrices partitions the random weights into groups, each approximating a target scalar as in step 1.
- Single-layer Approximation: Matrix multiplications are approximated through repeated neuron-wise reductions.
- Full-network Approximation: The above steps are stacked, with errors controlled and propagated under the $1$-Lipschitzness of $\sigma$ and the norm bounds on the target weights, which results in a telescoping bound for the global error.
- MRSS in Structured Pruning: For convolutional architectures under channel-block and filter-removal constraints, the NSN-based MRSS theorem ensures a high-probability existence of suitable subsets in high dimensions. The block structures mirror those seen in practical structured pruning.
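The single-weight step referenced in the first bullet can be sketched as follows (uniform initialization and brute-force subset search are illustrative simplifications; the paper's construction additionally shares hidden units across neurons). A scalar map $x \mapsto wx$ is approximated by pruning a two-layer ReLU block $x \mapsto \sum_i s_i a_i \sigma(b_i x)$: indices with $b_i > 0$ contribute multiples of $\sigma(x)$ and indices with $b_i < 0$ multiples of $\sigma(-x)$, so matching $wx = w\sigma(x) - w\sigma(-x)$ reduces to two scalar subset-sum problems.

```python
import itertools
import numpy as np

def best_subset(values, target):
    """Brute-force mask over `values` whose sum is closest to `target`
    (exponential in len(values); fine for this toy size)."""
    best_mask, best_err = np.zeros(len(values), dtype=bool), abs(target)
    for r in range(1, len(values) + 1):
        for idx in itertools.combinations(range(len(values)), r):
            err = abs(target - values[list(idx)].sum())
            if err < best_err:
                best_err = err
                best_mask = np.zeros(len(values), dtype=bool)
                best_mask[list(idx)] = True
    return best_mask

rng = np.random.default_rng(3)
w = 0.37                                   # target scalar weight
n = 26                                     # hidden width of the random two-layer block
a, b = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)

pos, neg = b > 0, b < 0
mask = np.zeros(n, dtype=bool)
# sum_{i in S, b_i>0} a_i b_i  should approximate  w  (coefficient of relu(x)),
# sum_{i in S, b_i<0} a_i|b_i| should approximate -w  (coefficient of relu(-x)).
mask[pos] = best_subset((a * b)[pos], w)
mask[neg] = best_subset((a * np.abs(b))[neg], -w)

relu = lambda z: np.maximum(z, 0.0)
xs = np.linspace(-1, 1, 201)
pruned = np.array([np.sum(mask * a * relu(b * x)) for x in xs])
print(f"max |pruned(x) - w*x| on [-1,1]: {np.max(np.abs(pruned - w * xs)):.4f}")
```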
5. Experimental Validation and Numerical Results
Experiments validate the theoretical claims on practical feedforward networks and convolutional models (Pensia et al., 2020). Key findings:
- On MNIST, each target weight in a 2-layer 500-hidden-unit net was matched to within 0.01, and the resulting pruned network retained the original 97.19% test accuracy. A weight-by-weight mixed-integer programming approach was used (total solver time: 21.5 h for 21 variables per weight, 36 cores).
- In the same setting, SubsetSum-inspired layered random nets matched or outperformed naïve wider networks in the accuracy-vs-parameter tradeoff, supporting the value of the SubsetSum-style layer structure in sparse regimes. The Edge-Popup algorithm was used for mask selection without retraining.
| Architecture | Task | Pruning Strategy | Result |
|---|---|---|---|
| 2-layer FC (500 hid.) | MNIST | SubsetSum (per weight) | Exact acc. recovery, MIP for each weight |
| LeNet5 | MNIST | Structured, Edge-Popup | SubsetSum structure matches naïve wide nets |
These findings are consistent with the logarithmic over-parameterization bound in practical regimes and demonstrate that SubsetSum-inspired architectures transfer to practice.
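The per-weight mixed-integer search reported above can be posed, for a single weight, roughly as follows; this formulation uses scipy.optimize.milp and is an illustrative reconstruction, not the authors' exact setup:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def subset_sum_mip(x, w):
    """Choose a binary mask s minimizing |w - sum_i s_i x_i| via a small MIP.

    Variables: [s_1, ..., s_n, t]; minimize t subject to
    -t <= w - sum_i s_i x_i <= t,  s_i in {0, 1},  t >= 0.
    """
    n = len(x)
    c = np.concatenate([np.zeros(n), [1.0]])          # objective: minimize t
    A = np.vstack([np.append(x, -1.0),                # x.s - t <= w
                   np.append(x, +1.0)])               # x.s + t >= w
    constraints = LinearConstraint(A, [-np.inf, w], [w, np.inf])
    integrality = np.concatenate([np.ones(n), [0]])   # s binary, t continuous
    bounds = Bounds(np.zeros(n + 1), np.concatenate([np.ones(n), [np.inf]]))
    res = milp(c, constraints=constraints, integrality=integrality, bounds=bounds)
    s = np.round(res.x[:n]).astype(int)
    return s, abs(w - x @ s)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 21)        # 21 random candidates per weight, as in the experiments
mask, err = subset_sum_mip(x, w=0.42)
print(f"selected {mask.sum()} of {len(x)} candidates, error = {err:.2e}")
```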
6. Implications and Research Directions
- The log-overparameterization result renders earlier polynomial-width bounds obsolete (for fully connected nets), and further establishes the SubsetSum approach as nearly optimal for constant depth.
- The existence of structured strong lottery tickets implies that meaningful computational and memory savings are theoretically achievable without relaxing approximation guarantees, provided the pruning respects block/filter constraints (Cunha et al., 2023).
- The reduction of ticket search to SubsetSum invites the development of polynomial-time heuristics (such as relaxed integer programming or greedy selection) for practical structured lottery ticket discovery at initialization.
- For CNNs, the MRSS for NSN vectors addresses the precise dependencies arising from convolutional parameter-sharing, providing a foundation for extensions to other structured sparsity regimes.
- A plausible implication is that further generalizations of SubsetSum, possibly with application-specific constraints, may govern the existence of lottery tickets in increasingly complex or structured network families.
7. Connections and Limitations
The analytic framework relies heavily on existence results for SubsetSum and their high-dimensional variants. Extensions to other initialization distributions are possible, provided the distribution places sufficient mass in a neighborhood of zero. The log-over-parameterization construction is, up to constants and for constant depth, information-theoretically tight (Pensia et al., 2020). However, the constants can be large, and for increasing depth, polynomial dependencies arise in the CNN regime due to the compounded structure of convolution and block constraints (Cunha et al., 2023).
Experimental validation confirms the theoretical claims in low-dimensional settings, but scaling mixed-integer optimization to very large networks remains a significant computational challenge. Structured pruning extends theoretical guarantees into practical regimes for resource-constrained inference, especially in convolutional architectures.
Advances in random SubsetSum analysis and high-dimensional concentration inequalities are fundamental to further refinements. Future research may exploit these links to devise automated and efficient pruning algorithms, further closing the gap between theory and scalable practice in neural network sparsification.