
Strong Lottery Ticket Hypothesis (SLTH)

Updated 9 November 2025
  • SLTH is a theory stating that an over-parameterized neural network harbors a sparse, untrained subnetwork—identified by a binary mask at initialization—that can approximate any target network.
  • It employs rigorous probabilistic and combinatorial techniques, using subset-sum concentration inequalities, to guarantee the existence of these subnetworks across various architectures.
  • Empirical methods like Edge-Popup and genetic algorithms support SLTH's practical application, though discovering the optimal mask remains NP-hard and an open research challenge.

The Strong Lottery Ticket Hypothesis (SLTH) asserts that, within a sufficiently over-parameterized and randomly initialized neural network, there exists a sparse subnetwork—identified purely by masking weights at initialization—that approximates the function of any target network of the same architectural family without requiring any gradient-based training. The SLTH is a formal strengthening of the original Lottery Ticket Hypothesis, removing the necessity of retraining or fine-tuning, and is supported by rigorous probabilistic analysis across a wide range of neural network classes and settings.

1. Formal Statement and Theoretical Foundations

The SLTH postulates that for a given "target" neural network $f:\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}$ (e.g., a fixed $L$-layer ReLU network), a randomly initialized, overparameterized "source" network $g:\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}$, possibly of double depth and increased width, contains with high probability a binary mask $S$ such that the masked, untrained subnetwork $g_S$ approximates $f$ to within any desired accuracy $\eta>0$ on a bounded input domain:

$$\sup_{\|x\|\leq 1} \|f(x) - g_S(x)\| < \eta.$$

This result was first established for dense multi-layer perceptrons (MLPs) with random weight initialization and is formally stated in Theorem 3.1 of Malach et al. (Malach et al., 2020), with explicit dependence of the required width $k$ on the depth, input dimension, and target accuracy:

$$k = O\!\left(n\,s^3\,l^2/\eta^2 \cdot \log(n s l / \delta)\right).$$

Subsequent work extended the SLTH to architectures including convolutional neural networks, $G$-equivariant networks, transformers, variational quantum circuits, and quantized networks.
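
To make the objects in this statement concrete, the following NumPy sketch builds a small target network $f$, a wider random source network $g$, and binary masks, and measures the empirical sup-norm gap $\sup_{\|x\|\le 1}\|f(x)-g_S(x)\|$ over sampled inputs. The random mask search is purely illustrative; the theorem's existence guarantee rests on the subset-sum construction discussed in Section 2, and all sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights):
    """Forward pass of a ReLU MLP given a list of weight matrices."""
    h = x
    for W in weights[:-1]:
        h = relu(h @ W)
    return h @ weights[-1]

# Tiny target network f and a wider random source network g.
d0, d1, dL = 3, 4, 2
target = [rng.uniform(-1, 1, (d0, d1)), rng.uniform(-1, 1, (d1, dL))]

width = 64  # over-parameterized hidden width of the source network
source = [rng.uniform(-1, 1, (d0, width)), rng.uniform(-1, 1, (width, dL))]

def masked(weights, masks):
    """g_S: the source network with a binary mask applied entrywise."""
    return [W * M for W, M in zip(weights, masks)]

def sup_error(weights_a, weights_b, n_samples=2000):
    """Empirical sup-norm gap over random inputs with ||x|| <= 1."""
    X = rng.normal(size=(n_samples, d0))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    return np.abs(forward(X, weights_a) - forward(X, weights_b)).max()

# Crude random search over masks, purely to illustrate the quantities in the
# statement; the existence proof uses a subset-sum argument, not random search.
best = np.inf
for _ in range(200):
    masks = [rng.random(W.shape) < 0.5 for W in source]
    best = min(best, sup_error(target, masked(source, masks)))
print(f"best empirical sup error over sampled masks: {best:.3f}")
```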

2. Probabilistic and Combinatorial Mechanisms

The core technical tool underlying SLTH results is the Random Subset Sum Problem (RSSP). Central to the proofs is the fact that, given a sufficiently large pool of i.i.d. random weights (e.g., uniform in $[-1,1]$), one can select a subset whose sum approximates any target scalar to exponentially small error. This is made quantitative through subset-sum concentration inequalities (e.g., Lueker 1998) and, in the quantized setting, through the phase transition in the Number Partitioning Problem (Borgs, Chayes, Pittel 2001).
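
The brute-force sketch below illustrates the RSSP phenomenon: as the pool of i.i.d. uniform $[-1,1]$ weights grows, the best subset sum approximates a fixed target scalar with rapidly shrinking error. This is an empirical illustration of the concentration behavior, not the constructive argument used in the proofs; the pool sizes and target value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def best_subset_sum_error(pool, target):
    """Smallest |sum(subset) - target| over all 2^n subsets (brute force)."""
    n = len(pool)
    idx = np.arange(2 ** n)[:, None]
    masks = (idx >> np.arange(n)) & 1     # shape (2^n, n), entries 0/1
    sums = masks @ pool                   # all possible subset sums
    return np.abs(sums - target).min()

target = 0.73205  # arbitrary scalar in [-1, 1]
for n in (6, 10, 14, 18):
    pool = rng.uniform(-1.0, 1.0, size=n)
    err = best_subset_sum_error(pool, target)
    print(f"n = {n:2d}   best error = {err:.2e}")
```

The observed error typically shrinks roughly exponentially in the pool size, matching the qualitative prediction of the subset-sum concentration bounds.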

For deep networks, these subset-sum constructions are modularly composed to approximate all individual weights of the target network, using multiple randomized layers and block structures. Overparameterization requirements scale logarithmically in inverse error and dimensions, and are near-optimal due to cardinality-based lower bounds.

Recent advances, such as the Random Fixed-Size Subset Sum (RFSS) approach (Natale et al., 18 Oct 2024), provide guarantees not only on the existence but also on the sparsity (exact density) of the selected subnetworks, with tight entropy bounds:

$$n \geq C\,\frac{\log_2^2(k/\epsilon)}{H_2(k/n)},$$

where $H_2$ denotes the binary entropy function.
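
As a rough illustration of how this implicit bound scales, the short script below finds the smallest $n$ satisfying $n\,H_2(k/n) \ge C \log_2^2(k/\epsilon)$ for a hypothetical constant $C=1$; the constant and the choice of $k$ are placeholders, not values taken from the paper.

```python
import math

def H2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_n(k, eps, C=1.0):
    """Smallest n with n * H2(k/n) >= C * log2(k/eps)^2 (the RFSS-style bound)."""
    rhs = C * math.log2(k / eps) ** 2
    n = k + 1
    while n * H2(k / n) < rhs:   # left side is increasing in n, so this terminates
        n += 1
    return n

for eps in (1e-1, 1e-2, 1e-3):
    print(f"k=16, eps={eps:g} -> n >= {min_n(16, eps)}")
```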

3. Extensions to Structured and Equivariant Architectures

The SLTH extends beyond fully connected networks:

  • $G$-Equivariant Networks: A unified theoretical framework (Ferbach et al., 2022) establishes that, for any group $G$ and associated equivariant basis, a random $G$-equivariant network of doubled depth contains, with high probability, a $G$-equivariant sparse subnetwork approximating any target in the function class. The overparameterization is determined by group-theoretic invariants and basis cardinality.
  • Transformers and Attention: The SLTH has been proved for multi-head attention (MHA) and transformers without normalization layers (Otsuka et al., 6 Nov 2025). A randomly initialized MHA with per-head key/value dimension $k = O(d\log(H d^{3/2}/\epsilon))$ contains a binary-masked subnetwork approximating any target MHA. The construction leverages two-layers-for-one approximation and softmax Lipschitz control to establish the result uniformly over sequence length; a sketch of the masked-attention object appears after this list.
  • Quantum Circuits: In the variational quantum circuit (VQC) setting, the SLTH asserts the existence of a sparse subcircuit—obtained by masking parameters at initialization—that achieves performance on par with the full circuit, all without continuous parameter training (Kölle et al., 14 Sep 2025).
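
To fix ideas for the transformer case above, the sketch below applies entrywise binary masks to the query, key, value, and output projections of a small normalization-free multi-head attention block in NumPy. It only illustrates the masked object $g_S$ whose existence the theorem asserts; the mask here is random rather than the one produced by the two-layers-for-one construction, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, Wq, Wk, Wv, Wo):
    """Multi-head attention without normalization.
    X: (T, d); Wq, Wk, Wv: (H, d, k); Wo: (H, k, d)."""
    H, d, k = Wq.shape
    out = np.zeros_like(X)
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(k))   # (T, T) attention weights
        out += (A @ V) @ Wo[h]              # sum of per-head outputs
    return out

T, d, k, H = 5, 8, 16, 2
X = rng.normal(size=(T, d))

# Random "source" attention parameters and one binary mask per projection.
Wq = rng.uniform(-1, 1, (H, d, k))
Wk = rng.uniform(-1, 1, (H, d, k))
Wv = rng.uniform(-1, 1, (H, d, k))
Wo = rng.uniform(-1, 1, (H, k, d))

def binary_mask(W):
    return (rng.random(W.shape) < 0.5).astype(W.dtype)

# The masked, untrained subnetwork: prune entries of every projection matrix.
Y = mha(X, Wq * binary_mask(Wq), Wk * binary_mask(Wk),
        Wv * binary_mask(Wv), Wo * binary_mask(Wo))
print("masked MHA output shape:", Y.shape)   # (T, d), same as the input
```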

4. Quantized, Mixed-Precision, and Practical Constraints

The SLTH for quantized networks (Kumar et al., 14 Aug 2025) demonstrates that, for any $\delta$-quantized ReLU target network (with precision $\delta = 2^{-b}$), a randomly initialized mixed-precision network of depth $2\ell$ and width $O(d\log(1/\delta))$ per block can be masked and quantized to recover the target function exactly. This matches information-theoretic lower bounds on the number of representable mappings and generalizes the continuous setting, yielding an explicit recipe for constructing low-precision networks via overparameterization and pruning.
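
The quantized setting is special in that exact, rather than approximate, recovery is possible: sums of $\delta$-grid values stay on the $\delta$-grid. The sketch below uses a simple dynamic program to find a subset of random $\delta$-quantized values summing exactly to a $\delta$-quantized target weight. It is a toy illustration in the spirit of the number-partitioning argument, not the construction of Kumar et al.; the bit width and pool size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

b = 4                    # bit precision; grid spacing delta = 2^{-b}
delta = 2.0 ** (-b)

def exact_subset(pool_ints, target_int):
    """Indices of a subset of pool_ints summing exactly to target_int, or None.
    Classic dynamic program over the set of reachable integer sums."""
    reachable = {0: []}
    for i, v in enumerate(pool_ints):
        new = {}
        for s, idx in reachable.items():
            t = s + v
            if t not in reachable and t not in new:
                new[t] = idx + [i]
        reachable.update(new)
    return reachable.get(target_int)

# Random delta-quantized pool and a delta-quantized target weight,
# represented as integers (value / delta).
pool = rng.integers(-2 ** b, 2 ** b + 1, size=24)
target = int(rng.integers(-2 ** b, 2 ** b + 1))

subset = exact_subset(list(pool), target)
if subset is None:
    print("no exact representation found with this pool size")
else:
    print(f"target {target * delta:+.4f} = sum of",
          [f"{pool[i] * delta:+.4f}" for i in subset])
```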

Further, the introduction of $\varepsilon$-scale perturbations (Xiong et al., 2022) shows that allowing each random weight to move within an entrywise $\varepsilon$-ball around its initialization strictly reduces the required overparameterization. The main technical innovation is an $\varepsilon$-perturbed subset-sum bound, with the per-block overparameterization requirement scaling as

$$K_1 = O\!\left( \frac{\log(1/\eta)}{\log(5/4 + \varepsilon/2)} \right), \qquad K_2 = O\!\left( \frac{\log(1/\eta)}{1+\varepsilon} \right),$$

interpolating between pure pruning ($\varepsilon=0$) and full training ($\varepsilon\gg 1$).
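
The effect of $\varepsilon$-perturbation can be seen in a toy subset-sum experiment: if every selected weight may additionally move within $\pm\varepsilon$, a chosen subset of size $m$ covers an interval of width $2m\varepsilon$ around its sum, so the achievable error for a fixed pool drops as $\varepsilon$ grows. The sketch below makes this explicit; the pool size, target, and $\varepsilon$ values are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(4)

def best_error(pool, target, eps):
    """Best |sum - target| when each chosen weight may move within +/- eps."""
    n = len(pool)
    idx = np.arange(2 ** n)[:, None]
    masks = (idx >> np.arange(n)) & 1
    sums = masks @ pool
    slack = eps * masks.sum(axis=1)      # total wiggle room of the chosen subset
    return np.maximum(np.abs(sums - target) - slack, 0.0).min()

target = 0.4142
pool = rng.uniform(-1.0, 1.0, size=12)
for eps in (0.0, 0.05, 0.2, 1.0):
    print(f"eps = {eps:4.2f}   best error = {best_error(pool, target, eps):.2e}")
```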

5. Algorithms for Discovering Strong Lottery Tickets

While existence of strong lottery tickets is combinatorially assured, finding an explicit mask is NP-hard. Nevertheless, several search strategies have been empirically validated:

  • Gradient-based heuristics: Techniques such as Edge-Popup (score-based mask learning) are commonly used to approximate the optimal subnetwork in practice and align well with subset-sum theoretical guarantees in small to medium networks (Xiong et al., 2022, Otsuka et al., 6 Nov 2025); a minimal sketch follows this list.
  • Genetic and Evolutionary Algorithms: Gradient-free combinatorial search (e.g., steady-state genetic algorithms) efficiently explores the discrete mask space, frequently yielding smaller and better subnetworks than gradient-based approaches on binary tasks (Altmann et al., 7 Nov 2024).
  • Optimization via Projected SGD: Running SGD with an $\ell_\infty$-norm projection around the initialization can be viewed as realizing the $\varepsilon$-perturbation SLTH regime and improves lottery ticket test accuracy as $\varepsilon$ increases (up to a plateau), while the optimal sparsity of the subnetwork decreases with growing $\varepsilon$ (Xiong et al., 2022).
  • Exact Integer Programming: Mixed-integer programming solvers (e.g., Gurobi) can be applied to solve small instance subset-sum problems in the pruning of overparameterized layers, supporting the theoretical analysis but limited in scalability (Ferbach et al., 2022).
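
Below is a minimal PyTorch sketch of the Edge-Popup idea referenced above: weights are frozen at initialization, each weight carries a trainable score, the top-scoring fraction is kept as a binary mask, and gradients pass through the mask via a straight-through estimator. The architecture, initialization scaling, keep ratio, and training loop are illustrative placeholders, not the configuration used in the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    """Keep the top fraction of scores as a binary mask; straight-through gradient."""
    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = scores.flatten().topk(k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # pass gradients straight through to the scores

class EdgePopupLinear(nn.Module):
    """Linear layer with frozen random weights; only the mask scores are trained."""
    def __init__(self, d_in, d_out, keep_ratio=0.5):
        super().__init__()
        # Frozen random weights (Edge-Popup uses a signed Kaiming-constant init;
        # a scaled uniform init is used here for simplicity).
        w = torch.empty(d_out, d_in).uniform_(-1, 1) * (2.0 / d_in) ** 0.5
        self.weight = nn.Parameter(w, requires_grad=False)
        self.scores = nn.Parameter(torch.rand(d_out, d_in))  # learned scores
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.keep_ratio)
        return F.linear(x, self.weight * mask)

# Tiny usage example: fit a random teacher by learning masks only.
torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))
student = nn.Sequential(EdgePopupLinear(8, 64), nn.ReLU(), EdgePopupLinear(64, 1))
opt = torch.optim.SGD([p for p in student.parameters() if p.requires_grad], lr=0.1)
for step in range(200):
    x = torch.randn(128, 8)
    loss = F.mse_loss(student(x), teacher(x).detach())
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final MSE against the random teacher: {loss.item():.4f}")
```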

In quantum circuit settings, the evolutionary algorithm directly masks quantum gates without parameter updates, circumventing issues such as the barren plateau phenomenon (Kölle et al., 14 Sep 2025).
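
For intuition on such gradient-free mask search, the sketch below runs a steady-state genetic algorithm over binary masks of a frozen random network on a toy binary task; in the quantum setting the same loop would evaluate a masked circuit instead of a masked MLP. The population size, mutation rate, and task are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Frozen random network; only the binary mask over its weights is evolved.
d_in, hidden = 6, 32
W1 = rng.uniform(-1, 1, (d_in, hidden))
w2 = rng.uniform(-1, 1, hidden)
n_bits = W1.size + w2.size

X = rng.normal(size=(256, d_in))
y = (X[:, 0] * X[:, 1] > 0).astype(float)        # toy binary labels

def fitness(mask):
    """Accuracy of the masked, untrained subnetwork on the toy task."""
    m1 = mask[:W1.size].reshape(W1.shape)
    m2 = mask[W1.size:]
    h = np.maximum(X @ (W1 * m1), 0.0)
    logits = h @ (w2 * m2)
    return float(((logits > 0) == y).mean())

# Steady-state GA: small population, replace the worst individual each step.
pop = [rng.integers(0, 2, n_bits) for _ in range(20)]
fits = [fitness(m) for m in pop]
for _ in range(2000):
    a, b = rng.choice(len(pop), size=2, replace=False)
    cut = int(rng.integers(1, n_bits))           # one-point crossover
    child = np.concatenate([pop[a][:cut], pop[b][cut:]])
    flip = rng.random(n_bits) < 1.0 / n_bits     # bit-flip mutation
    child = np.where(flip, 1 - child, child)
    f = fitness(child)
    worst = int(np.argmin(fits))
    if f > fits[worst]:
        pop[worst], fits[worst] = child, f

best = int(np.argmax(fits))
print(f"best subnetwork accuracy: {fits[best]:.3f}, mask density: {pop[best].mean():.2f}")
```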

6. Empirical Evidence, Limitations, and Open Problems

Empirical validations confirm that SLTH-pruned networks can match or closely approximate the test accuracy of full, trained target networks across MLPs, CNNs, transformers, and VQCs. For instance, in random four-layer MLPs trained on MNIST, projected SGD followed by Edge-Popup provides >98% accuracy under mild perturbation ($\varepsilon\approx 0.2$), with increasing $\varepsilon$ yielding higher accuracy at lower pruning rates (Xiong et al., 2022). For quantized networks, exact function reconstruction is achieved for finite-bit targets (Kumar et al., 14 Aug 2025).

However, several open questions remain:

  • Computational Intractability: While existence of masks is guaranteed, there is no polynomial-time algorithm for mask selection in general, and the search problem remains combinatorial in practice.
  • Extension to Broader Function Classes: Current theoretical guarantees are clearest for ReLU activations, bounded-weight targets, and uniform or sum-bounded initialization; extensions to smooth activations, unbounded weights, and structural constraints (e.g., convolutions with weight sharing) are ongoing areas of research.
  • Sparsity–Overparameterization Trade-offs: The RFSS regime quantifies the precise balance, but practical algorithms for achieving theoretical sparsity in high dimensions are lacking (Natale et al., 18 Oct 2024).
  • Interplay with Training Dynamics: The relationship between SGD trajectories, mode stability, and the emergence of matching strong tickets in large-scale models remains incompletely understood. Mode-connectivity criteria suggest a phase transition during early training, after which strong tickets become stable under SGD noise (Frankle et al., 2019).
  • Scalability to Multi-class and Large-scale Settings: While binary classification and regression settings have robust results, genetic and score-based approaches degrade in multi-class or large-scale problems absent further regularization or diversity-preserving mechanisms (Altmann et al., 7 Nov 2024).

7. Implications and Future Directions

By placing the ability to approximate any target network within the random initial weights of a sufficiently large overparameterized model, the SLTH shifts the focus of network design and compression from weight optimization to combinatorial selection. This suggests new strategies for constructing efficient, compressed, and even quantized or mixed-precision models by architectural design and one-shot pruning at initialization, with minimal or no training.

Architecturally, the SLTH extends the universality paradigm: sufficiently random, overparameterized neural networks—dense, structured, equivariant, or quantum—are inherently universal approximators of their own subfamilies via masked subnetworks. A plausible implication is that future deep learning paradigms may decouple overparameterization from training budgets, emphasizing masking-discovery algorithms and initializations designed for mask-tuning.

Theoretical advances in subset-sum, fixed-size subset-sum, and number-partitioning provide not only tight estimates on width/depth/sparsity but also direct tools for extending SLTH to layered, quantized, and symmetry-constrained networks. Further development of mask-finding algorithms, especially those leveraging group structure or exploiting combinatorial properties of specific architectures (e.g., Transformers), is a central open direction.

In summary, the Strong Lottery Ticket Hypothesis delineates a rigorous, quantitative regime in which the search for efficient representations in neural networks is achieved entirely via initialization and combinatorial selection, bridging theoretical universality with practical network design and capacity control.
