
Lottery Ticket Hypothesis in Deep Learning

Updated 3 January 2026
  • The Lottery Ticket Hypothesis posits that a highly sparse subnetwork within a dense model can be isolated and trained from its (early) initialization to match the dense model's performance.
  • The iterative magnitude pruning (IMP) algorithm underpins this method by progressively removing low-magnitude weights and rewinding survivors to early initialization.
  • Research in LTH spans theoretical proofs, algorithmic innovations, and applications across CNNs, transformers, spiking neural networks, and quantum circuits.

The Lottery Ticket Hypothesis (LTH) posits that within a randomly initialized dense neural network there exists a highly sparse subnetwork—a "winning ticket"—that, when trained in isolation from the same or an early-stage initialization, can achieve accuracy on par with the original dense network. Over the past several years, LTH has evolved into a research program spanning empirical validation, formal theory, algorithmic innovations, and diverse application domains, including CNNs, transformers, spiking neural networks, variational quantum circuits, and generative models (Liu et al., 2024).

1. Formal Definition and Central Mechanisms

The canonical definition, due to Frankle and Carbin (2019), asserts that for a dense network f(x; θ₀) with random initialization θ₀ ∈ ℝ^d, there exists a binary mask m ∈ {0,1}^d such that training only the surviving parameters, i.e., performing SGD on f(x; m ⊙ θ₀), yields a solution with generalization performance comparable to the full model. The mask m specifies which weights are retained, with total sparsity s = 1 − ‖m‖₀/d at the desired compression level (Liu et al., 2024).
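The mask and sparsity bookkeeping in this definition amount to a few lines; a minimal NumPy sketch (variable names and the 10% keep-ratio are illustrative, not from any cited implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000
theta0 = rng.standard_normal(d)                 # random dense initialization θ0
m = (rng.random(d) < 0.1).astype(theta0.dtype)  # binary mask keeping ~10% of weights

subnet = m * theta0                             # parameters of f(x; m ⊙ θ0)
sparsity = 1.0 - np.count_nonzero(m) / d        # s = 1 − ‖m‖0 / d
```

Only the unmasked coordinates of `subnet` would receive gradient updates during training; the rest stay exactly zero.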

The most widely used algorithm to discover such winning tickets is Iterative Magnitude Pruning (IMP):

  • IMP Algorithm:
  1. Train the dense network to completion from initialization θ₀.
  2. Prune the smallest-magnitude p% of weights (globally or per-layer).
  3. Rewind surviving weights to θ₀ or an early checkpoint.
  4. Retrain the resulting sparse network; repeat steps 2–4 until the target sparsity is reached (Liu et al., 2024).

Variants include layer-wise importance metrics instead of global magnitude (Vandersmissen et al., 2023), structured or hardware-friendly masks, and continuous sparsification via learnable scores (Liu et al., 2024).

2. Theoretical Foundations

2.1 Existence Results

Theoretical work has established that for sufficiently overparameterized ReLU networks, with width scaling as O(W log(W/ε)), a random network contains a subnetwork (found by pruning alone) that can approximate any width-W target network to within ε error without further weight updates (Liu et al., 2024). Extensions replace the Lipschitz assumption (valid for ReLU) with probabilistic methods for discontinuous or spatio-temporal settings such as SNNs (Yao et al., 2023).

2.2 Analytical Frameworks

A PAC-Bayesian analysis with spike-and-slab priors elucidates the tradeoff between sharpness (loss landscape flatness) and proximity to initialization. This framework rationalizes why magnitude-pruned winning tickets generalize well only within a specific regime—too sharp minima (small LR) reduce robustness, while too far from initialization increases complexity penalties in the bound (Sakamoto et al., 2022).

2.3 Strong Lottery Ticket Hypothesis

The strong LTH asserts that, without any training, there exists a pruning mask such that the initial random network, when pruned, approximates a target function; a probabilistic subset-sum construction formalizes this for both continuous activations and, via new analysis, for spiking neural networks (Xiong et al., 2022, Yao et al., 2023). Allowing small perturbations of the initialization enables considerable reductions in the required overparameterization (Xiong et al., 2022).
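The subset-sum intuition can be demonstrated directly: with enough random candidate weights, some subset sums very close to any moderate target, so pruning alone can realize a target weight. A brute-force toy (illustrative only, not the paper's construction):

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
target = 0.73                            # a weight of the target network
candidates = rng.uniform(-1.0, 1.0, 12)  # random weights of the overparameterized net

# Search every pruning pattern (subset) for the one whose surviving weights sum
# closest to the target; no training is involved, only pruning.
best_err, best_subset = abs(target), ()
for r in range(1, candidates.size + 1):
    for subset in itertools.combinations(range(candidates.size), r):
        err = abs(candidates[list(subset)].sum() - target)
        if err < best_err:
            best_err, best_subset = err, subset
print(f"approximation error: {best_err:.4f}")
```

With 12 candidates there are 4096 subsets, so the best subset sum lands very close to the target; the overparameterization requirements in the cited proofs quantify exactly this effect.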

3. Algorithmic Innovations and Extensions

3.1 Layerwise Importance and Non-Uniqueness

Metrics beyond global magnitude, such as per-layer normalizations of weight magnitude (including softmax and min-max normalization), can yield high-quality lottery tickets with little weight overlap, demonstrating non-uniqueness in mask selection and highlighting the existence of a “core” of stable, critical connections (Vandersmissen et al., 2023).
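A minimal sketch of per-layer score normalization, assuming simple min-max and softmax variants (the exact metrics in the cited work differ):

```python
import numpy as np

def normalized_scores(layer_weights, scheme="minmax"):
    """Per-layer importance scores: normalize |w| within each layer so that
    magnitudes from layers with very different scales become comparable."""
    out = []
    for w in layer_weights:
        a = np.abs(np.ravel(w))
        if scheme == "minmax":
            s = (a - a.min()) / (a.max() - a.min() + 1e-12)
        else:  # "softmax"
            e = np.exp(a - a.max())  # subtract max for numerical stability
            s = e / e.sum()
        out.append(s)
    return out

rng = np.random.default_rng(0)
layers = [rng.standard_normal(100) * 0.01,  # a small-scale layer
          rng.standard_normal(100) * 1.0]   # a large-scale layer
mm = normalized_scores(layers, "minmax")
sm = normalized_scores(layers, "softmax")
```

Ranking the normalized scores globally and pruning the bottom fraction yields per-scheme masks; as the cited study shows, different schemes select substantially different weights while all producing viable tickets.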

3.2 Dual and Elastic LTH

The Dual Lottery Ticket Hypothesis (DLTH) demonstrates that uniform-randomly selected subnetworks can be transformed into performant tickets by a "Random Sparse Network Transformation"—a sparsity-aware θ0Rd\theta_0 \in \mathbb{R}^d4 regularization that extrudes information from masked weights prior to pruning (Bai et al., 2022). The Elastic LTH (E-LTH) enables transfer of winning tickets across architectures by stretching, squeezing, and reordering subnetworks, exploiting residual and block-wise structural symmetries (Chen et al., 2021).

3.3 Permutation Invariance and Alignment

A major limitation of standard LTH is that a mask found on one initialization typically fails on a new one. Permuting the mask in accordance with activation correlations (weight symmetry alignment) enables reuse of precomputed tickets across random initializations, matching the basin of the new weight vector and restoring sparse trainability to near-baseline levels (Adnan et al., 2025).
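The alignment idea can be illustrated with a greedy toy: if the target network's units are a permutation of the source's, matching units by activation correlation recovers that permutation and lets the precomputed mask be carried over (a simplified sketch, not the paper's algorithm):

```python
import numpy as np

def align_mask(mask, acts_src, acts_tgt):
    """Greedy sketch of permutation alignment: pair each target unit with the
    free source unit whose activations correlate most strongly, then carry the
    per-unit source mask over to the target ordering."""
    corr = acts_src.T @ acts_tgt               # (src_units, tgt_units) similarity
    n_src, n_tgt = corr.shape
    perm = np.zeros(n_tgt, dtype=int)
    free = set(range(n_src))
    for tgt in range(n_tgt):
        src = max(free, key=lambda i: abs(corr[i, tgt]))
        perm[tgt] = src
        free.remove(src)
    return mask[perm], perm

rng = np.random.default_rng(0)
acts = rng.standard_normal((500, 8))           # activations of 8 source units
p = rng.permutation(8)
acts_tgt = acts[:, p]                          # target net = permuted source units
mask = (rng.random(8) < 0.5).astype(float)     # per-unit mask found on the source
aligned, perm = align_mask(mask, acts, acts_tgt)
```

Because each unit correlates far more strongly with its own permuted copy than with any other unit, the greedy matching recovers the hidden permutation exactly in this toy.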

3.4 Signs and Normalization

Recent advances demonstrate that the parameter sign configuration—rather than fine magnitude—along with normalization parameters, encodes the crucial information for generalization in a mask. New algorithms (e.g., AWS) commit the sign of weights into the transferable structure, enabling one-shot, reusable tickets independent of explicit initialization (Oh et al., 2025).
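The core sign-transfer operation is simple to state; a toy sketch (methods such as AWS involve considerably more than this):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_trained = rng.standard_normal(16)  # stand-in for weights after training
theta_fresh = rng.standard_normal(16)    # a brand-new random initialization

# Transfer only the sign pattern learned during training; the magnitudes come
# from the fresh init. This illustrates why sign structure alone can carry the
# generalization-relevant information.
theta_reused = np.sign(theta_trained) * np.abs(theta_fresh)
```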

4. Empirical Evaluations and Special Domains

4.1 Supervised Learning

IMP reliably finds tickets at 80–99% sparsity on MNIST, CIFAR-10/100, TinyImageNet, and ImageNet, typically with only a minor test-accuracy drop at moderate compression. Structured pruning and data-efficient variants (early-bird tickets, GraSP) accelerate mask discovery (Liu et al., 2024).

4.2 Generative and Structured Models

Applied to Denoising Diffusion Probabilistic Models (DDPMs), LTH yields subnetworks at 90–99% sparsity without sample-quality degradation (measured by FID), with large reductions in parameter count and roughly 90% savings in FLOPs (Jiang et al., 2023). Layerwise-varying sparsity, allocating more weights to upstream blocks, preserves fidelity more effectively than uniform pruning at extreme sparsities.
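A hypothetical allocation scheme makes the layerwise-varying idea concrete: distribute a global keep-budget so that upstream layers stay denser (the function and its bias parameter are illustrative, not the method of the cited paper):

```python
import numpy as np

def layerwise_sparsity(layer_sizes, overall=0.95, upstream_bias=0.5):
    """Toy allocation of a global sparsity budget that keeps earlier (upstream)
    layers denser while matching the overall budget."""
    sizes = np.asarray(layer_sizes, dtype=float)
    n = sizes.size
    weights = np.linspace(1.0 + upstream_bias, 1.0 - upstream_bias, n)
    keep = weights * sizes                               # favor upstream layers
    keep *= (1.0 - overall) * sizes.sum() / keep.sum()   # match the global budget
    keep = np.minimum(keep, sizes)                       # cannot keep more than exists
    return 1.0 - keep / sizes                            # per-layer sparsity

s = layerwise_sparsity([1_000, 2_000, 4_000, 8_000], overall=0.95)
```

The returned sparsities increase from the first layer to the last, so upstream blocks retain proportionally more weights while the network as a whole still hits the 95% target.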

4.3 Spiking Neural Networks and Quantum Circuits

LTH naturally extends to deep SNNs: IMP locates winning tickets at up to 97% sparsity, and new pruning criteria (based on probabilistic influence rather than magnitude) yield improved performance (Kim et al., 2022, Yao et al., 2023). For variational quantum circuits, both weak and strong LTH are observed: subnetworks found by iterative pruning retain 26–45% of parameters while reaching baseline accuracy, and pruning mitigates barren plateaus (Kölle et al., 2025).

4.4 Transformers and Data-Level Sparsification

In vision transformers, conventional weight-level LTH is difficult; instead, the "Data-Level Lottery Ticket Hypothesis" identifies a highly informative subset of input patches (per example) such that performance matches or exceeds using dense inputs, supporting a generalization of LTH to data-modality sparsification (Shen et al., 2022).
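A data-level ticket can be sketched as top-k patch selection; here patches are scored by embedding norm, a simple stand-in for the informativeness criterion of the cited work:

```python
import numpy as np

def select_patches(patches, keep=0.25):
    """Data-level sparsification sketch: score each patch embedding by its L2
    norm and keep only the top fraction, preserving original patch order."""
    scores = np.linalg.norm(patches, axis=1)     # one score per patch
    k = max(1, int(keep * patches.shape[0]))
    idx = np.sort(np.argsort(-scores)[:k])       # top-k indices, in patch order
    return patches[idx], idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))          # e.g., 14x14 ViT patch embeddings
kept, idx = select_patches(tokens, keep=0.25)
```

Only the kept patches are fed to the transformer, shrinking the input sequence (and hence attention cost) by the same factor as the keep ratio.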

5. Interpretability, Generalization, and Explainability

Lottery tickets, at moderate sparsity, not only reproduce the original model’s predictive accuracy but also its explanatory features, as assessed by Grad-CAM and concept bottleneck analyses. At very high sparsities (>70%), both predictive and explanatory consistency degrade: models may attend to spurious pixels or lose reliance on domain-relevant concepts (Ghosh et al., 2023). Explainability-aware regularization is suggested to maintain interpretation consistency when aggressively pruning.

PAC-Bayes analysis provides a unified explanation: winning tickets must locate a minimum that is flat (insensitive to perturbation in unpruned subspace) and not too distant from initialization, promoting both accuracy and robustness (Sakamoto et al., 2022).

6. Open Challenges and Future Directions

Significant issues remain regarding hardware efficiency: unstructured masks are ill-matched to current accelerators, driving interest in structured masks and hardware-algorithm co-design. Theoretical understanding of LTH for dynamic sparsity, large transformers, and fully diffusion-based models remains incomplete. Empirical reproducibility is hindered by inconsistent experimental protocols; community benchmarks and open frameworks are needed for progress (Liu et al., 2024). Questions of uniqueness (“is there one winning ticket?”), universal transferability, and the interaction with data and optimization structures are active topics.

Finally, the evolution of LTH encompasses not only a practical recipe for network compression, but also a broader inquiry into the geometry, symmetry, and criticality of sparse solutions in high-dimensional loss landscapes. The field continues to expand into new architectures and paradigms, probing the fundamental relationship between architecture, initialization, optimization, and generalization in sparse neural computation.
