Concrete Ticket Search (CTS)
- Concrete Ticket Search (CTS) is an algorithm that uses a continuous Concrete relaxation to identify sparse, high-performing subnetworks in overparameterized neural networks.
- It leverages an adaptive GradBalance scheme to enforce precise sparsity and a reverse KL divergence objective to preserve the original network's training dynamics.
- CTS achieves notable computational speedups (up to 12× faster) while reliably passing sanity checks compared to traditional lottery ticket and pruning methods.
Concrete Ticket Search (CTS) is an algorithm for discovering highly sparse, trainable subnetworks (so-called "winning tickets") within overparameterized neural networks at or near initialization. Motivated by the limitations of both Lottery Ticket Rewinding (LTR) and saliency-based Pruning-at-Initialization (PaI) methods, CTS frames ticket search as a combinatorial optimization over binary masks and leverages a continuous, low-variance relaxation using Concrete (Gumbel-softmax) distributions. Combined with an adaptive gradient-balancing scheme (GradBalance) for rigorous sparsity control, CTS efficiently produces high-performing subnetworks that robustly pass established sanity checks and match or exceed LTR accuracy while requiring orders-of-magnitude less computation (Arora et al., 8 Dec 2025).
1. Formalization of Subnetwork Search
The core objective of CTS is to find a binary mask $m \in \{0,1\}^d$ for a neural network $f(x; \theta \odot m)$, such that exactly $k = \lfloor \rho d \rfloor$ parameters (out of $d$ total) are retained, with $\rho$ the target density. The discrete combinatorial problem is:

$$\min_{m \in \{0,1\}^d} \; \mathcal{L}(\theta \odot m) \quad \text{s.t.} \quad \|m\|_0 = k,$$

where $\mathcal{L}$ may be the standard task loss or auxiliary objectives reflecting preserved training dynamics. As directly optimizing over the $2^d$ binary masks is intractable, CTS proposes a continuous relaxation.
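As a concrete illustration of the exact-$k$ constraint, the following minimal PyTorch sketch builds and applies a hard mask at a target density $\rho$; the magnitude-based selection and all names here are illustrative assumptions, not the CTS selection criterion:

```python
import torch

def hard_mask(theta: torch.Tensor, rho: float) -> torch.Tensor:
    """Keep exactly k = floor(rho * d) entries of theta (here chosen by
    magnitude, purely for illustration) and zero out the rest."""
    d = theta.numel()
    k = int(rho * d)
    mask = torch.zeros(d)
    mask[torch.topk(theta.abs().flatten(), k).indices] = 1.0  # ||m||_0 = k exactly
    return mask.view_as(theta)

theta = torch.randn(1000)
m = hard_mask(theta, rho=0.01)     # 1% target density
assert int(m.sum().item()) == 10   # exact-k constraint satisfied
sparse_theta = theta * m           # theta ⊙ m
```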
2. Concrete Relaxation and Optimization
CTS introduces retention probabilities $p \in [0,1]^d$ and models each mask entry as $m_i \sim \mathrm{Bernoulli}(p_i)$, leading to the relaxed problem:

$$\min_{p \in [0,1]^d} \; \mathbb{E}_{m \sim \mathrm{Bernoulli}(p)}\big[\mathcal{L}(\theta \odot m)\big] \quad \text{s.t.} \quad \textstyle\sum_i p_i = k.$$

To backpropagate efficiently through discrete samples, CTS adopts the Binary Concrete (Gumbel-softmax) reparameterization. Each “soft mask” variable $\tilde{m}_i$ is obtained as:

$$\tilde{m}_i = \sigma\!\left(\frac{\log p_i - \log(1 - p_i) + \log u_i - \log(1 - u_i)}{\tau}\right), \qquad u_i \sim \mathrm{Uniform}(0,1),$$

with $\sigma$ the sigmoid and $\tau$ the Concrete temperature. At lower $\tau$, $\tilde{m}_i$ concentrates near $\{0,1\}$, closely reflecting hard masks. At the end of optimization, the top-$k$ elements of $p$ are set to $1$ to form the final deterministic mask.
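A minimal PyTorch sketch of the Binary Concrete sampling step above; the logit parameterization ($\mathrm{logits}_i = \log p_i - \log(1-p_i)$) and all names are illustrative assumptions:

```python
import torch

def concrete_mask(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Sample a soft mask in (0,1)^d via the Binary Concrete (Gumbel-softmax)
    relaxation; gradients flow back into `logits`."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)  # u_i ~ Uniform(0,1)
    logistic_noise = torch.log(u) - torch.log(1 - u)   # log u_i - log(1 - u_i)
    return torch.sigmoid((logits + logistic_noise) / tau)

# lower tau pushes samples toward {0, 1}, approximating a hard mask
logits = torch.zeros(10, requires_grad=True)  # p_i = 0.5 initially
soft_mask = concrete_mask(logits, tau=0.1)
```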
3. GradBalance: Adaptive Sparsity Enforcement
Naive Lagrangian approaches for enforcing the mask density constraint are prone to instability and require manual tuning. CTS introduces the “GradBalance” scheme, which adaptively scales the gradient of the sparsity constraint to balance the magnitude of the objective gradient.
Let the normalized sparsity constraint be

$$g(p) = \frac{1}{d}\sum_{i=1}^{d} p_i - \rho.$$

Let $g_{\mathcal{L}} = \nabla_p \mathcal{L}$ (objective gradient) and $g_C = \nabla_p g(p)$ (constraint gradient). If the mask is too dense ($g(p) > 0$), the multiplier $\lambda$ is set by scaling $\|g_C\|$ to match $\|g_{\mathcal{L}}\|$, with smoothing. The update step becomes

$$p \leftarrow p - \eta\,\big(g_{\mathcal{L}} + \lambda\, g_C\big).$$
This adaptive approach enforces the target mask density without sensitive hyperparameter selection.
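The following PyTorch-style sketch illustrates one GradBalance-style update as described above; the exact scaling and smoothing rule shown here is a plausible reading of the description, not the paper's verbatim implementation:

```python
import torch

def gradbalance_step(logits, loss, rho, lam, lr=0.1, beta=0.9):
    """One update of the mask logits: the sparsity-constraint gradient is
    rescaled to match the objective gradient's magnitude (sketch only)."""
    p = torch.sigmoid(logits)                  # retention probabilities
    g = p.mean() - rho                         # normalized density constraint g(p)
    grad_obj = torch.autograd.grad(loss, logits, retain_graph=True)[0]
    grad_con = torch.autograd.grad(g, logits)[0]
    # scale the constraint gradient so its norm matches the objective gradient's
    lam_raw = (grad_obj.norm() / (grad_con.norm() + 1e-12)).item()
    lam = beta * lam + (1 - beta) * lam_raw    # smoothed multiplier
    step = grad_obj + (lam if g.item() > 0 else 0.0) * grad_con  # push only when too dense
    with torch.no_grad():
        logits -= lr * step
    return logits, lam
```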
4. CTS Objectives: Reverse KL and Other Distillation-Based Losses
While $\mathcal{L}$ can be the task loss, CTS empirically benefits from objectives designed to preserve training dynamics. The reverse Kullback–Leibler (KL) divergence between the output of the sparse “ticket” network and the original dense network,

$$\mathcal{L}_{\mathrm{RKL}} = \mathrm{KL}\!\left(q_{\tilde{m}} \,\big\|\, q_{\mathrm{dense}}\right),$$

where $q_{\tilde{m}} = \mathrm{softmax}\big(f(x;\theta \odot \tilde{m})\big)$ and $q_{\mathrm{dense}} = \mathrm{softmax}\big(f(x;\theta)\big)$, encourages the mask to preserve initialization-time functional behavior; a minimal sketch of this objective follows the list below. Other applicable objectives include:
- Relative loss change (SNIP)
- Negative gradient norm (GraSP)
- Feature-map matching
- Gradient-matching (first-step direction)
Among these, reverse KL and task loss lead to subnetworks that are effective at very high sparsities.
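As referenced above, a minimal PyTorch-style sketch of the reverse-KL objective between the masked (“ticket”) and dense logits; the function name and argument layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(logits_ticket: torch.Tensor, logits_dense: torch.Tensor) -> torch.Tensor:
    """KL(q_ticket || q_dense): the sparse network's output distribution is the
    first argument, so probability mass the ticket places where the dense
    network does not is penalized heavily."""
    log_q_ticket = F.log_softmax(logits_ticket, dim=-1)
    log_q_dense = F.log_softmax(logits_dense, dim=-1)
    q_ticket = log_q_ticket.exp()
    return (q_ticket * (log_q_ticket - log_q_dense)).sum(dim=-1).mean()
```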
5. Algorithmic Framework and Computational Characteristics
The CTS algorithm proceeds as follows:
- Initial Training: Train the dense network for $t_0$ steps to obtain weights $\theta_{t_0}$.
- Optimization of Mask Probabilities: Freeze $\theta_{t_0}$, initialize $p$ at the target density $\rho$, and iteratively update $p$ using the Concrete relaxation and GradBalance over the ticket-search epochs.
- Mask Finalization: Select the top-$k$ entries of $p$ to form the deterministic subnetwork mask $m$.
- Final Training: Train the masked subnetwork for the remaining steps of the training budget.

Key hyperparameters include the Concrete temperature $\tau$, the learning rate for $p$, and the GradBalance smoothing coefficient. CTS requires no manual selection of Lagrange multipliers and has no sensitive ticket-search or constraint hyperparameters.
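Putting the pieces together, a high-level sketch of the ticket-search stage might look as follows; it reuses the helper functions sketched earlier (`concrete_mask`, `reverse_kl_loss`, `gradbalance_step`), and the `model(x, weights)` signature, schedule, and hyperparameter values are placeholder assumptions rather than the paper's implementation:

```python
import torch

def cts_ticket_search(model, theta_frozen, data_loader, rho, tau=0.1, epochs=5, lr=0.1):
    """Search for a mask over frozen weights, then binarize to exact density rho.
    `model(x, weights)` is assumed to run a forward pass with explicit weights."""
    d = theta_frozen.numel()
    k = int(rho * d)
    # initialize logits so that sigmoid(logits) ≈ rho (i.e. p starts at the target density)
    init_val = torch.logit(torch.tensor(rho)).item()
    logits = torch.full((d,), init_val, requires_grad=True)
    lam = 0.0
    for _ in range(epochs):
        for x, _ in data_loader:
            soft_mask = concrete_mask(logits, tau)                # Concrete relaxation
            logits_ticket = model(x, theta_frozen * soft_mask)    # masked forward pass
            with torch.no_grad():
                logits_dense = model(x, theta_frozen)             # dense reference
            loss = reverse_kl_loss(logits_ticket, logits_dense)   # distillation objective
            logits, lam = gradbalance_step(logits, loss, rho, lam, lr=lr)
    # finalize: keep the k most-likely-retained entries (top-k of p, equivalently of the logits)
    mask = torch.zeros(d)
    mask[torch.topk(logits, k).indices] = 1.0
    return mask
```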
In computational terms, CTS eliminates the prune–retrain cycles (each a full or partial retrain) required by LTR. Ticket search proceeds over a frozen network; e.g., on ResNet-20 (CIFAR-10), 7.9 minutes suffice to reach 99.3% sparsity with 74.0% accuracy, compared to 95.2 minutes for LTR at 68.3% accuracy. On ImageNet with ResNet-50, CTS achieves a ∼12× speedup with higher accuracy in the 99% sparse regime.
| Method | Compute (epochs) | Test Acc. (%) | Sanity Checks Passed |
|---|---|---|---|
| LTR | 3058 | 80.90 | ✔ |
| SNIP | 160 | 67.73 | ✗ |
| GraSP | 160 | 62.59 | ✗ |
| SynFlow | 161 | 70.18 | ✗ |
| Gem-Miner | 320 | 77.89 | ✔ |
| Quick CTS | 180 | 79.04 | ✔ |
| CTS | 320 | 80.26 | ✔ |
6. Empirical Results and Sanity Checks
CTS was evaluated on CIFAR-10 (ResNet-20, VGG-16) and ImageNet (ResNet-50). In high-sparsity regimes (from 95% up to 99.8% sparsity), CTS consistently outperformed both LTR and all tested PaI methods in terms of accuracy, computational efficiency, and robustness to established ablations.
Following standard sanity checks, a genuine winning ticket should lose accuracy if its mask is shuffled within layers, mask-selection scores are inverted, or the kept weights are re-initialized. CTS passes all such checks, with accuracy dropping markedly under these ablations, in contrast to PaI methods, which often do not. Layerwise analysis shows that CTS (like LTR) preserves density in the first and last layers while heavily pruning intermediate layers, whereas saliency-based PaI often collapses critical layers.
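The first of these ablations (within-layer mask shuffling) can be expressed as the following minimal sketch; per-layer mask tensors and names are assumed for illustration, and a valid ticket should lose substantial accuracy after this operation:

```python
import torch

def shuffle_mask_within_layers(masks: dict) -> dict:
    """Randomly permute each layer's mask: per-layer density is preserved,
    but which specific weights are kept is destroyed (sanity-check ablation)."""
    shuffled = {}
    for name, m in masks.items():
        flat = m.flatten()
        perm = torch.randperm(flat.numel())
        shuffled[name] = flat[perm].view_as(m)
    return shuffled
```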
CTS tickets closely match the logit trajectories of the dense parent under the reverse-KL objective, and the retained subnetworks follow the same early feature-map and gradient-norm dynamics as their dense ancestors. This demonstrates effective preservation of training dynamics, which is pivotal for accuracy at high sparsity.
7. Practical Implications and Significance
CTS reframes lottery ticket discovery as a single, continuous optimization, combining (i) a low-variance Concrete relaxation of the mask search, (ii) an adaptive GradBalance scheme for precise sparsity, and (iii) knowledge-distillation objectives designed to preserve training dynamics. The method achieves (1) high-quality subnetworks that match or surpass LTR in the highly sparse regime; (2) compliance with all established winning-ticket sanity checks; and (3) a dramatic reduction in computational burden, typically an order-of-magnitude speedup (up to ∼12×) at high sparsity. This suggests CTS is a robust and efficient approach for sparse subnetwork extraction, particularly where computational efficiency and verification of winning-ticket properties are essential (Arora et al., 8 Dec 2025).