
Concrete Ticket Search (CTS)

Updated 15 December 2025
  • Concrete Ticket Search (CTS) is an algorithm that uses a continuous Concrete relaxation to identify sparse, high-performing subnetworks in overparameterized neural networks.
  • It leverages an adaptive GradBalance scheme and reverse KL divergence to enforce precise sparsity and maintain the original network's training dynamics.
  • CTS achieves notable computational speedups (up to 12× faster) while reliably passing sanity checks compared to traditional lottery ticket and pruning methods.

Concrete Ticket Search (CTS) is an algorithm for discovering highly sparse, trainable subnetworks (so-called "winning tickets") within overparameterized neural networks at or near initialization. Motivated by the limitations of both Lottery Ticket Rewinding (LTR) and saliency-based Pruning-at-Initialization (PaI) methods, CTS frames ticket search as a combinatorial optimization over binary masks and leverages a continuous, low-variance relaxation using Concrete (Gumbel-softmax) distributions. Combined with an adaptive gradient-balancing scheme (GradBalance) for rigorous sparsity control, CTS efficiently produces high-performing subnetworks that robustly pass established sanity checks and match or exceed LTR accuracy while requiring orders-of-magnitude less computation (Arora et al., 8 Dec 2025).

1. Problem Formulation

The core objective of CTS is to find a binary mask $m \in \{0,1\}^d$ for a neural network $f(x;\theta)$, such that exactly $\kappa d$ parameters (out of $d$ total) are retained, with $\kappa \in (0,1]$ the target density. The discrete combinatorial problem is:

$$\min_{m\in\{0,1\}^d} \ \mathbb{E}_{(x, y)\sim\mathcal{D}} \left[\mathcal{R}\big(f(x;\, m \odot \theta),\, y\big)\right] \quad \text{s.t.} \quad \|m\|_0 = \kappa d \qquad (1)$$

where $\mathcal{R}$ may be the standard task loss or auxiliary objectives reflecting preserved training dynamics. As directly optimizing over $\binom{d}{\kappa d}$ masks is intractable, CTS proposes a continuous relaxation.
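As a quick numeric illustration of this intractability, the following toy Python snippet (the layer size and density are hypothetical, not taken from the paper) counts the admissible masks for a single small layer:

```python
# Toy illustration: even for one small layer, the number of masks satisfying the
# constraint in Eq. (1) is astronomically large, so exhaustive search is hopeless.
import math

d = 10_000          # hypothetical parameter count of one small layer
kappa = 0.01        # keep 1% of the weights
k = int(kappa * d)  # number of retained parameters

num_masks = math.comb(d, k)  # |{m : ||m||_0 = kappa * d}|
print(f"~10^{len(str(num_masks)) - 1} admissible masks")
```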

2. Concrete Relaxation and Optimization

CTS introduces retention probabilities $\alpha \in [0,1]^d$ and models each mask entry as $m_j \sim \text{Bernoulli}(\alpha_j)$, leading to the relaxed problem:

$$\min_{\alpha \in [0,1]^d} \ \mathbb{E}_{m\sim\text{Bern}(\alpha)}\left[\mathcal{R}\big(f(x;\, m \odot \theta),\, y\big)\right] \quad \text{s.t.} \quad \frac{1}{d}\sum_{j=1}^d \alpha_j \approx \kappa \qquad (2)$$

To backpropagate efficiently through discrete samples, CTS adopts the Binary Concrete (Gumbel-softmax) reparameterization. Each “soft mask” variable $s_j$ is obtained as:

$$s_j = \sigma\!\left(\frac{\alpha_{\text{logit},j} + \varepsilon_j}{\tau}\right), \quad \varepsilon_j \sim \text{Logistic}(0,1), \quad \tau > 0$$

with $\sigma(z)$ the sigmoid and $\tau$ the Concrete temperature. At lower $\tau$, $s_j$ concentrates near $\{0,1\}$, closely reflecting hard masks. At the end of optimization, the top-$(\kappa d)$ elements of $\alpha$ are set to 1 to form the final deterministic mask.
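A minimal PyTorch-style sketch of this sampling step and of the final top-$(\kappa d)$ hard-mask selection, assuming a flattened vector of mask logits (the names `sample_soft_mask`, `finalize_mask`, and `alpha_logit` are illustrative, not the paper's code):

```python
# Binary Concrete (Gumbel-softmax) relaxation of the mask, as described above.
import torch

def sample_soft_mask(alpha_logit: torch.Tensor, tau: float = 2.0 / 3.0) -> torch.Tensor:
    """Draw a relaxed mask s in (0,1)^d from per-parameter retention logits."""
    u = torch.rand_like(alpha_logit).clamp_(1e-6, 1 - 1e-6)
    eps = torch.log(u) - torch.log1p(-u)             # Logistic(0, 1) noise
    return torch.sigmoid((alpha_logit + eps) / tau)  # soft mask, differentiable w.r.t. logits

def finalize_mask(alpha_logit: torch.Tensor, kappa: float) -> torch.Tensor:
    """Keep the top-(kappa * d) entries as a hard {0,1} mask at the end of ticket search."""
    d = alpha_logit.numel()
    k = int(round(kappa * d))
    mask = torch.zeros_like(alpha_logit)
    mask[torch.topk(alpha_logit, k).indices] = 1.0
    return mask
```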

3. GradBalance: Adaptive Sparsity Enforcement

Naive Lagrangian approaches for enforcing the mask density constraint are prone to instability and require manual tuning. CTS introduces the “GradBalance” scheme, which adaptively scales the gradient of the sparsity constraint so that its magnitude is balanced against that of the objective gradient.

Let the normalized sparsity constraint be

$$\mathcal{L}_{\rm sparsity} = \frac{1}{\kappa d} \sum_j \alpha_j - 1$$

Let $g_{\text{obj}} = \nabla_{\alpha_{\text{logit}}}\mathcal{R}$ denote the objective gradient and $g_{\text{spa}} = \nabla_{\alpha_{\text{logit}}}\mathcal{L}_{\rm sparsity}$ the constraint gradient. If the mask is too dense, the multiplier $\lambda$ is chosen so that $\lambda\,\|g_{\text{spa}}\|_2$ matches $\|g_{\text{obj}}\|_2$, with smoothing applied across steps. The update step becomes

$$g_{\alpha} = g_{\text{obj}} + \lambda\, g_{\text{spa}}$$

This adaptive approach enforces the target mask density without sensitive hyperparameter selection.
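One plausible realization of a single GradBalance step is sketched below; the exponential-moving-average form of the smoothing is an assumption for illustration, with $\eta = 0.9$ mirroring the smoothing constant reported later.

```python
# Illustrative GradBalance step (not the authors' implementation): lambda is set from
# the ratio of gradient norms and smoothed over iterations via an assumed EMA.
import torch

def gradbalance_step(alpha_logit, g_obj, g_spa, state, kappa, eta=0.9):
    density = torch.sigmoid(alpha_logit).mean()          # current expected density
    if density > kappa:                                  # mask still too dense
        ratio = g_obj.norm() / (g_spa.norm() + 1e-12)    # scale constraint to objective
        state["lam"] = eta * state.get("lam", ratio.item()) + (1 - eta) * ratio.item()
    else:
        state["lam"] = 0.0                               # constraint satisfied: no push
    return g_obj + state["lam"] * g_spa                  # combined gradient g_alpha
```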

4. CTS Objectives: Reverse KL and Other Distillation-Based Losses

While R\mathcal{R} can be the task loss, CTS empirically benefits from objectives designed to preserve training dynamics. The reverse Kullback–Leibler (KL) divergence between the output of the sparse “ticket” network and the original dense network,

$$\mathcal{R}_{\rm KL} = D_{\rm KL}\big(p_{\rm sparse}\;\big\|\;p_{\rm dense}\big)$$

where $p_{\rm sparse} = \mathrm{softmax}(f(x;\, s\odot\theta))$ and $p_{\rm dense} = \mathrm{softmax}(f(x;\theta))$, encourages the mask to preserve initialization-time functional behavior (a short sketch of this objective follows the list below). Other applicable objectives include:

  • Relative loss change (SNIP)
  • Negative gradient norm (GraSP)
  • Feature-map matching
  • Gradient-matching (first-step direction)

Among these, reverse KL and task loss lead to subnetworks that are effective at very high sparsities.
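For concreteness, a minimal sketch of the reverse-KL objective, assuming `logits_sparse` and `logits_dense` are the class logits of the masked and dense networks on the same batch (the names and the batch-mean reduction are illustrative choices):

```python
# Reverse KL D_KL(p_sparse || p_dense) between ticket and dense-network predictions.
import torch
import torch.nn.functional as F

def reverse_kl(logits_sparse: torch.Tensor, logits_dense: torch.Tensor) -> torch.Tensor:
    log_p_sparse = F.log_softmax(logits_sparse, dim=-1)
    log_p_dense = F.log_softmax(logits_dense, dim=-1)
    p_sparse = log_p_sparse.exp()
    # sum_c p_sparse * (log p_sparse - log p_dense), averaged over the batch
    return (p_sparse * (log_p_sparse - log_p_dense)).sum(dim=-1).mean()
```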

5. Algorithmic Framework and Computational Characteristics

The CTS algorithm proceeds as follows:

  1. Initial Training: Train $f(x;\theta_0)$ for $k$ steps to obtain $\theta_k$.
  2. Optimization of Mask Probabilities: Freeze $\theta_k$, initialize $\alpha_{\text{logit}}$ so that the expected density equals $\kappa$, and iteratively update $\alpha_{\text{logit}}$ using the Concrete relaxation and GradBalance over $S$ ticket-search epochs.
  3. Mask Finalization: Select the top-$(\kappa d)$ entries of $\alpha$ to form the deterministic subnetwork.
  4. Final Training: Train $f(x;\, m\odot\theta_k)$ for the remaining $(T-k)$ steps.

Key hyperparameters include the Concrete temperature $\tau = 2/3$, the learning rate for $\alpha$, and the smoothing constant $\eta = 0.9$. CTS requires no manual selection of Lagrange multipliers and has no sensitive ticket-search or constraint hyperparameters.
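Putting these pieces together, the following compact sketch of the ticket-search loop (step 2 above) reuses the helper functions sketched earlier; the model-masking plumbing (`forward_with_mask`, `dense_logits_fn`), the data loader, and the optimizer choice are placeholders, not the paper's implementation.

```python
# Sketch of the CTS ticket-search loop over a frozen network, using the
# sample_soft_mask, finalize_mask, gradbalance_step, and reverse_kl helpers above.
import torch

def search_ticket(forward_with_mask, dense_logits_fn, loader, d, kappa,
                  tau=2/3, epochs=5, lr=0.1):
    # Initialize logits so that sigmoid(alpha_logit) ~ kappa for every parameter.
    alpha_logit = torch.full((d,), torch.logit(torch.tensor(kappa)).item(),
                             requires_grad=True)
    opt = torch.optim.Adam([alpha_logit], lr=lr)
    state = {}
    for _ in range(epochs):
        for x, _y in loader:
            s = sample_soft_mask(alpha_logit, tau)                  # Concrete sample
            loss = reverse_kl(forward_with_mask(x, s), dense_logits_fn(x))
            g_obj = torch.autograd.grad(loss, alpha_logit)[0]
            sparsity = torch.sigmoid(alpha_logit).sum() / (kappa * d) - 1.0
            g_spa = torch.autograd.grad(sparsity, alpha_logit)[0]
            alpha_logit.grad = gradbalance_step(alpha_logit, g_obj, g_spa, state, kappa)
            opt.step()
            opt.zero_grad()
    return finalize_mask(alpha_logit.detach(), kappa)               # hard {0,1} mask
```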

In computational terms, CTS eliminates the need for the roughly 20 prune–retrain cycles (each a full or partial retrain) required by LTR. Ticket search proceeds over a frozen network; e.g., on ResNet-20 (CIFAR-10), 7.9 minutes suffice to reach 99.3% sparsity with 74.0% accuracy, compared to 95.2 minutes for LTR with 68.3% accuracy. On ImageNet with ResNet-50, CTS achieves a roughly 12× speedup with higher accuracy in the 99% sparse regime.

| Method | Compute (epochs) | Test Acc. (%) | Sanity Checks Passed |
|---|---|---|---|
| LTR | 3058 | 80.90 | |
| SNIP | 160 | 67.73 | |
| GraSP | 160 | 62.59 | |
| SynFlow | 161 | 70.18 | |
| Gem-Miner | 320 | 77.89 | |
| Quick CTS$_{\mathrm{KL}}$ | 180 | 79.04 | |
| CTS$_{\mathrm{KL}}$ | 320 | 80.26 | |

6. Empirical Results and Sanity Checks

CTS was evaluated on CIFAR-10 (ResNet-20, VGG-16) and ImageNet (ResNet-50). In high-sparsity regimes (>95% sparsity, up to 99.8%), CTS$_{\mathrm{KL}}$ consistently outperformed both LTR and all tested PaI methods in terms of accuracy, computational efficiency, and robustness to established ablations.

Under standard sanity checks, a valid ticket should lose accuracy if its mask is shuffled within each layer, its mask selection scores are inverted, or its kept weights are re-initialized. CTS passes all such checks: accuracy drops markedly under these ablations, in contrast to PaI methods, whose accuracy often does not. Layerwise analysis shows that CTS (like LTR) preserves density in the first and last layers while heavily pruning intermediate layers, whereas saliency-based PaI often collapses critical layers.

CTS tickets trained under reverse KL closely match the logit trajectories of the dense parent, and retained subnetworks follow the same early feature-map and gradient-norm dynamics as their dense ancestors. This demonstrates effective preservation of training dynamics, which is pivotal for accuracy at high sparsity.

7. Practical Implications and Significance

CTS reframes lottery ticket discovery as a single, continuous optimization, combining (i) a low-variance Concrete relaxation of the mask search, (ii) an adaptive GradBalance scheme for precise sparsity, and (iii) training-dynamic-inspired knowledge distillation objectives. The method achieves (1) high-quality subnetworks that match or surpass LTR in the highly sparse regime; (2) compliance with all established winning ticket sanity checks; and (3) a dramatic reduction in computational burden, typically a $4\times$ to $12\times$ speedup at high sparsity. This suggests CTS is a robust and efficient approach for sparse subnetwork extraction, particularly in cases where computational efficiency and verification of winning ticket properties are essential (Arora et al., 8 Dec 2025).
