
Concrete Ticket Search (CTS)

Updated 15 December 2025
  • Concrete Ticket Search (CTS) is an algorithm that uses a continuous Concrete relaxation to identify sparse, high-performing subnetworks in overparameterized neural networks.
  • It leverages an adaptive GradBalance scheme and reverse KL divergence to enforce precise sparsity and maintain the original network's training dynamics.
  • CTS achieves notable computational speedups (up to 12× faster) while reliably passing sanity checks compared to traditional lottery ticket and pruning methods.

Concrete Ticket Search (CTS) is an algorithm for discovering highly sparse, trainable subnetworks (so-called "winning tickets") within overparameterized neural networks at or near initialization. Motivated by the limitations of both Lottery Ticket Rewinding (LTR) and saliency-based Pruning-at-Initialization (PaI) methods, CTS frames ticket search as a combinatorial optimization over binary masks and leverages a continuous, low-variance relaxation using Concrete (Gumbel-softmax) distributions. Combined with an adaptive gradient-balancing scheme (GradBalance) for rigorous sparsity control, CTS efficiently produces high-performing subnetworks that robustly pass established sanity checks and match or exceed LTR accuracy while requiring orders-of-magnitude less computation (Arora et al., 8 Dec 2025).

1. Problem Formulation

The core objective of CTS is to find a binary mask $m \in \{0,1\}^d$ for a neural network $f(x;\theta)$, such that exactly $\kappa d$ parameters (out of $d$ total) are retained, with $\kappa \in (0,1]$ the target density. The discrete combinatorial problem is:

$$\min_{m\in\{0,1\}^d} \ \mathbb{E}_{(x, y)\sim\mathcal{D}} \left[\mathcal{R}\big(f(x;\, m \odot \theta),\, y\big)\right] \quad \text{s.t.} \quad \|m\|_0 = \kappa d \qquad (1)$$

where $\mathcal{R}$ may be the standard task loss or auxiliary objectives reflecting preserved training dynamics. As directly optimizing over $\binom{d}{\kappa d}$ masks is intractable, CTS proposes a continuous relaxation.
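As a quick numeric illustration of this intractability, the following toy Python snippet (the layer size and density are hypothetical, not taken from the paper) counts the admissible masks for a single small layer:

```python
# Toy illustration: even for one small layer, the number of masks satisfying the
# constraint in Eq. (1) is astronomically large, so exhaustive search is hopeless.
import math

d = 10_000          # hypothetical parameter count of one small layer
kappa = 0.01        # keep 1% of the weights
k = int(kappa * d)  # number of retained parameters

num_masks = math.comb(d, k)  # |{m : ||m||_0 = kappa * d}|
print(f"~10^{len(str(num_masks)) - 1} admissible masks")
```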

2. Concrete Relaxation and Optimization

CTS introduces retention probabilities $\alpha \in [0,1]^d$ and models each mask entry as $m_j \sim \text{Bernoulli}(\alpha_j)$, leading to the relaxed problem:

$$\min_{\alpha \in [0,1]^d} \ \mathbb{E}_{m\sim\text{Bern}(\alpha)}\left[\mathcal{R}\big(f(x;\, m \odot \theta),\, y\big)\right] \quad \text{s.t.} \quad \frac{1}{d}\sum_{j=1}^d \alpha_j \approx \kappa \qquad (2)$$

To backpropagate efficiently through discrete samples, CTS adopts the Binary Concrete (Gumbel-softmax) reparameterization. Each “soft mask” variable $s_j$ is obtained as:

$$s_j = \sigma\!\left(\frac{\alpha_{\text{logit},j} + \varepsilon_j}{\tau}\right), \quad \varepsilon_j \sim \text{Logistic}(0,1), \quad \tau > 0$$

with $\sigma(z)$ the sigmoid and $\tau$ the Concrete temperature. At lower $\tau$, $s_j$ concentrates near $\{0,1\}$, closely reflecting hard masks. At the end of optimization, the top-$(\kappa d)$ elements of $\alpha$ are set to 1 to form the final deterministic mask.
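A minimal PyTorch-style sketch of this sampling step and of the final top-$(\kappa d)$ hard-mask selection, assuming a flattened vector of mask logits (the names `sample_soft_mask`, `finalize_mask`, and `alpha_logit` are illustrative, not the paper's code):

```python
# Binary Concrete (Gumbel-softmax) relaxation of the mask, as described above.
import torch

def sample_soft_mask(alpha_logit: torch.Tensor, tau: float = 2.0 / 3.0) -> torch.Tensor:
    """Draw a relaxed mask s in (0,1)^d from per-parameter retention logits."""
    u = torch.rand_like(alpha_logit).clamp_(1e-6, 1 - 1e-6)
    eps = torch.log(u) - torch.log1p(-u)             # Logistic(0, 1) noise
    return torch.sigmoid((alpha_logit + eps) / tau)  # soft mask, differentiable w.r.t. logits

def finalize_mask(alpha_logit: torch.Tensor, kappa: float) -> torch.Tensor:
    """Keep the top-(kappa * d) entries as a hard {0,1} mask at the end of ticket search."""
    d = alpha_logit.numel()
    k = int(round(kappa * d))
    mask = torch.zeros_like(alpha_logit)
    mask[torch.topk(alpha_logit, k).indices] = 1.0
    return mask
```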

3. GradBalance: Adaptive Sparsity Enforcement

Naive Lagrangian approaches for enforcing the mask density constraint are prone to instability and require manual tuning. CTS introduces the “GradBalance” scheme, which adaptively scales the gradient of the sparsity constraint so that its magnitude is balanced against that of the objective gradient.

Let the normalized sparsity constraint be

$$\mathcal{L}_{\rm sparsity} = \frac{1}{\kappa d} \sum_j \alpha_j - 1$$

Let $g_{\text{obj}} = \nabla_{\alpha_{\text{logit}}}\mathcal{R}$ denote the objective gradient and $g_{\text{spa}} = \nabla_{\alpha_{\text{logit}}}\mathcal{L}_{\rm sparsity}$ the constraint gradient. If the mask is too dense, the multiplier $\lambda$ is chosen so that $\lambda\,\|g_{\text{spa}}\|_2$ matches $\|g_{\text{obj}}\|_2$, with smoothing applied across steps. The update step becomes

$$g_{\alpha} = g_{\text{obj}} + \lambda\, g_{\text{spa}}$$

This adaptive approach enforces the target mask density without sensitive hyperparameter selection.
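One plausible realization of a single GradBalance step is sketched below; the exponential-moving-average form of the smoothing is an assumption for illustration, with $\eta = 0.9$ mirroring the smoothing constant reported later.

```python
# Illustrative GradBalance step (not the authors' implementation): lambda is set from
# the ratio of gradient norms and smoothed over iterations via an assumed EMA.
import torch

def gradbalance_step(alpha_logit, g_obj, g_spa, state, kappa, eta=0.9):
    density = torch.sigmoid(alpha_logit).mean()          # current expected density
    if density > kappa:                                  # mask still too dense
        ratio = g_obj.norm() / (g_spa.norm() + 1e-12)    # scale constraint to objective
        state["lam"] = eta * state.get("lam", ratio.item()) + (1 - eta) * ratio.item()
    else:
        state["lam"] = 0.0                               # constraint satisfied: no push
    return g_obj + state["lam"] * g_spa                  # combined gradient g_alpha
```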

4. CTS Objectives: Reverse KL and Other Distillation-Based Losses

While R\mathcal{R} can be the task loss, CTS empirically benefits from objectives designed to preserve training dynamics. The reverse Kullback–Leibler (KL) divergence between the output of the sparse “ticket” network and the original dense network,

$$\mathcal{R}_{\rm KL} = D_{\rm KL}\big(p_{\rm sparse}\;\big\|\;p_{\rm dense}\big)$$

where $p_{\rm sparse} = \mathrm{softmax}(f(x;\, s\odot\theta))$ and $p_{\rm dense} = \mathrm{softmax}(f(x;\theta))$, encourages the mask to preserve initialization-time functional behavior (a short sketch of this objective follows the list below). Other applicable objectives include:

  • Relative loss change (SNIP)
  • Negative gradient norm (GraSP)
  • Feature-map matching
  • Gradient-matching (first-step direction)

Among these, reverse KL and task loss lead to subnetworks that are effective at very high sparsities.
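For concreteness, a minimal sketch of the reverse-KL objective, assuming `logits_sparse` and `logits_dense` are the class logits of the masked and dense networks on the same batch (the names and the batch-mean reduction are illustrative choices):

```python
# Reverse KL D_KL(p_sparse || p_dense) between ticket and dense-network predictions.
import torch
import torch.nn.functional as F

def reverse_kl(logits_sparse: torch.Tensor, logits_dense: torch.Tensor) -> torch.Tensor:
    log_p_sparse = F.log_softmax(logits_sparse, dim=-1)
    log_p_dense = F.log_softmax(logits_dense, dim=-1)
    p_sparse = log_p_sparse.exp()
    # sum_c p_sparse * (log p_sparse - log p_dense), averaged over the batch
    return (p_sparse * (log_p_sparse - log_p_dense)).sum(dim=-1).mean()
```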

5. Algorithmic Framework and Computational Characteristics

The CTS algorithm proceeds as follows:

  1. Initial Training: Train $f(x;\theta_0)$ for $k$ steps to obtain $\theta_k$.
  2. Optimization of Mask Probabilities: Freeze $\theta_k$, initialize $\alpha_{\text{logit}}$ so that the expected density equals $\kappa$, and iteratively update $\alpha_{\text{logit}}$ using the Concrete relaxation and GradBalance over $S$ ticket-search epochs.
  3. Mask Finalization: Select the top-$(\kappa d)$ entries of $\alpha$ to form the deterministic subnetwork.
  4. Final Training: Train $f(x;\, m\odot\theta_k)$ for the remaining $(T-k)$ steps.

Key hyperparameters include the Concrete temperature $\tau = 2/3$, the learning rate for $\alpha$, and the smoothing constant $\eta = 0.9$. CTS requires no manual selection of Lagrange multipliers and has no sensitive ticket-search or constraint hyperparameters.
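Putting these pieces together, the following compact sketch of the ticket-search loop (step 2 above) reuses the helper functions sketched earlier; the model-masking plumbing (`forward_with_mask`, `dense_logits_fn`), the data loader, and the optimizer choice are placeholders, not the paper's implementation.

```python
# Sketch of the CTS ticket-search loop over a frozen network, using the
# sample_soft_mask, finalize_mask, gradbalance_step, and reverse_kl helpers above.
import torch

def search_ticket(forward_with_mask, dense_logits_fn, loader, d, kappa,
                  tau=2/3, epochs=5, lr=0.1):
    # Initialize logits so that sigmoid(alpha_logit) ~ kappa for every parameter.
    alpha_logit = torch.full((d,), torch.logit(torch.tensor(kappa)).item(),
                             requires_grad=True)
    opt = torch.optim.Adam([alpha_logit], lr=lr)
    state = {}
    for _ in range(epochs):
        for x, _y in loader:
            s = sample_soft_mask(alpha_logit, tau)                  # Concrete sample
            loss = reverse_kl(forward_with_mask(x, s), dense_logits_fn(x))
            g_obj = torch.autograd.grad(loss, alpha_logit)[0]
            sparsity = torch.sigmoid(alpha_logit).sum() / (kappa * d) - 1.0
            g_spa = torch.autograd.grad(sparsity, alpha_logit)[0]
            alpha_logit.grad = gradbalance_step(alpha_logit, g_obj, g_spa, state, kappa)
            opt.step()
            opt.zero_grad()
    return finalize_mask(alpha_logit.detach(), kappa)               # hard {0,1} mask
```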

In computational terms, CTS eliminates the need for the roughly 20 prune–retrain cycles (each a full or partial retrain) required by LTR. Ticket search proceeds over a frozen network; e.g., on ResNet-20 (CIFAR-10), 7.9 minutes suffice to reach 99.3% sparsity with 74.0% accuracy, compared to 95.2 minutes for LTR with 68.3% accuracy. On ImageNet with ResNet-50, CTS achieves a roughly 12× speedup with higher accuracy in the 99% sparse regime.

| Method | Compute (epochs) | Test Acc. (%) | Sanity Checks Passed |
|---|---|---|---|
| LTR | 3058 | 80.90 | |
| SNIP | 160 | 67.73 | |
| GraSP | 160 | 62.59 | |
| SynFlow | 161 | 70.18 | |
| Gem-Miner | 320 | 77.89 | |
| Quick CTS$_{\mathrm{KL}}$ | 180 | 79.04 | |
| CTS$_{\mathrm{KL}}$ | 320 | 80.26 | |

6. Empirical Results and Sanity Checks

CTS was evaluated on CIFAR-10 (ResNet-20, VGG-16) and ImageNet (ResNet-50). In high-sparsity regimes (>95% sparsity, up to 99.8%), CTS$_{\mathrm{KL}}$ consistently outperformed both LTR and all tested PaI methods in terms of accuracy, computational efficiency, and robustness to established ablations.

Under standard sanity checks, a valid ticket should lose accuracy if its mask is shuffled within each layer, its mask selection scores are inverted, or its kept weights are re-initialized. CTS passes all such checks: accuracy drops markedly under these ablations, in contrast to PaI methods, whose accuracy often does not. Layerwise analysis shows that CTS (like LTR) preserves density in the first and last layers while heavily pruning intermediate layers, whereas saliency-based PaI often collapses critical layers.

CTS tickets trained under reverse KL closely match the logit trajectories of the dense parent, and retained subnetworks follow the same early feature-map and gradient-norm dynamics as their dense ancestors. This demonstrates effective preservation of training dynamics, which is pivotal for accuracy at high sparsity.

7. Practical Implications and Significance

CTS reframes lottery ticket discovery as a single, continuous optimization, combining (i) a low-variance Concrete relaxation of the mask search, (ii) an adaptive GradBalance scheme for precise sparsity, and (iii) training-dynamic-inspired knowledge distillation objectives. The method achieves (1) high-quality subnetworks that match or surpass LTR in the highly sparse regime; (2) compliance with all established winning ticket sanity checks; and (3) a dramatic reduction in computational burden, typically a $4\times$ to $12\times$ speedup at high sparsity. This suggests CTS is a robust and efficient approach for sparse subnetwork extraction, particularly in cases where computational efficiency and verification of winning ticket properties are essential (Arora et al., 8 Dec 2025).
