
Concrete Ticket Search (CTS)

Updated 15 December 2025
  • Concrete Ticket Search (CTS) is an algorithm that uses a continuous Concrete relaxation to identify sparse, high-performing subnetworks in overparameterized neural networks.
  • It leverages an adaptive GradBalance scheme to enforce precise sparsity and a reverse KL divergence objective to preserve the original network's training dynamics.
  • CTS achieves notable computational speedups (up to 12× faster than Lottery Ticket Rewinding) while reliably passing the established sanity checks that many traditional lottery ticket and pruning methods fail.

Concrete Ticket Search (CTS) is an algorithm for discovering highly sparse, trainable subnetworks (so-called "winning tickets") within overparameterized neural networks at or near initialization. Motivated by the limitations of both Lottery Ticket Rewinding (LTR) and saliency-based Pruning-at-Initialization (PaI) methods, CTS frames ticket search as a combinatorial optimization over binary masks and leverages a continuous, low-variance relaxation using Concrete (Gumbel-softmax) distributions. Combined with an adaptive gradient-balancing scheme (GradBalance) for rigorous sparsity control, CTS efficiently produces high-performing subnetworks that robustly pass established sanity checks and match or exceed LTR accuracy while requiring up to an order of magnitude less computation (Arora et al., 8 Dec 2025).

1. Problem Formulation

The core objective of CTS is to find a binary mask $m \in \{0,1\}^d$ for a neural network $f(x;\theta)$, such that exactly $\kappa d$ parameters (out of $d$ total) are retained, with $\kappa \in (0,1]$ the target density. The discrete combinatorial problem is:

$$\min_{m\in\{0,1\}^d} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[\mathcal{R}\big(f(x;\, m \odot \theta),\, y\big)\right] \quad \text{s.t.} \quad \|m\|_0 = \kappa d \quad (1)$$

where $\mathcal{R}$ may be the standard task loss or an auxiliary objective reflecting preserved training dynamics. As directly optimizing over $\binom{d}{\kappa d}$ masks is intractable, CTS proposes a continuous relaxation.
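To get a sense of the scale involved, the size of this search space can be computed directly; the parameter count and density below are illustrative choices, not figures from the paper:

```python
import math

d = 270_000        # illustrative parameter count (roughly ResNet-20 scale)
kappa = 0.007      # 99.3% sparsity: keep 0.7% of the weights
k = round(kappa * d)

# Number of binary masks with exactly k ones: C(d, k).
n_masks = math.comb(d, k)
print(f"~10^{len(str(n_masks)) - 1} candidate masks")  # astronomically many
```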

2. Concrete Relaxation and Optimization

CTS introduces retention probabilities $\alpha \in [0,1]^d$ and models each mask entry as $m_j \sim \mathrm{Bernoulli}(\alpha_j)$, leading to the relaxed problem:

$$\min_{\alpha\in[0,1]^d} \; \mathbb{E}_{m\sim \mathrm{Bernoulli}(\alpha)}\, \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[\mathcal{R}\big(f(x;\, m \odot \theta),\, y\big)\right] \quad \text{s.t.} \quad \textstyle\sum_j \alpha_j = \kappa d \quad (2)$$

To backpropagate efficiently through discrete samples, CTS adopts the Binary Concrete (Gumbel-softmax) reparameterization. Each “soft mask” variable $\tilde m_j$ is obtained as:

$$\tilde m_j = \sigma\!\left(\frac{\log \alpha_j - \log(1-\alpha_j) + \log u_j - \log(1-u_j)}{\tau}\right), \qquad u_j \sim \mathrm{Uniform}(0,1) \quad (3)$$

with $\sigma$ the sigmoid and $\tau > 0$ the Concrete temperature. At lower $\tau$, $\tilde m_j$ concentrates near $\{0,1\}$, closely reflecting hard masks. At the end of optimization, the top-$\kappa d$ elements of $\alpha$ are set to $1$ (and the rest to $0$) to form the final deterministic mask.
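A minimal PyTorch sketch of the sampler in Eq. (3) and of the final hard-mask selection; the function names and the clamping constants are illustrative, not from the paper:

```python
import torch

def sample_soft_mask(alpha, tau=0.5, eps=1e-6):
    """Draw a differentiable soft mask ~ BinConcrete(alpha, tau), Eq. (3)."""
    u = torch.rand_like(alpha).clamp(eps, 1 - eps)      # Uniform(0,1) noise
    logistic = torch.log(u) - torch.log1p(-u)           # Logistic(0,1) sample
    logits = torch.log(alpha) - torch.log1p(-alpha)     # log(alpha / (1 - alpha))
    return torch.sigmoid((logits + logistic) / tau)     # soft mask in (0,1)

def finalize_mask(alpha, kappa):
    """Hard mask: keep the top-(kappa*d) entries of alpha, zero the rest."""
    k = int(round(kappa * alpha.numel()))
    mask = torch.zeros_like(alpha)
    mask[torch.topk(alpha, k).indices] = 1.0
    return mask
```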

3. GradBalance: Adaptive Sparsity Enforcement

Naive Lagrangian approaches to enforcing the mask density constraint are prone to instability and require manual tuning. CTS introduces the “GradBalance” scheme, which adaptively scales the gradient of the sparsity constraint so that its magnitude balances that of the objective gradient.

Let the normalized sparsity constraint be

$$C(\alpha) = \frac{1}{d}\sum_{j=1}^{d} \alpha_j - \kappa \quad (4)$$

Set $g = \nabla_\alpha \mathcal{R}$ (objective gradient) and $h = \nabla_\alpha C$ (constraint gradient). If the mask is too dense ($C(\alpha) > 0$), the multiplier $\lambda$ is set by scaling $h$ to match the magnitude of $g$, with smoothing across steps. The update step becomes

$$\alpha \leftarrow \alpha - \eta\,\big(g + \lambda h\big) \quad (5)$$

This adaptive approach reliably attains the target mask density without sensitive hyperparameter selection.
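The following is a schematic sketch of how such a gradient-balanced update might look; the norm-matching rule, the smoothing coefficient beta, and the clamping of alpha are assumptions for illustration, not the paper's exact procedure:

```python
import torch

class GradBalance:
    """Scale the constraint gradient to balance the objective gradient."""

    def __init__(self, kappa, lr=0.1, beta=0.9):
        self.kappa = kappa    # target density
        self.lr = lr          # learning rate for alpha
        self.beta = beta      # assumed smoothing coefficient for the multiplier
        self.lam = 0.0        # smoothed multiplier lambda

    def step(self, alpha, obj_grad):
        d = alpha.numel()
        # Normalized density constraint C(alpha) = mean(alpha) - kappa (Eq. 4);
        # its gradient w.r.t. alpha is the constant vector h = 1/d.
        C = float(alpha.mean()) - self.kappa
        h = torch.full_like(alpha, 1.0 / d)

        # Match the constraint gradient's norm to the objective gradient's,
        # signed to push the density toward kappa, then smooth over steps.
        scale = float(obj_grad.norm() / h.norm().clamp_min(1e-12))
        lam_target = scale if C > 0 else -scale
        self.lam = self.beta * self.lam + (1 - self.beta) * lam_target

        # Gradient step on alpha (Eq. 5), keeping probabilities inside (0, 1).
        alpha -= self.lr * (obj_grad + self.lam * h)
        alpha.clamp_(1e-4, 1 - 1e-4)
```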

4. CTS Objectives: Reverse KL and Other Distillation-Based Losses

While $\mathcal{R}$ can be the task loss, CTS empirically benefits from objectives designed to preserve training dynamics. The reverse Kullback–Leibler (KL) divergence between the outputs of the sparse “ticket” network and the original dense network,

$$\mathcal{L}_{\mathrm{rKL}} = \mathbb{E}_{x\sim\mathcal{D}} \left[ D_{\mathrm{KL}}\big( p_s(\cdot \mid x) \,\|\, p_d(\cdot \mid x) \big) \right] \quad (6)$$

where $p_s(\cdot \mid x)$ is the softmax output of the masked network $f(x;\, m \odot \theta)$ and $p_d(\cdot \mid x)$ that of the dense network $f(x;\theta)$, encourages the mask to preserve initialization-time functional behavior (a code sketch follows the list below). Other applicable objectives include:

  • Relative loss change (SNIP)
  • Negative gradient norm (GraSP)
  • Feature-map matching
  • Gradient-matching (first-step direction)

Among these, the reverse KL and task-loss objectives yield subnetworks that remain effective at very high sparsities.
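A sketch of the reverse KL objective of Eq. (6) under the usual softmax-output assumption; the function name and argument order are illustrative:

```python
import torch.nn.functional as F

def reverse_kl_loss(sparse_logits, dense_logits):
    """KL(p_sparse || p_dense), averaged over the batch (Eq. 6)."""
    log_ps = F.log_softmax(sparse_logits, dim=-1)   # masked ("ticket") network
    log_pd = F.log_softmax(dense_logits, dim=-1)    # frozen dense network
    # KL(p || q) = sum_c p_c (log p_c - log q_c)
    return (log_ps.exp() * (log_ps - log_pd)).sum(dim=-1).mean()
```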

5. Algorithmic Framework and Computational Characteristics

The CTS algorithm proceeds as follows:

  1. Initial Training: Train the dense network $f(\cdot\,; \theta)$ for $t_0$ steps to obtain $\theta_{t_0}$.
  2. Optimization of Mask Probabilities: Freeze $\theta_{t_0}$, initialize $\alpha$ at the target density $\kappa$, and iteratively update $\alpha$ using the Concrete relaxation and GradBalance over the ticket-search epochs.
  3. Mask Finalization: Select the top-$\kappa d$ entries of $\alpha$ to form the deterministic subnetwork mask $m$.
  4. Final Training: Train $f(\cdot\,;\, m \odot \theta_{t_0})$ for the remaining training steps.

Key hyperparameters include the Concrete temperature $\tau$, the learning rate for $\alpha$, and the GradBalance smoothing coefficient. CTS requires no manual selection of Lagrange multipliers and has no sensitive ticket-search or constraint hyperparameters.
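Putting the pieces together, a condensed sketch of the ticket-search phase (steps 2-3 above), reusing the hypothetical helpers from the earlier snippets (sample_soft_mask, GradBalance, reverse_kl_loss, finalize_mask); the logits_fn interface and the flattened-weights representation are simplifying assumptions:

```python
import torch

def concrete_ticket_search(logits_fn, theta, loader, kappa,
                           epochs=20, tau=0.5):
    """Ticket-search phase over a frozen flat weight tensor theta.

    logits_fn(x, weights) -> logits; theta is never updated here.
    """
    alpha = torch.full_like(theta, kappa).requires_grad_(True)  # init at density kappa
    balancer = GradBalance(kappa)

    for _ in range(epochs):
        for x, _ in loader:
            m_soft = sample_soft_mask(alpha, tau)               # Concrete sample
            loss = reverse_kl_loss(logits_fn(x, m_soft * theta),
                                   logits_fn(x, theta))
            (obj_grad,) = torch.autograd.grad(loss, alpha)
            balancer.step(alpha.data, obj_grad)                 # GradBalance update
    return finalize_mask(alpha.detach(), kappa)                 # hard top-(kappa*d) mask
```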

In computational terms, CTS eliminates the repeated prune–retrain cycles (each a full or partial retrain) that LTR requires. Ticket search proceeds over a frozen network: on ResNet-20 (CIFAR-10), for example, 7.9 minutes suffice to reach 99.3% sparsity with 74.0% accuracy, compared to 95.2 minutes for LTR at 68.3% accuracy. On ImageNet with ResNet-50, CTS achieves a ≈12× speedup with higher accuracy in the 99% sparse regime.

Method       Compute (epochs)   Test Acc. (%)   Sanity Checks Passed
LTR          3058               80.90
SNIP         160                67.73
GraSP        160                62.59
SynFlow      161                70.18
Gem-Miner    320                77.89
Quick CTS    180                79.04
CTS          320                80.26

6. Empirical Results and Sanity Checks

CTS was evaluated on CIFAR-10 (ResNet-20, VGG-16) and ImageNet (ResNet-50). In high-sparsity regimes ($\geq 95\%$ sparsity, up to 99.8%), CTS consistently outperformed both LTR and all tested PaI methods in accuracy, computational efficiency, and robustness to established ablations.

Under standard sanity checks, a valid ticket should lose accuracy if its mask is shuffled within layers, its mask-selection scores are inverted, or its kept weights are re-initialized. CTS passes all such checks, with accuracy dropping markedly under these ablations, in contrast to PaI methods, which often do not. Layerwise analysis shows that CTS (like LTR) preserves density in the first and last layers while heavily pruning intermediate layers, whereas saliency-based PaI often collapses critical layers.
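The layerwise mask-shuffle ablation, for instance, can be sketched as follows (a hypothetical helper; the paper's exact protocol may differ). A genuine ticket should lose accuracy when retrained with the shuffled mask:

```python
import torch

def shuffle_mask_within_layers(masks):
    """Sanity-check ablation: permute each layer's mask entries at random,
    preserving per-layer density but destroying the learned connectivity."""
    out = {}
    for name, m in masks.items():
        flat = m.flatten()
        out[name] = flat[torch.randperm(flat.numel())].reshape(m.shape)
    return out
```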

CTS tickets closely match the logit trajectories of the dense parent under the reverse KL objective, and retained subnetworks follow the same early feature-map and gradient-norm dynamics as their dense ancestors. This demonstrates effective preservation of training dynamics, which is pivotal for accuracy at high sparsity.

7. Practical Implications and Significance

CTS reframes lottery ticket discovery as a single, continuous optimization, combining (i) a low-variance Concrete relaxation of the mask search, (ii) an adaptive GradBalance scheme for precise sparsity, and (iii) training-dynamics-inspired knowledge distillation objectives. The method achieves (1) high-quality subnetworks that match or surpass LTR in the highly sparse regime; (2) compliance with all established winning-ticket sanity checks; and (3) a dramatic reduction in computational burden, typically up to a 12× speedup at high sparsity. This suggests CTS is a robust and efficient approach for sparse subnetwork extraction, particularly where computational efficiency and verification of winning-ticket properties are essential (Arora et al., 8 Dec 2025).

References

  • Arora et al., 8 Dec 2025.
