Gradient Freezing & Lottery Ticket Scratcher
- Gradient freezing and LTS are methods that halt updates on static weights to streamline the search for effective subnetworks and lower computational overhead.
- They employ techniques like task-agnostic mask training and EMA-based quantization monitoring to efficiently identify and freeze optimal weight configurations.
- Empirical findings demonstrate that these approaches can improve accuracy, speed up subnetwork discovery, and significantly reduce FLOPs in models like BERT, MobileNetV2, and ResNet.
Gradient freezing and Lottery Ticket Scratcher (LTS) methods represent a family of approaches that exploit the static or semi-static behavior of neural network weights to optimize the search for effective subnetworks and to reduce computational overhead during model training. This article synthesizes the core methodologies, empirical findings, and implications of gradient freezing and LTS, focusing on their structured application to transfer learning, quantization-aware training (QAT), and sparse lottery ticket (LT) search, as well as their theoretical and practical relationship to the Lottery Ticket Hypothesis (LTH).
1. Conceptual Foundations
Gradient freezing refers to the process of halting the gradient-based updates of certain network parameters during training, for example when weights have converged to a static or near-static value relative to their operational effect (such as their quantized counterpart). The LTS approach, in particular, provides a principled mechanism to identify such weights dynamically during training, based on their distance to discrete quantization bins.
Within the broader context of LTH, "lottery tickets" are subnetworks of a dense model that can match or even exceed the accuracy of the original model after being suitably masked or selected. Freezing, masking, and supermask search techniques (notably Edge-Popup) exploit pre-training, initialization, or early-in-training statistics to locate such subnetworks efficiently (Liu et al., 2022, Zhong et al., 2022, Otsuka et al., 2024).
2. Task-Agnostic Mask Training and Weight Freezing in Transfer Learning
Task-agnostic mask training (TAMT) for models such as BERT operationalizes the concept by freezing all original weights and training binary masks on pre-training objectives, specifically:
- Binary Mask Parameterization: With pre-trained weights $\theta$ and a binary mask $m \in \{0,1\}^{|\theta|}$, the masked network is $f(x;\, m \odot \theta)$.
- Objective: The mask is optimized to minimize the pre-training loss $\mathcal{L}(m \odot \theta)$, either:
  - Masked Language Modeling (MLM): $\mathcal{L}_{\mathrm{MLM}}$, the cross-entropy of predicting masked tokens under the masked network $f(\cdot;\, m \odot \theta)$;
  - Hidden-state Distillation (KD): $\mathcal{L}_{\mathrm{KD}} = \sum_{l} \big\| H_l^{t} - H_l^{s} \big\|^2$,
  where $H_l^{t}$ and $H_l^{s}$ denote teacher (unpruned) and student (masked) hidden states at layer $l$.
- Optimization: Real-valued mask parameters are binarized in the forward pass, with a threshold set to achieve the target sparsity. Mask gradients are computed via the straight-through estimator (STE), and only the mask parameters are updated; remains fixed during the search (Liu et al., 2022).
- Empirical Results (BERT₍BASE₎): At 70% sparsity, TAMT-MLM achieves average GLUE+SQuAD ≈ 83.8 versus ≈ 82.1 for iterative magnitude pruning (IMP) and ≈ 79.5 for one-shot magnitude pruning (OMP); SQuAD F1 is ≈ 85.2 (TAMT-MLM) versus 82.9 (IMP). TAMT yields an 8.7× search speedup over IMP at comparable performance. The fine-tuned accuracy on few-shot subsets also favors TAMT (SST-2 10K: 91.8% TAMT-MLM vs. 89.9% IMP).
Task-agnostic mask training demonstrates that by freezing weights and exploring the binary mask space under pre-training objectives, a subnetwork with superior transferability can be located efficiently and robustly.
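The mask-training loop described above can be sketched for a single frozen linear layer. The following NumPy reconstruction is illustrative only: the threshold-based binarization and straight-through estimator follow the description, but the function names and the squared-error loss are stand-ins, not the actual BERT implementation.

```python
import numpy as np

def binarize_topk(scores, sparsity):
    """Hard-binarize real-valued mask scores, keeping the top (1 - sparsity) fraction."""
    k = int(round((1.0 - sparsity) * scores.size))
    if k == 0:
        return np.zeros_like(scores)
    thresh = np.sort(scores.ravel())[::-1][k - 1]   # k-th largest score
    return (scores >= thresh).astype(scores.dtype)

def mask_train_step(W, scores, x, y, lr=0.05, sparsity=0.5):
    """One step of task-agnostic mask training on a frozen linear layer.

    W is never updated; only the real-valued mask scores move. The forward
    pass uses the hard binary mask, and the backward pass treats the
    binarization as identity (straight-through estimator, STE).
    """
    m = binarize_topk(scores, sparsity)   # forward: hard binary mask
    y_hat = x @ (m * W)                   # masked network output
    err = y_hat - y                       # gradient of 0.5 * ||y_hat - y||^2
    grad_masked_W = x.T @ err             # d loss / d (m * W)
    grad_scores = grad_masked_W * W       # STE: d m / d scores taken as identity
    return scores - lr * grad_scores, 0.5 * float(np.sum(err ** 2))
```

The key design point is that the discrete binarization is non-differentiable, so the STE simply passes the gradient of the masked weight straight through to the score, letting gradient descent re-rank which entries survive the top-k threshold.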
3. The Partly Scratch-Off Lottery Ticket in Quantization-Aware Training
In QAT, the "partly scratch-off lottery ticket" describes the phenomenon where a substantial portion of network weights quickly stabilize to their optimal quantization bins early in training. These weights, when monitored for the distance between the normalized weight $\hat{w}$ and its quantized value $Q_b(\hat{w})$ (with a $b$-bit quantizer $Q_b$), exhibit minimal movement relative to their assigned quantization level after a few epochs (Zhong et al., 2022).
Formally, after normalizing $\hat{w} = w/\alpha$ (with scaling factor $\alpha$), a weight is considered "scratched off" if
$$\left| \hat{w} - Q_b(\hat{w}) \right| < \epsilon$$
for a small threshold $\epsilon$, and the proportion of such weights reaches 50–80% within 10–20% of total QAT epochs.
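The scratched-off proportion can be measured directly. Below is a minimal sketch assuming a symmetric uniform quantizer on weights normalized into $[-1, 1]$; the specific quantizer and the function name are illustrative, not taken from the paper.

```python
import numpy as np

def scratched_off_fraction(w_hat, bits=2, eps=0.05):
    """Fraction of normalized weights within eps of their quantization bin.

    Assumes a symmetric uniform b-bit quantizer with levels k / (2^(b-1) - 1)
    on weights already normalized into [-1, 1].
    """
    levels = 2 ** (bits - 1) - 1
    w_hat = np.clip(w_hat, -1.0, 1.0)
    q = np.round(w_hat * levels) / levels        # nearest quantization bin
    return float(np.mean(np.abs(w_hat - q) < eps))
```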
4. The Lottery Ticket Scratcher (LTS) Algorithm and Computational Gains
The LTS algorithm leverages the above phenomenon as follows:
- Weight Monitoring: For each weight $w$, an exponential moving average (EMA) of its quantization bin distance is updated:
$$d_t = \mu\, d_{t-1} + (1 - \mu)\, \left| \hat{w}_t - Q_b(\hat{w}_t) \right|,$$
where $\mu$ is the EMA momentum.
- Freezing Criterion: If $d_t < \tau_t$ after the warmup period, the weight is frozen (zeroed gradient). The threshold $\tau_t$ is scheduled via a linear-growth function relative to the quantization interval.
- Empirical Outcomes: LTS yields 50–70% of weights frozen by the latter stages of QAT, with corresponding reductions of 23–35% in backward-pass FLOPs. Crucially, accuracy is typically maintained or improved; for example, on 2-bit MobileNetV2, LTS yields a 5.05% accuracy improvement with 46% fewer weight updates and 23% lower backward-pass FLOPs.
- Ablations: Threshold schedules, warmup duration, and EMA momentum impact the trade-off between FLOPs savings and accuracy; for instance, warmup periods of up to 20% of total epochs are effective.
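The monitoring-and-freezing rule can be sketched as follows. This is an assumed reconstruction: the hyperparameter names (`warmup`, `base_thresh`, `total_steps`) and the exact linear-growth schedule are illustrative, and may differ from the schedule used in Zhong et al. (2022).

```python
import numpy as np

def quantize(w_hat, bits):
    """Symmetric uniform b-bit quantizer on normalized weights in [-1, 1]."""
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(w_hat, -1.0, 1.0) * levels) / levels

def lts_step(w_hat, ema_dist, frozen, step, *, bits=2, momentum=0.99,
             warmup=100, base_thresh=0.05, total_steps=1000):
    """One monitoring step of an LTS-style freezing rule (illustrative sketch).

    Tracks an EMA of each weight's distance to its quantization bin and
    freezes weights whose EMA falls below a linearly growing threshold
    once the warmup period has passed.
    """
    dist = np.abs(w_hat - quantize(w_hat, bits))
    ema_dist = momentum * ema_dist + (1.0 - momentum) * dist
    if step >= warmup:
        interval = 1.0 / (2 ** (bits - 1) - 1)                 # quantization interval
        thresh = base_thresh * interval * min(1.0, step / total_steps)
        frozen = frozen | (ema_dist < thresh)                  # freezing is one-way
    return ema_dist, frozen
```

In a training loop, the `frozen` mask would be used to zero the corresponding weight gradients before the optimizer step; skipping those gradient computations is the source of the backward-pass FLOPs savings.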
Summary tables of key results from (Zhong et al., 2022):
| Dataset/Model | Mode | Accuracy (%) | Weight Grad Sparsity | Backprop FLOPs ↓ |
|---|---|---|---|---|
| ResNet-20 CIFAR-100 (2/2) | QAT | 52.95 | 0% | 0% |
| ResNet-20 CIFAR-100 (2/2) | QAT+LTS | 53.76 | 66% | 33% |
| MobileNetV2 (2-bit) | QAT | 40.56 | 0% | 0% |
| MobileNetV2 (2-bit) | QAT+LTS | 45.61 | 46% | 23% |
This illustrates that freezing based on the LTS principle can reduce computational burden with little or no accuracy compromise.
5. Freezing and Early Supermask Search in Strong Lottery Ticket Identification
Weight freezing is also foundational in the construction of strong lottery tickets (SLTs) in random or partially structured neural networks. (Otsuka et al., 2024) introduces a method of random partial freezing at initialization, splitting the parameter set into:
- Pruned parameters (): permanently zeroed, never to appear in any subnetwork.
- Locked parameters (): always included, cannot be dropped from the subnetwork.
- Searchable parameters: only these are explored in "supermask" or Edge-Popup search.
The SLT search is then conducted over the reduced search space, with Edge-Popup updating only the real-valued scores of the unfrozen parameters. The final mask is the union of all locked parameters, the pruned-out parameters (zeros), and those selected by the search.
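The partition-and-mask construction can be sketched as below, assuming flat 1-D parameter arrays. Only the final mask assembly is shown; the Edge-Popup score update itself (which trains the real-valued scores of the searchable parameters) is omitted, and the function name is illustrative.

```python
import numpy as np

def partial_freeze_mask(scores, pruned, locked, keep_frac):
    """Build a supermask over a randomly partitioned parameter set.

    `pruned` and `locked` are boolean arrays fixed at initialization; only
    the remaining 'searchable' parameters compete in an Edge-Popup-style
    top-k selection over their real-valued scores.
    """
    searchable = ~(pruned | locked)
    k = int(round(keep_frac * searchable.sum()))
    mask = locked.copy()                              # locked params always kept
    if k > 0:
        idx = np.flatnonzero(searchable)
        top = idx[np.argsort(scores[idx])[::-1][:k]]  # top-k searchable scores
        mask[top] = True
    return mask                                       # pruned params stay False
```

Because `pruned` and `locked` are removed from the competition at initialization, the score update only ever touches the searchable subset, which is what shrinks the effective search space.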
Key empirical outcomes:
| Method (Source) | Sparsity | Pruned | Locked | Top-1 (%) | Model Size (MB) |
|---|---|---|---|---|---|
| Weight Training | 0% | 0% | 0% | 87.1 | 8.63 |
| SLT (Dense) | 50% | 0% | 0% | 86.2 | 0.27 |
| SLT (Sparse) | 50% | 45% | 0% | 66.7 | 0.15 |
| SLT (Frozen) | 50% | 25% | 25% | 84.8 | 0.13 |
Freezing 70% of a ResNet on ImageNet achieves 3.3× compression over a dense-SLT and improves accuracy by up to 14.12 points over sparse SLTs from randomly pruned sources.
6. Implications, Theoretical Rationale, and Extensions
Freezing, as realized via LTS or partial SLT search, decisively refines the training search space. Under QAT, freezing quenches oscillations among quantized weights that have stabilized, reducing noise and sometimes enhancing convergence, particularly under aggressive quantization (e.g., 2–3 bits). In SLT search, freezing decouples the density of the subnetwork found from the search-space size, enabling efficient exploration and markedly improved accuracy–size trade-offs.
Subset-sum approximation theory in (Otsuka et al., 2024) justifies that sufficiently expressive SLTs exist within partially frozen networks, showing that lottery-like subnetworks are not unique to fully dense or purely pruned scenarios, but also characterize frozen architectures.
A plausible implication is that gradient freezing and LTS offer a mechanism for hardware-optimized and energy-efficient training, opening future directions toward "gradient-free" or "gradient-light" learning paradigms at scale. Additionally, the alignment of pre-training loss with downstream transferability in mask search (Liu et al., 2022) provides a statistical foundation for task-agnostic subnetwork selection.
7. Discussion of Limitations and Open Questions
While LTS and related freezing approaches reliably deliver computational savings and, under many settings, accuracy improvements or parity, several points remain underexplored. The long-term effects of freezing on generalization in highly non-i.i.d. regimes, and possible interactions between the freezing schedule and optimizer hyperparameters, are areas for further investigation. The precise mechanisms by which freezing contributes to improved or robust performance, particularly in the context of strong lottery tickets at extreme sparsity or quantization, remain to be completely characterized.
The general principle underlying these methods is that early stabilization—whether by quantization, mask learning, or static initialization—can be harnessed to focus trainable resources onto the subset of parameters most critical for convergence or transfer, revealing new dimensions of structural efficiency in deep networks.