
Parameter Group Freezing

Updated 21 April 2026
  • Parameter group freezing is a structured approach that partitions model parameters and selectively freezes groups to reduce memory/compute costs and boost efficiency.
  • It employs static or dynamic schedules based on metrics like gradient accumulation, activation statistics, and parameter drift to determine which groups to freeze.
  • The technique is widely applied in neural networks, quantum circuits, federated learning, and lattice gauge simulations, balancing training efficiency with model capacity.

Parameter group freezing is a structured approach for accelerating optimization, reducing memory/compute costs, or improving generalization by selectively fixing (i.e., freezing) disjoint subsets of model parameters during training or adaptation. The strategy is widely used in neural networks, quantum circuits, pipeline-parallel training, federated learning, continual learning, and lattice gauge simulations, with the specifics of partitioning, importance scoring, and scheduling varying by application domain.

1. Formal Definitions and Partitioning Strategies

Parameter group freezing entails partitioning the full parameter set $\theta$ into groups $\{\theta_g\}_{g=1}^{G}$, where each group represents contiguous tensors (layers, adapters), functionally similar blocks (encoders, decoders), or arbitrary index subsets (coordinates, LoRA projections). For frozen groups, gradients are neither computed nor applied:

$$\theta_g \leftarrow \begin{cases} \theta_g - \eta \nabla_{\theta_g} \mathcal{L} & \text{if } g \text{ is active} \\ \theta_g & \text{if } g \text{ is frozen} \end{cases}$$

The identification and scheduling of frozen groups may occur statically (e.g., fixed decoder freezing in PEFT (Dhole, 14 Jan 2025)) or dynamically, based on importance metrics computed during training (e.g., WSBD (Kverne et al., 11 Feb 2026), AFLoRA (Liu et al., 2024), TimelyFreeze (Cho et al., 5 Feb 2026)).
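The case-wise update rule can be sketched directly. The NumPy snippet below (group names and values are illustrative, not from any cited paper) applies one SGD step while leaving frozen groups untouched:

```python
import numpy as np

def grouped_sgd_step(params, grads, frozen, lr=0.1):
    """Apply one SGD step, skipping frozen parameter groups.

    params / grads: dict mapping group name -> np.ndarray
    frozen: set of group names whose parameters stay fixed
    """
    return {
        g: (p if g in frozen else p - lr * grads[g])
        for g, p in params.items()
    }

# Two groups: an "encoder" block that stays trainable and a
# "decoder" block that is statically frozen (illustrative names).
params = {"encoder": np.ones(3), "decoder": np.ones(3)}
grads  = {"encoder": np.full(3, 0.5), "decoder": np.full(3, 0.5)}

new = grouped_sgd_step(params, grads, frozen={"decoder"}, lr=0.1)
# encoder moves to 1 - 0.1 * 0.5 = 0.95; decoder stays at 1.0
```

In frameworks with autograd, the same effect is usually obtained more cheaply by excluding frozen tensors from gradient computation entirely, rather than computing and discarding their gradients.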

Partitioning granularity is highly task-dependent, ranging from whole layers or adapter modules down to individual coordinates.

2. Importance Criteria and Group Selection

Most adaptive freezing schemes rely on an "importance score" quantifying a parameter's or group's contribution to learning progress.

  • Gradient-accumulation metrics: WSBD defines a per-parameter sum of gradients over a sliding window, $s_k = \left|\sum_{t=1}^{\tau} \partial \mathcal{C}/\partial \theta_k\right| + \epsilon$, assigning the lowest activation probability to parameters with small accumulated magnitudes (Kverne et al., 11 Feb 2026).
  • Activation statistics: Layer mining for continual detection freezes groups according to (mean, median, variance, entropy) of activations (Menezes et al., 2024).
  • Gradient-based stability: AFLoRA computes EMAs of gradient magnitudes and their changes to generate a freezing score $s_\ell(t) = \mathrm{mean}(m_\ell(t) \odot u_\ell(t))$, incrementally freezing LoRA projections with the smallest $s_\ell(t)$ (Liu et al., 2024).
  • Parameter drift: Wireless FL monitors cumulative coordinate changes to determine which parameters have stabilized and are suitable for freezing (Ouyang et al., 2 Apr 2025).

The selection can be deterministic (freeze the lowest L% by some criterion) or stochastic (probabilistic masking as in WSBD). TimelyFreeze, in contrast, sets per-stage freeze ratios to optimize pipeline execution time under LP-imposed accuracy budgets (Cho et al., 5 Feb 2026).
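As a hedged sketch (not any paper's exact code), the windowed gradient-accumulation score and both selection modes, deterministic lowest-fraction and stochastic masking, might look like this:

```python
import numpy as np

def window_importance(grad_history, eps=1e-8):
    """WSBD-style score: |sum of gradients over a sliding window| + eps.

    grad_history: (tau, n_params) array of per-step gradients.
    """
    return np.abs(grad_history.sum(axis=0)) + eps

def select_frozen(scores, freeze_frac=0.5, stochastic=False, rng=None):
    """Deterministic mode: freeze the lowest-score fraction.
    Stochastic mode: activate each coordinate with probability
    proportional to its score, so low-score coordinates are the
    most likely to be frozen (probabilistic masking)."""
    n = len(scores)
    if stochastic:
        rng = rng or np.random.default_rng(0)
        p_active = scores / scores.max()
        return rng.random(n) >= p_active   # True = frozen
    k = int(freeze_frac * n)
    frozen = np.zeros(n, dtype=bool)
    frozen[np.argsort(scores)[:k]] = True
    return frozen

# Four coordinates over a window of two steps (illustrative values).
hist = np.array([[0.9,  0.01, -0.5, 0.02],
                 [0.8, -0.02, -0.4, 0.01]])
scores = window_importance(hist)
mask = select_frozen(scores, freeze_frac=0.5)
# coordinates 1 and 3 (near-zero accumulated gradient) get frozen
```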

3. Freezing Schedules and Adaptive Mechanisms

Several temporal strategies govern the dynamics of parameter group freezing:

  • Static freezing: Certain groups remain frozen throughout (e.g., backbone layers in continual detection (Menezes et al., 2024), frozen decoders in PEFT (Dhole, 14 Jan 2025)).
  • Progressive freezing: Groups become eligible for freezing as their importance scores indicate stabilization, employing staged or ramped schedules (AFLoRA's cubic ramp (Liu et al., 2024), TimelyFreeze's LP-based ramp (Cho et al., 5 Feb 2026)).
  • Windowed freezing/unfreezing: Parameters are only allowed to update within fixed or adaptive windows, and reactivation is permitted to retain expressivity and avoid local minima (WSBD (Kverne et al., 11 Feb 2026)).
  • Two-timescale frameworks: In wireless FL, group freezing occurs at a large timescale (frames), while transmit power is dynamically controlled at finer granularity (Ouyang et al., 2 Apr 2025).

Pseudo-code for these routines typically involves: (1) computing importance metrics; (2) sorting/grouping; (3) applying mask/freeze status; (4) updating parameter subsets; and (optionally) (5) unfreezing/reactivating as scores change.
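Those five steps can be sketched as a single loop. The toy quadratic objective, window length, and recheck interval below are illustrative choices, not values from any cited work:

```python
import numpy as np

def train_with_freezing(theta, grad_fn, steps=200, window=10,
                        freeze_frac=0.5, lr=0.05, recheck=20):
    """Generic freeze/unfreeze loop: every `recheck` steps,
    (1) score coordinates by windowed gradient accumulation,
    (2) sort, (3) refresh the freeze mask, (4) update only active
    coordinates; (5) reactivation happens automatically because
    the mask is recomputed from fresh scores."""
    history = []
    frozen = np.zeros(theta.size, dtype=bool)
    for t in range(steps):
        g = grad_fn(theta)
        history.append(g)
        history = history[-window:]
        if t % recheck == 0 and len(history) == window:
            scores = np.abs(np.sum(history, axis=0))   # (1)
            k = int(freeze_frac * theta.size)
            frozen[:] = False                          # (5) reactivate
            frozen[np.argsort(scores)[:k]] = True      # (2)+(3)
        theta = theta - lr * np.where(frozen, 0.0, g)  # (4)
    return theta

# Toy objective 0.5 * ||theta - target||^2 with gradient theta - target.
target = np.array([1.0, -2.0, 0.5, 3.0])
theta = train_with_freezing(np.zeros(4), lambda th: th - target)
```

Because the mask is rebuilt from recent scores, coordinates with the largest outstanding gradients stay active, so the iterate still approaches the optimum even with half the coordinates frozen at any time.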

4. Theoretical Analysis and Convergence Guarantees

Freezing subgroups of parameters induces a bias-variance trade-off and raises questions around convergence and expressive capacity:

  • Bias from under-training: Freezing introduces a persistent estimation bias proportional to the fraction and stability of frozen parameters. Federated learning bounds the total loss as $\mathbb{E}[F(w_T)] - F^* \leq A\prod_{t=1}^{T}(1-\mu\eta_t) + B\sum_{t=1}^{T} \eta_t^2(1+\rho y_t)$, where $y_t$ is the freeze fraction (Ouyang et al., 2 Apr 2025).
  • Convergence under gradient masking: Under $L$-smoothness and a minimum activation probability $p_{\min}$ per coordinate, masked-gradient SGD satisfies $p_{\min}\sum_{t}\eta_t\,\mathbb{E}[\|\nabla\mathcal{C}\|^2] < \infty$, so the expected squared gradient norm vanishes along the iterates (Kverne et al., 11 Feb 2026).
  • Pipeline efficiency: TimelyFreeze frames execution as a linear program, constraining average per-stage freeze ratios so that the expected number of parameter updates stays above a prescribed minimum, which bounds iteration complexity (Cho et al., 5 Feb 2026).
  • Expressive capacity: Unlike pruning, most freezing methods (except permanent parameter removal) preserve the possibility of full model adaptation, as frozen groups can be reactivated (WSBD) or limited to trainable adapters (AFLoRA).
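The bias-variance trade-off can be made concrete by evaluating the federated-learning bound for a constant step size and freeze fraction. All constants below are illustrative, not values from (Ouyang et al., 2 Apr 2025):

```python
import numpy as np

def fl_bound(T, mu, eta, A, B, rho, y):
    """Evaluate A * prod(1 - mu*eta_t) + B * sum(eta_t^2 * (1 + rho*y_t))
    for constant step size eta and constant freeze fraction y.
    The first term contracts with T; the second grows with y."""
    contraction = A * (1 - mu * eta) ** T
    variance = B * T * eta**2 * (1 + rho * y)
    return contraction + variance

# A larger freeze fraction y inflates only the variance term:
low  = fl_bound(T=1000, mu=0.5, eta=0.01, A=1.0, B=1.0, rho=2.0, y=0.1)
high = fl_bound(T=1000, mu=0.5, eta=0.01, A=1.0, B=1.0, rho=2.0, y=0.9)
```

The contraction term is unaffected by freezing, which is why moderate freeze fractions trade a small additive bias for the compute and communication savings described above.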

5. Application Domains and Empirical Results

Parameter group freezing has been instantiated in a diverse set of practical machine learning and computational physics contexts:

| Domain | Grouping | Key Outcomes |
|---|---|---|
| LLMs | Decoder layers | ~50% fewer trainable parameters; better multilingual retention; similar or better NLG metrics (Dhole, 14 Jan 2025) |
| Continual object detection | Neck/head layers | Stability/plasticity trade-off; best-in-class mAP@0.5 (e.g., 57.9 with 75% entropy-based freezing) (Menezes et al., 2024) |
| Parameter-efficient fine-tuning (PEFT) | LoRA projections | Substantial trainable-parameter reduction; +0.85% average GLUE score; faster fine-tuning (Liu et al., 2024) |
| Quantum neural networks | Individual parameters | 63.9% faster than Adam; 80% forward-pass savings; robust to noise (Kverne et al., 11 Feb 2026) |
| Wireless federated learning | Coordinates | 15–20% total energy reduction at fixed accuracy; up to +3% accuracy at fixed energy (Ouyang et al., 2 Apr 2025) |
| Pipeline parallelism | Stage/microbatch | 36.3–40% throughput gain; ≤0.2% accuracy delta (LLaMA-8B) (Cho et al., 5 Feb 2026) |

For natural language generation, freezing decoders aligns with the pretraining distribution and avoids catastrophic forgetting, especially in multilingual regimes (Dhole, 14 Jan 2025). In object detection, hard layer freezing combined with entropy-based mining yields best performance/stability balance (Menezes et al., 2024). In quantum models, parameter-wise freezing via WSBD significantly mitigates the barren plateau effect (Kverne et al., 11 Feb 2026). In federated learning, coordinated two-timescale freezing and transmission control allow energy-bounded deployment with minimal loss in convergence (Ouyang et al., 2 Apr 2025).

6. Practical Recommendations and Limitations

Effective design and deployment of parameter group freezing require:

  • Appropriate importance metrics: Select activation or gradient-based criteria suited to domain (e.g., mean activation for detection, sum-of-gradients for QNNs).
  • Careful freeze fraction selection: Excessive freezing can severely degrade accuracy and plasticity (Menezes et al., 2024, Ouyang et al., 2 Apr 2025); moderate freeze fractions yield most of the gains.
  • Dynamic/interleaved schedules: Enable adaptation to training signal shifts, especially in multitask and non-stationary learning (Kverne et al., 11 Feb 2026, Liu et al., 2024).
  • Hardware/process awareness: Assess memory/compute tradeoffs (e.g., batch size limits, forward-pass counts).
  • Hyperparameter tuning: Freeze window, warm-up period, penalty parameters, and LP thresholds can substantially affect performance.
  • Accuracy risks: Freezing is generally safe if guided by reliable signals of parameter stabilization, but improper over-freezing or failing to adjust for uneven group sizes (e.g., neglecting Voronoi weights in discrete gauge simulations) leads to premature performance drop or even procedural biases (Dhole, 14 Jan 2025, Hartung et al., 2022).
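A minimal drift-based stabilization check, in the spirit of the wireless-FL criterion from Section 2, might look as follows (the threshold is an illustrative hyperparameter, not a published value):

```python
import numpy as np

def stabilized(param_history, threshold=1e-3):
    """Flag coordinates whose cumulative recent movement falls below
    `threshold` as candidates for freezing.

    param_history: (n_checkpoints, n_params) array of parameter
    snapshots taken during training.
    """
    diffs = np.abs(np.diff(param_history, axis=0))  # per-step movement
    drift = diffs.sum(axis=0)                       # cumulative drift
    return drift < threshold

# Coordinate 0 is still moving; coordinate 1 has settled.
hist = np.array([[0.00, 1.0000],
                 [0.10, 1.0003],
                 [0.25, 1.0004]])
mask = stabilized(hist)   # freeze candidate: coordinate 1 only
```

Gating freezing on an explicit stabilization signal like this is what makes the over-freezing failure mode above avoidable: coordinates are only removed from the active set once their trajectory has flattened.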

7. Specialized Variants: Freezing in Lattice Gauge Theory and Other Discrete Systems

Beyond machine learning, freezing transitions arise in digitized lattice gauge theory. There, constraints from finite group discretization (e.g., SU(2) subgroups) cause all but trivial configurations to be exponentially suppressed as coupling increases. The minimal angular separation among the discrete group elements sets the critical coupling at which freezing occurs, motivating "asymptotically dense" point sets (e.g., a generalised Fibonacci spiral on S^3) with explicit weighting to avoid simulation freezing (Hartung et al., 2022).

Table: Freezing Transition Parameters in Discretized SU(2) Simulations

| Discretization | Minimal angle | Freezing coupling | Notes |
|---|---|---|---|
| Binary subgroups | Fixed by the group order | Fixed; cannot be raised by adding points | Uniform weights; minimal angle bounded by the largest finite subgroup |
| Fibonacci lattice | Decreases with the number of points | Grows with the number of points, matching the refined Petcher-Weingarten estimate | Near-optimal; arbitrary point count; nearly uniform weights |

Practical guidance dictates choosing point sets dense enough that the freezing coupling lies beyond the couplings of interest, and using generalised Fibonacci schemes for near-uniform weights and the largest achievable freezing coupling at a given computational cost (Hartung et al., 2022).
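For intuition, the sketch below builds a Fibonacci lattice on the ordinary 2-sphere (the actual construction for SU(2) lives on S^3) and shows the minimal angular separation shrinking as the point count grows, which is what pushes the freezing coupling to larger values:

```python
import numpy as np

def fibonacci_sphere(n):
    """Fibonacci lattice on the 2-sphere (an illustrative stand-in
    for the generalised construction on S^3 used for SU(2))."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i          # golden-angle steps
    z = 1.0 - 2.0 * (i + 0.5) / n                   # even z-spacing
    r = np.sqrt(1.0 - z**2)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def min_angle(points):
    """Smallest pairwise angular separation, in radians."""
    dots = np.clip(points @ points.T, -1.0, 1.0)
    np.fill_diagonal(dots, -1.0)        # ignore self-pairs
    return np.arccos(dots.max())        # nearest pair -> smallest angle

# Denser sets shrink the minimal angle, allowing larger couplings
# to be simulated before freezing sets in.
a100 = min_angle(fibonacci_sphere(100))
a400 = min_angle(fibonacci_sphere(400))
```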

