
Parameter Group Freezing

Updated 21 April 2026
  • Parameter group freezing is a structured approach that partitions model parameters and selectively freezes groups to reduce memory/compute costs and boost efficiency.
  • It employs static or dynamic schedules based on metrics like gradient accumulation, activation statistics, and parameter drift to determine which groups to freeze.
  • The technique is widely applied in neural networks, quantum circuits, federated learning, and lattice gauge simulations, balancing training efficiency with model capacity.

Parameter group freezing is a structured approach for accelerating optimization, reducing memory/compute costs, or improving generalization by selectively fixing (i.e., freezing) disjoint subsets of model parameters during training or adaptation. The strategy is widely used in neural networks, quantum circuits, pipeline-parallel training, federated learning, continual learning, and lattice gauge simulations, with the specifics of partitioning, importance scoring, and scheduling varying by application domain.

1. Formal Definitions and Partitioning Strategies

Parameter group freezing entails partitioning the full parameter set $\theta$ into groups $\{\theta_g\}_{g=1}^{G}$, where each group represents contiguous tensors (layers, adapters), functionally similar blocks (encoders, decoders), or arbitrary index subsets (coordinates, LoRA projections). For frozen groups, gradients are neither computed nor applied:

$$\theta_g \leftarrow \begin{cases} \theta_g - \eta \nabla_{\theta_g} \mathcal{L} & \text{if } g \text{ is active} \\ \theta_g & \text{if } g \text{ is frozen} \end{cases}$$

The identification and scheduling of frozen groups may occur statically (e.g., fixed decoder freezing in PEFT (Dhole, 14 Jan 2025)) or dynamically, based on importance metrics computed during training (e.g., WSBD (Kverne et al., 11 Feb 2026), AFLoRA (Liu et al., 2024), TimelyFreeze (Cho et al., 5 Feb 2026)).
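The case-wise update rule can be sketched directly. The NumPy snippet below (group names and values are illustrative, not from any cited paper) applies one SGD step while leaving frozen groups untouched:

```python
import numpy as np

def grouped_sgd_step(params, grads, frozen, lr=0.1):
    """Apply one SGD step, skipping frozen parameter groups.

    params / grads: dict mapping group name -> np.ndarray
    frozen: set of group names whose parameters stay fixed
    """
    return {
        g: (p if g in frozen else p - lr * grads[g])
        for g, p in params.items()
    }

# Two groups: an "encoder" block that stays trainable and a
# "decoder" block that is statically frozen (illustrative names).
params = {"encoder": np.ones(3), "decoder": np.ones(3)}
grads  = {"encoder": np.full(3, 0.5), "decoder": np.full(3, 0.5)}

new = grouped_sgd_step(params, grads, frozen={"decoder"}, lr=0.1)
# encoder moves to 1 - 0.1 * 0.5 = 0.95; decoder stays at 1.0
```

In frameworks with autograd, the same effect is usually obtained more cheaply by excluding frozen tensors from gradient computation entirely, rather than computing and discarding their gradients.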

Partitioning granularity is highly task-dependent, ranging from whole layers or adapter modules down to individual coordinates.

2. Importance Criteria and Group Selection

Most adaptive freezing schemes rely on an "importance score" quantifying a parameter's or group's contribution to learning progress.

  • Gradient-accumulation metrics: WSBD defines a per-parameter sum of gradients over a sliding window, $s_k = \left|\sum_{t=1}^{\tau} \partial \mathcal{C}/\partial \theta_k\right| + \epsilon$, assigning the lowest activation probability to parameters with small accumulated magnitudes (Kverne et al., 11 Feb 2026).
  • Activation statistics: Layer mining for continual detection freezes groups according to (mean, median, variance, entropy) of activations (Menezes et al., 2024).
  • Gradient-based stability: AFLoRA computes EMAs of gradient magnitudes and their changes to generate a freezing score $s_\ell(t) = \mathrm{mean}(m_\ell(t) \odot u_\ell(t))$, incrementally freezing LoRA projections with the smallest $s_\ell(t)$ (Liu et al., 2024).
  • Parameter drift: Wireless FL monitors cumulative coordinate changes to determine which parameters have stabilized and are suitable for freezing (Ouyang et al., 2 Apr 2025).

The selection can be deterministic (freeze the lowest L% by some criterion) or stochastic (probabilistic masking as in WSBD). TimelyFreeze, in contrast, sets per-stage freeze ratios to optimize pipeline execution time under LP-imposed accuracy budgets (Cho et al., 5 Feb 2026).
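As a hedged sketch (not any paper's exact code), the windowed gradient-accumulation score and both selection modes, deterministic lowest-fraction and stochastic masking, might look like this:

```python
import numpy as np

def window_importance(grad_history, eps=1e-8):
    """WSBD-style score: |sum of gradients over a sliding window| + eps.

    grad_history: (tau, n_params) array of per-step gradients.
    """
    return np.abs(grad_history.sum(axis=0)) + eps

def select_frozen(scores, freeze_frac=0.5, stochastic=False, rng=None):
    """Deterministic mode: freeze the lowest-score fraction.
    Stochastic mode: activate each coordinate with probability
    proportional to its score, so low-score coordinates are the
    most likely to be frozen (probabilistic masking)."""
    n = len(scores)
    if stochastic:
        rng = rng or np.random.default_rng(0)
        p_active = scores / scores.max()
        return rng.random(n) >= p_active   # True = frozen
    k = int(freeze_frac * n)
    frozen = np.zeros(n, dtype=bool)
    frozen[np.argsort(scores)[:k]] = True
    return frozen

# Four coordinates over a window of two steps (illustrative values).
hist = np.array([[0.9,  0.01, -0.5, 0.02],
                 [0.8, -0.02, -0.4, 0.01]])
scores = window_importance(hist)
mask = select_frozen(scores, freeze_frac=0.5)
# coordinates 1 and 3 (near-zero accumulated gradient) get frozen
```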

3. Freezing Schedules and Adaptive Mechanisms

Several temporal strategies govern the dynamics of parameter group freezing:

  • Static freezing: Certain groups remain frozen throughout (e.g., backbone layers in continual detection (Menezes et al., 2024), frozen decoders in PEFT (Dhole, 14 Jan 2025)).
  • Progressive freezing: Groups become eligible for freezing as their importance scores indicate stabilization, employing staged or ramped schedules (AFLoRA's cubic ramp (Liu et al., 2024), TimelyFreeze's LP-based ramp (Cho et al., 5 Feb 2026)).
  • Windowed freezing/unfreezing: Parameters are only allowed to update within fixed or adaptive windows, and reactivation is permitted to retain expressivity and avoid local minima (WSBD (Kverne et al., 11 Feb 2026)).
  • Two-timescale frameworks: In wireless FL, group freezing occurs at a large timescale (frames), while transmit power is dynamically controlled at finer granularity (Ouyang et al., 2 Apr 2025).

Pseudo-code for these routines typically involves: (1) computing importance metrics; (2) sorting/grouping; (3) applying mask/freeze status; (4) updating parameter subsets; and (optionally) (5) unfreezing/reactivating as scores change.
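Those five steps can be sketched as a single loop. The toy quadratic objective, window length, and recheck interval below are illustrative choices, not values from any cited work:

```python
import numpy as np

def train_with_freezing(theta, grad_fn, steps=200, window=10,
                        freeze_frac=0.5, lr=0.05, recheck=20):
    """Generic freeze/unfreeze loop: every `recheck` steps,
    (1) score coordinates by windowed gradient accumulation,
    (2) sort, (3) refresh the freeze mask, (4) update only active
    coordinates; (5) reactivation happens automatically because
    the mask is recomputed from fresh scores."""
    history = []
    frozen = np.zeros(theta.size, dtype=bool)
    for t in range(steps):
        g = grad_fn(theta)
        history.append(g)
        history = history[-window:]
        if t % recheck == 0 and len(history) == window:
            scores = np.abs(np.sum(history, axis=0))   # (1)
            k = int(freeze_frac * theta.size)
            frozen[:] = False                          # (5) reactivate
            frozen[np.argsort(scores)[:k]] = True      # (2)+(3)
        theta = theta - lr * np.where(frozen, 0.0, g)  # (4)
    return theta

# Toy objective 0.5 * ||theta - target||^2 with gradient theta - target.
target = np.array([1.0, -2.0, 0.5, 3.0])
theta = train_with_freezing(np.zeros(4), lambda th: th - target)
```

Because the mask is rebuilt from recent scores, coordinates with the largest outstanding gradients stay active, so the iterate still approaches the optimum even with half the coordinates frozen at any time.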

4. Theoretical Analysis and Convergence Guarantees

Freezing subgroups of parameters induces a bias-variance trade-off and raises questions around convergence and expressive capacity:

  • Bias from under-training: Freezing introduces a persistent estimation bias proportional to the fraction and stability of frozen parameters. Federated learning bounds the total loss as $\mathbb{E}[F(w_T)] - F^* \leq A\prod_{t=1}^{T}(1-\mu\eta_t) + B\sum_{t=1}^{T} \eta_t^2(1+\rho y_t)$, where $y_t$ is the freeze fraction (Ouyang et al., 2 Apr 2025).
  • Convergence under gradient masking: Under $L$-smoothness and a minimum activation probability $p_{\min}$ per coordinate, masked-gradient SGD satisfies $p_{\min}\sum_{t}\eta_t\,\mathbb{E}[\|\nabla\mathcal{C}\|^2] < \infty$, so the expected squared gradient norm vanishes along the iterates (Kverne et al., 11 Feb 2026).
  • Pipeline efficiency: TimelyFreeze frames execution as a linear program, constraining average per-stage freeze ratios so that the expected number of parameter updates stays above a prescribed minimum, which bounds iteration complexity (Cho et al., 5 Feb 2026).
  • Expressive capacity: Unlike pruning, most freezing methods (except permanent parameter removal) preserve the possibility of full model adaptation, as frozen groups can be reactivated (WSBD) or limited to trainable adapters (AFLoRA).
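The bias-variance trade-off can be made concrete by evaluating the federated-learning bound for a constant step size and freeze fraction. All constants below are illustrative, not values from (Ouyang et al., 2 Apr 2025):

```python
import numpy as np

def fl_bound(T, mu, eta, A, B, rho, y):
    """Evaluate A * prod(1 - mu*eta_t) + B * sum(eta_t^2 * (1 + rho*y_t))
    for constant step size eta and constant freeze fraction y.
    The first term contracts with T; the second grows with y."""
    contraction = A * (1 - mu * eta) ** T
    variance = B * T * eta**2 * (1 + rho * y)
    return contraction + variance

# A larger freeze fraction y inflates only the variance term:
low  = fl_bound(T=1000, mu=0.5, eta=0.01, A=1.0, B=1.0, rho=2.0, y=0.1)
high = fl_bound(T=1000, mu=0.5, eta=0.01, A=1.0, B=1.0, rho=2.0, y=0.9)
```

The contraction term is unaffected by freezing, which is why moderate freeze fractions trade a small additive bias for the compute and communication savings described above.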

5. Application Domains and Empirical Results

Parameter group freezing has been instantiated in a diverse set of practical machine learning and computational physics contexts:

| Domain | Grouping | Key Outcomes |
|---|---|---|
| LLMs | Decoder layers | ~50% fewer trainable parameters; better multilingual retention; similar or better NLG metrics (Dhole, 14 Jan 2025) |
| Continual object detection | Neck/head layers | Stability/plasticity trade-off; best-in-class mAP@0.5 (e.g., 57.9 with 75% entropy-based freezing) (Menezes et al., 2024) |
| Parameter-efficient fine-tuning (PEFT) | LoRA projections | Substantial trainable-parameter reduction; +0.85% average GLUE score; faster fine-tuning (Liu et al., 2024) |
| Quantum neural networks | Individual parameters | 63.9% faster than Adam; 80% forward-pass savings; robust to noise (Kverne et al., 11 Feb 2026) |
| Wireless federated learning | Coordinates | 15–20% total energy reduction at fixed accuracy; up to +3% accuracy at fixed energy (Ouyang et al., 2 Apr 2025) |
| Pipeline parallelism | Stage/microbatch | 36.3–40% throughput gain; ≤0.2% accuracy delta (LLaMA-8B) (Cho et al., 5 Feb 2026) |

For natural language generation, freezing decoders aligns with the pretraining distribution and avoids catastrophic forgetting, especially in multilingual regimes (Dhole, 14 Jan 2025). In object detection, hard layer freezing combined with entropy-based mining yields best performance/stability balance (Menezes et al., 2024). In quantum models, parameter-wise freezing via WSBD significantly mitigates the barren plateau effect (Kverne et al., 11 Feb 2026). In federated learning, coordinated two-timescale freezing and transmission control allow energy-bounded deployment with minimal loss in convergence (Ouyang et al., 2 Apr 2025).

6. Practical Recommendations and Limitations

Effective design and deployment of parameter group freezing require:

  • Appropriate importance metrics: Select activation or gradient-based criteria suited to domain (e.g., mean activation for detection, sum-of-gradients for QNNs).
  • Careful freeze fraction selection: Excessive freezing can severely degrade accuracy and plasticity (Menezes et al., 2024, Ouyang et al., 2 Apr 2025); moderate freeze fractions yield most of the gains.
  • Dynamic/interleaved schedules: Enable adaptation to training signal shifts, especially in multitask and non-stationary learning (Kverne et al., 11 Feb 2026, Liu et al., 2024).
  • Hardware/process awareness: Assess memory/compute tradeoffs (e.g., batch size limits, forward-pass counts).
  • Hyperparameter tuning: Freeze window, warm-up period, penalty parameters, and LP thresholds can substantially affect performance.
  • Accuracy risks: Freezing is generally safe if guided by reliable signals of parameter stabilization, but improper over-freezing or failing to adjust for uneven group sizes (e.g., neglecting Voronoi weights in discrete gauge simulations) leads to premature performance drop or even procedural biases (Dhole, 14 Jan 2025, Hartung et al., 2022).
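A minimal drift-based stabilization check, in the spirit of the wireless-FL criterion from Section 2, might look as follows (the threshold is an illustrative hyperparameter, not a published value):

```python
import numpy as np

def stabilized(param_history, threshold=1e-3):
    """Flag coordinates whose cumulative recent movement falls below
    `threshold` as candidates for freezing.

    param_history: (n_checkpoints, n_params) array of parameter
    snapshots taken during training.
    """
    diffs = np.abs(np.diff(param_history, axis=0))  # per-step movement
    drift = diffs.sum(axis=0)                       # cumulative drift
    return drift < threshold

# Coordinate 0 is still moving; coordinate 1 has settled.
hist = np.array([[0.00, 1.0000],
                 [0.10, 1.0003],
                 [0.25, 1.0004]])
mask = stabilized(hist)   # freeze candidate: coordinate 1 only
```

Gating freezing on an explicit stabilization signal like this is what makes the over-freezing failure mode above avoidable: coordinates are only removed from the active set once their trajectory has flattened.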

7. Specialized Variants: Freezing in Lattice Gauge Theory and Other Discrete Systems

Beyond machine learning, freezing transitions arise in digitized lattice gauge theory. There, constraints from finite group discretization (e.g., SU(2) subgroups) cause all but trivial configurations to be exponentially suppressed as coupling increases. The minimal angular separation among the discrete group elements sets the critical coupling at which freezing occurs, motivating "asymptotically dense" point sets (e.g., a generalised Fibonacci spiral on S^3) with explicit weighting to avoid simulation freezing (Hartung et al., 2022).

Table: Freezing Transition Parameters in Discretized SU(2) Simulations

| Discretization | Minimal angle | Freezing coupling | Notes |
|---|---|---|---|
| Binary subgroups | Fixed by the group order | Fixed; cannot be raised by adding points | Uniform weights; minimal angle bounded by the largest finite subgroup |
| Fibonacci lattice | Decreases with the number of points | Grows with the number of points, matching the refined Petcher-Weingarten estimate | Near-optimal; arbitrary point count; nearly uniform weights |

Practical guidance dictates choosing point sets dense enough that the freezing coupling lies beyond the couplings of interest, and using generalised Fibonacci schemes for near-uniform weights and the largest achievable freezing coupling at a given computational cost (Hartung et al., 2022).
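For intuition, the sketch below builds a Fibonacci lattice on the ordinary 2-sphere (the actual construction for SU(2) lives on S^3) and shows the minimal angular separation shrinking as the point count grows, which is what pushes the freezing coupling to larger values:

```python
import numpy as np

def fibonacci_sphere(n):
    """Fibonacci lattice on the 2-sphere (an illustrative stand-in
    for the generalised construction on S^3 used for SU(2))."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i          # golden-angle steps
    z = 1.0 - 2.0 * (i + 0.5) / n                   # even z-spacing
    r = np.sqrt(1.0 - z**2)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def min_angle(points):
    """Smallest pairwise angular separation, in radians."""
    dots = np.clip(points @ points.T, -1.0, 1.0)
    np.fill_diagonal(dots, -1.0)        # ignore self-pairs
    return np.arccos(dots.max())        # nearest pair -> smallest angle

# Denser sets shrink the minimal angle, allowing larger couplings
# to be simulated before freezing sets in.
a100 = min_angle(fibonacci_sphere(100))
a400 = min_angle(fibonacci_sphere(400))
```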

