
Layer-Wise Subset Update

Updated 16 May 2026
  • Layer-wise subset update is a selective training technique that updates only chosen layers, enhancing computational efficiency and reducing over-parameterization.
  • It leverages static, greedy, gradient-based, and adaptive selection methods to optimize convergence speed and robustness across tasks.
  • This approach is critical in applications like fine-tuning, federated learning, and meta-learning, balancing performance with resource constraints.

Layer-wise subset update refers to the selective adaptation of only a subset of a model’s layers during training, fine-tuning, or inference. This approach, developed across deep learning and probabilistic modeling, aims to improve statistical efficiency, convergence speed, and generalization while reducing memory or communication overhead, by exploiting the heterogeneous roles and task sensitivities of different layers. The subset of layers to be updated can be selected statically, greedily, stochastically, or adaptively, often according to gradient-based, information-theoretic, or architecture- or task-driven criteria.

1. Principles and Motivation

Full-model updating, in which every layer’s parameters are updated at each optimization step, has been the canonical approach in deep learning. However, empirical and theoretical analyses have demonstrated that layers play heterogeneous roles and differ in their sensitivity to the task.

Layer-wise subset update frameworks formalize and exploit these observations with principled methods for selection, parameterization, and scheduling.

2. Subset Selection Criteria and Algorithms

Layer subset selection takes several forms:

  • Fixed or static selection: Pre-determined layers or blocks are always updated. Examples include tuning only the last $k$ layers or a consecutive “mid-block” (Kaplun et al., 2023, Zhao et al., 12 Apr 2026).
  • Greedy or performance-driven: Layers are chosen by measuring marginal gains in validation accuracy when each candidate block is updated, as in SubTuning (Kaplun et al., 2023). Greedy selection evaluates subsets iteratively for marginal improvement.
  • Gradient-norm or signal-based: Update only layers whose gradients have large norms, under the hypothesis that these layers contribute disproportionately to loss descent (IR-Tuning) (Liu et al., 30 Sep 2025), or use mean gradient norm as a soft-importance metric (GRASS) (Tian et al., 9 Apr 2026).
  • Dynamic/Adaptive/Sampling:
    • Adapting via statistics: GRASS produces per-layer sampling probabilities by softmaxing (with temperature) the mean gradient norms, adaptively resampled per training stage (Tian et al., 9 Apr 2026).
    • Bandit/multi-armed selection: AdaLeZO uses a nonstationary multi-armed bandit to allocate sampling budget to layers with high “reward” (proxy for sensitivity), adjusting over time (Wang et al., 20 Apr 2026).
    • Variance minimization/splitting: IR-Tuning finds a dynamic cutoff in gradient norm that minimizes within-set variance for important/redundant partitioning (Liu et al., 30 Sep 2025).
    • Personalization via conflict: In federated learning, the degree of layer-wise client gradient conflict is measured (e.g., via cosine similarity), and highly conflicting layers are excluded from global aggregation (Nguyen et al., 2024).
    • Straggler-aware scheduling: In synchronous federated learning, only layers updated by all (or most) clients are included in the aggregation per round, preserving partial updates from “stragglers” (Lang et al., 2024).
  • Optimization or geometric diagnostics: Information-theoretic and geometric metrics identify “stable plateaux” by measuring, e.g., Rényi entropy, effective rank, or CKA across layers (Mid-Block Efficient Tuning) (Zhao et al., 12 Apr 2026).
  • Stochastic or randomized progressive selection: A random subset of layer or block indices is sampled at each step for update (Drop-Muon, AdaLeZO), leveraging variance-reduced exploration or backward-pass efficiency (Gruntkowska et al., 2 Oct 2025, Wang et al., 20 Apr 2026).
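
Two of the gradient-based criteria above can be sketched in a few lines. The first function turns mean gradient norms into per-layer sampling probabilities with a temperature softmax, in the spirit of the GRASS description; the second searches for a cutoff that minimizes within-group spread of gradient norms, in the spirit of the IR-Tuning description. This is a minimal sketch, not either paper's implementation: the function names, the without-replacement sampler, the use of summed squared deviation as the "within-set variance", and the example norms are all assumptions made for illustration.

```python
import math
import random


def layer_sampling_probs(mean_grad_norms, temperature=1.0):
    """Softmax with temperature over per-layer mean gradient norms."""
    scaled = [g / temperature for g in mean_grad_norms]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_layers(probs, k, rng):
    """Draw k distinct layer indices, weighted by probs, without replacement."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(k):
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        weights.pop(pos)
    return sorted(chosen)


def variance_min_split(grad_norms):
    """Split layers into important/redundant at the cutoff that minimizes
    the summed within-group squared deviation of the gradient norms."""

    def spread(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    ordered = sorted(grad_norms)
    best_cut, best_cost = None, float("inf")
    for i in range(1, len(ordered)):
        if ordered[i] == ordered[i - 1]:
            continue  # a cutoff between equal values cannot separate them
        cost = spread(ordered[:i]) + spread(ordered[i:])
        if cost < best_cost:
            best_cut, best_cost = ordered[i], cost
    if best_cut is None:  # all norms equal: treat every layer as important
        return None, list(range(len(grad_norms)))
    important = [j for j, g in enumerate(grad_norms) if g >= best_cut]
    return best_cut, important


norms = [0.1, 0.2, 0.15, 2.0, 1.8]  # made-up per-layer gradient norms
probs = layer_sampling_probs(norms, temperature=0.5)
active = sample_layers(probs, k=2, rng=random.Random(0))
cutoff, important = variance_min_split(norms)
```

A lower temperature concentrates probability mass on the layers with the largest norms, while a higher one approaches uniform sampling; the split function needs no threshold hyperparameter because the cutoff is derived from the norms themselves.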

3. Mathematical Frameworks and Update Schedules

Let $\Theta = \{\theta_1,\dots,\theta_L\}$ denote model layer parameters.

  • Update mask: $m \in \{0,1\}^L$, with $m_j = 1$ iff layer $j$ is active. Subset update solves

$$\min_{\Theta, m} \mathcal{L}(f_{\Theta \odot m}) + \lambda R(m)$$

where $R(m) = \|m\|_0$ (Kaplun et al., 2023).

  • Gradient-based selection: Compute $a_j = \| \nabla_{\theta_j} \mathcal{L} \|_F$, then select layers according to dynamic splitting (IR-Tuning), a static threshold, or soft sampling (GRASS).
  • Randomized update scheduling (Drop-Muon): At step $k$, sample $S^k \subset \{1,\ldots,L\}$ and update only the layers indexed by $S^k$, using block-specific stepsizes and potentially non-Euclidean LMOs (Gruntkowska et al., 2 Oct 2025).
  • Structured sequential scheduling: In federated or block-coordinate regimes, cycle through layers in pre-defined (or optimized) order, updating each block for several rounds (FedPart) (Wang et al., 2024).
  • Adaptive/Personalized masks: For federated learning, select the update mask based on a per-layer inter-client gradient conflict score, keeping highly conflicting layers local (Nguyen et al., 2024).
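
The mask and the randomized schedule above can be illustrated with a toy update loop. The sketch below samples an active set each step, builds the binary mask, and applies a per-layer step only where the mask is 1. It substitutes plain gradient steps for Drop-Muon's non-Euclidean LMO updates, and the independent-inclusion sampler, the helper names, and the quadratic toy loss are assumptions chosen for brevity rather than anything taken from the cited papers.

```python
import random


def sample_active_layers(num_layers, keep_prob, rng):
    """Include each layer independently with probability keep_prob."""
    active = [j for j in range(num_layers) if rng.random() < keep_prob]
    if not active:  # avoid an empty step by forcing one layer in
        active = [rng.randrange(num_layers)]
    return active


def mask_from_indices(num_layers, active):
    """Build the binary mask m with m_j = 1 iff layer j is active."""
    chosen = set(active)
    return [1 if j in chosen else 0 for j in range(num_layers)]


def masked_sgd_step(params, grads, mask, stepsizes):
    """Apply theta_j <- theta_j - eta_j * grad_j only where m_j = 1."""
    updated = []
    for layer, layer_grad, m, eta in zip(params, grads, mask, stepsizes):
        if m:
            updated.append([w - eta * g for w, g in zip(layer, layer_grad)])
        else:
            updated.append(list(layer))  # frozen layer: copied unchanged
    return updated


# Toy problem: loss = 0.5 * sum of squared weights, so the gradient equals the weights.
params = [[1.0, 2.0], [3.0], [4.0, 5.0]]
stepsizes = [0.1, 0.1, 0.05]  # block-specific stepsizes (arbitrary values)
rng = random.Random(0)
for step in range(20):
    grads = [list(layer) for layer in params]
    active = sample_active_layers(len(params), keep_prob=0.5, rng=rng)
    mask = mask_from_indices(len(params), active)
    params = masked_sgd_step(params, grads, mask, stepsizes)
```

In a real framework the saving comes from skipping gradient computation and optimizer state for frozen layers, which this list-based sketch does not model; it only shows the bookkeeping of the mask and the per-block stepsizes.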

4. Empirical Performance and System Trade-Offs

Layer-wise subset update approaches have been empirically benchmarked on image classification, text classification, LLM alignment, federated learning, and few-shot meta-learning.

| Method | Accuracy vs. Full | Convergence | Resource Savings | Task Regime |
|---|---|---|---|---|
| IR-Tuning (Liu et al., 30 Sep 2025) | Comparable or ↑5–8% | 2–3× faster | 20–30% lower GPU memory | LLM fine-tuning, revision classification |
| SubTuning (Kaplun et al., 2023) | Comparable or ↑ (low-data) | As fast/faster | ≪10% layers tuned, ~95% acc | Vision, multi-task, few-shot, low data |
| GRASS (Tian et al., 9 Apr 2026) | Up to +4.38% | Matches full | Up to –62.8% memory (GPU) | LLM fine-tuning (arithmetic, commonsense) |
| Drop-Muon (Gruntkowska et al., 2 Oct 2025) | Parity | 1.4× speedup | Sublinear forward+back cost | ConvNet training, large batch/epoch |
| FedLUAR (Kim et al., 14 Mar 2025) | Parity | Parity | Down to 17% comm. cost | Distributed FL (CIFAR, AG News, FEMNIST) |
| FedPart (Wang et al., 2024) | +1–3pp | Fewer rounds | 28% comm, 73% compute | Federated learning (ResNets, Transformers) |
| FedLAG (Nguyen et al., 2024) | Up to +5% over baselines | Improved | Layer-personalized comm. | Personalized FL, non-IID, ResNets |
| SALF (Lang et al., 2024) | Up to +50% over drop-strag | Robust | No wasted partial updates | Synchronous FL, high-straggler |
| AdaLeZO (Wang et al., 20 Apr 2026) | Parity | 1.7–3.0× | Linear in active layers | ZO fine-tuning (LLM, batch/seq scaling) |
| LWAU (Qin et al., 2020) | ↑margin | ≥5× speedup | Focused on top layers | Few-shot, meta-learning (FSIC) |

Key findings include:

5. Theoretical Properties and Guarantees

Layer-wise subset update methods admit convergence and generalization guarantees:

  • Generalization: SubTuning’s generalization error bound depends on the number of parameters actually tuned, rather than on the full parameter count that governs full fine-tuning (Kaplun et al., 2023).
  • Optimization: Drop-Muon attains convergence rates in the deterministic setting over a weighted sum of gradient norms, with cost-optimality determined by block-wise smoothness; selective updates are shown to be strictly better unless smoothness is uniform (Gruntkowska et al., 2 Oct 2025).
  • Federated Learning: FedPart and FedLUAR admit convergence to stationary points or a noise neighborhood, with communication/computation proportional to the active layer fraction (Wang et al., 2024, Kim et al., 14 Mar 2025).
  • Variance/Unbiasedness: AdaLeZO’s inverse probability weighting preserves unbiasedness and upper-bounds variance under clipping (Wang et al., 20 Apr 2026).
  • Meta-learning: LWAU learns per-layer adaptation rates, concentrating updates on high-impact layers for few-shot efficiency (Qin et al., 2020).
  • Straggler robustness: SALF is unbiased relative to full FedAvg and comes with a guaranteed convergence rate (Lang et al., 2024).
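
The unbiasedness property attributed to inverse probability weighting can be checked on a small example. The sketch below assumes each layer $j$ is included independently with probability $p_j$ and that its gradient is rescaled by $1/p_j$ when included; it then averages the estimator exactly over all inclusion patterns. Scalars stand in for per-layer gradients, and the sampling scheme, the function names, and the reading of "clipping" as a floor on $p_j$ are assumptions, not AdaLeZO's actual procedure.

```python
from itertools import product


def ipw_estimate(grads, probs, included):
    """Reweight each sampled layer's gradient by 1/p_j; skipped layers give 0."""
    return [g / p if inc else 0.0 for g, p, inc in zip(grads, probs, included)]


def exact_expectation(grads, probs):
    """Average the estimator over all 2^L inclusion patterns, exactly."""
    expected = [0.0] * len(grads)
    for included in product([0, 1], repeat=len(grads)):
        weight = 1.0  # probability of this inclusion pattern
        for inc, p in zip(included, probs):
            weight *= p if inc else 1.0 - p
        estimate = ipw_estimate(grads, probs, included)
        expected = [e + weight * x for e, x in zip(expected, estimate)]
    return expected


def ipw_variance(grad, prob):
    """Per-layer variance of the estimator: g^2 * (1 - p) / p."""
    return grad * grad * (1.0 - prob) / prob


def floor_probs(probs, p_min):
    """Assumed form of clipping: never sample a layer with probability below p_min."""
    return [max(p, p_min) for p in probs]


grads = [2.0, -1.0, 0.5]  # scalar stand-ins for per-layer gradients
probs = [0.9, 0.5, 0.1]
expected = exact_expectation(grads, probs)
clipped = floor_probs(probs, p_min=0.25)
```

The expectation matches the full gradient for any strictly positive probabilities, while the variance term $g^2 (1 - p)/p$ grows as $p \to 0$; flooring the probabilities caps that growth at $g^2 (1 - p_{\min})/p_{\min}$ per layer.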

6. Design Trade-Offs, Extensions, and Open Problems

Layer-wise subset update incurs several architectural and statistical trade-offs:

  • Dynamic vs. fixed masks: Greedy or gradient-driven methods adapt to task or stage, but introduce scheduling overhead and complexity (Liu et al., 30 Sep 2025, Tian et al., 9 Apr 2026).
  • Communication/computing cost: Block-coordinate FL protocols allow partial knowledge sharing at the expense of full-layer cooperation; tuning the subset size and warmup is crucial (Wang et al., 2024, Kim et al., 14 Mar 2025).
  • Catastrophic forgetting: Freezing top layers or restricting updates to “stable” mid-blocks can mitigate overwriting, but may limit adaptation for some transfer tasks (Zhao et al., 12 Apr 2026).
  • Parallelism: Chromatic partitioning enables parallel sampling/updating in graphical models, with empirical wall-time gains (Brown et al., 2017).
  • Personalization/conflict: Gradient conflict-based selection naturally splits layers into global/local aggregation for personalized federated learning (Nguyen et al., 2024).
  • Randomness: Stochastic progressive selection (Drop-Muon) leverages hardware efficiency (backward caching), but requires careful schedule design (Gruntkowska et al., 2 Oct 2025).
  • Few-shot and low-data regimes: Subset updating prevents overfitting, yielding large gains over full and linear probe approaches (Kaplun et al., 2023, Qin et al., 2020).
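
The personalization trade-off listed above, splitting layers into globally aggregated and locally kept sets by gradient conflict, can be sketched as follows. The code scores each layer by the mean pairwise cosine similarity of client gradients and averages only the layers whose score clears a threshold. The scoring rule, the threshold of 0, and all names are illustrative assumptions rather than the FedLAG algorithm itself, and at least two clients are required.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two flat vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def split_layers_by_conflict(client_grads, threshold=0.0):
    """client_grads[c][j] is client c's flattened gradient for layer j.
    Layers whose mean pairwise cosine falls below threshold stay local."""
    num_layers = len(client_grads[0])
    shared, local = [], []
    for j in range(num_layers):
        sims = [cosine(a[j], b[j]) for a, b in combinations(client_grads, 2)]
        score = sum(sims) / len(sims)
        (shared if score >= threshold else local).append(j)
    return shared, local


def average_shared_layers(client_params, shared):
    """Average only the shared layers across clients, keyed by layer index."""
    num_clients = len(client_params)
    averaged = {}
    for j in shared:
        columns = zip(*(params[j] for params in client_params))
        averaged[j] = [sum(col) / num_clients for col in columns]
    return averaged


# Two clients, two layers: layer 0 gradients agree, layer 1 gradients oppose.
client_grads = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[2.0, 0.0], [-1.0, 0.0]],
]
shared, local = split_layers_by_conflict(client_grads)
```

Only the layers in `shared` need to be communicated and averaged each round, which is where the layer-personalized communication saving comes from; layers in `local` remain client-specific.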

Outstanding challenges include automated mask generation, adaptation to deeper/more heterogeneous architectures, integration with modular or expert-based models, and robust selection under dynamic or noisy training settings.

7. Applications Across Machine Learning Domains

Layer-wise subset update has become a foundational tool across fine-tuning, federated learning, and meta-learning.

In sum, layer-wise subset update provides a unifying methodology for efficient and adaptive neural network and probabilistic model training, enabling modern statistical learning in hardware-, communication-, and data-constrained environments.
