
Selective Joint Fine-Tuning in Deep Learning

Updated 15 March 2026
  • Selective Joint Fine-Tuning is a method that jointly optimizes primary and auxiliary objectives using selective data and parameter updates to mitigate overfitting and negative transfer.
  • It employs data-level selection via filter bank responses and adaptive hard-sample expansion to enhance performance on small-sample recognition tasks.
  • The framework extends to LLMs and federated learning by tuning core parameters or specific layers, achieving notable gains in accuracy and efficiency.

Selective joint fine-tuning is a framework for adapting large deep networks to new tasks under resource constraints, data scarcity, or the need for targeted edits, by dynamically selecting and jointly optimizing a subset of model parameters or data. This paradigm combines principles of transfer learning, multi-task learning, and parameter-efficient adaptation across domains including computer vision, language modeling, federated learning, and model unlearning. The essence of selective joint fine-tuning lies in two axes of selectivity: (1) joint optimization over both primary (“target”) and auxiliary (“source”) objectives, and (2) restriction to a judiciously chosen subset of data, neurons, parameters, or layers, often driven by gradient statistics or low-level characteristics.

1. Foundational Motivation and Frameworks

Selective joint fine-tuning addresses the instability, inefficiency, or suboptimality of indiscriminate parameter updates in scenarios where tasks, domains, or clients differ in size, data distribution, or operational constraints. In the original formulation for small-sample visual recognition, a "target" dataset with limited supervision is paired with a "source" domain of abundant data, but only a subset of source examples with similar low-level statistics (e.g., filter-bank descriptors) is used to regularize learning of shared convolutional features (Ge et al., 2017). This sample-level selection mitigates overfitting and improves the utility of source-derived features in the target context.

Extensions of this paradigm in LLMs, domain generalization, and federated learning operationalize selectivity at the level of neuron groups, parameter subspaces, or layers. In all cases, joint optimization maintains multi-objective signals without interference, while the selective restriction—either data-centric or parameter-centric—shields against noise, catastrophic forgetting, or negative transfer (Wang et al., 29 Aug 2025, Pan et al., 23 Aug 2025, Sun et al., 2024).

2. Data-Level and Filter Bank Selection

The data-centric selective joint fine-tuning approach pioneered by Ge and Yu (Ge et al., 2017) employs filter-bank responses to select source samples whose low-level characteristics closely match those of the target set. Formally, for a target task T_t with limited data D_t and a source task T_s with abundant data D_s, the method constructs per-image descriptors Φ(x) by concatenating histograms φ_h of filter responses (Gabor or early CNN). For each x_i^t ∈ D_t, its K_0 nearest neighbors in D_s are retrieved via symmetric KL divergence between descriptors. The resulting union D_s' regularizes shared convolutional weights during joint fine-tuning.
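The descriptor construction and symmetric-KL nearest-neighbor retrieval above can be sketched in a few lines of NumPy; this is a minimal illustration, where the bin count, smoothing constant, and K_0 value are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def descriptor(response_maps, bins=32):
    """Per-image descriptor: concatenated, normalized histograms of
    filter-bank responses (one histogram per filter)."""
    hists = []
    for resp in response_maps:
        h, _ = np.histogram(resp, bins=bins)
        h = h.astype(float) + 1e-8          # smooth so KL stays finite
        hists.append(h / h.sum())
    return np.concatenate(hists)

def sym_kl(p, q):
    """Symmetric KL divergence between two histogram descriptors."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def select_source(target_descs, source_descs, k0=2):
    """Union of each target image's k0 nearest source neighbors."""
    selected = set()
    for t in target_descs:
        dists = [sym_kl(t, s) for s in source_descs]
        selected.update(np.argsort(dists)[:k0].tolist())
    return sorted(selected)
```

The union over all target images yields the selected source subset D_s' used during joint training.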

A key feature is adaptive hard-sample expansion: after each fine-tuning epoch, target samples that remain hard (misclassified or with high entropy) receive more neighbors in the next round. The architectural setup consists of frozen lower layers pre-initialized (e.g., from ResNet-152 trained on ImageNet), two classification heads W_t and W_s, and a total loss that balances target and selected-source cross-entropy, typically with λ = 1.
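The joint objective (shared trunk, two heads, λ-weighted source term) can be sketched as follows; the concrete feature and head shapes are placeholders for illustration:

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def joint_loss(feat_t, feat_s, W_t, W_s, y_t, y_s, lam=1.0):
    """Target cross-entropy plus lam-weighted cross-entropy on a
    selected source sample; feat_* come from the shared trunk."""
    return cross_entropy(feat_t @ W_t, y_t) + lam * cross_entropy(feat_s @ W_s, y_s)
```

Gradients of the source term flow only into the shared trunk and W_s, which is what regularizes the shared convolutional features.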

Empirically, this approach improves accuracy over standard fine-tuning by 2–10% across small-data benchmarks (e.g., +9.8% on Stanford Dogs 120, +4.1% on Caltech-256 @ 15 samples/class), with regularization preventing overfitting on the scarce target data (Ge et al., 2017).

3. Parameter and Neuron-Level Selectivity

Recent advances expand selective joint fine-tuning by identifying, clustering, and jointly fine-tuning only parameter subsets (“core parameters”) most relevant to adaptation or unlearning objectives.

Core Parameter Isolation Fine-Tuning (CPI-FT), introduced for LLMs in (Wang et al., 29 Aug 2025), begins by probing each downstream task with single-task SFT and marks as "core" the top p% of parameters by update magnitude (p = 1% by default). Tasks are grouped by Jaccard overlap of core masks, forming clusters for joint modeling. In the joint stage, core regions are transplanted directly per task, non-core parameters are fused via SLERP (spherical linear interpolation), and subsequent multi-stage SFT updates only unfrozen (non-core) regions. This approach consistently outperforms vanilla (multi-task or staged) SFT, particularly in minimizing catastrophic forgetting (5.7 vs. 24.5 points on LLaMA-2-7B), and provides a principled workflow for precise, non-destructive adaptation.
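The core-mask extraction, Jaccard-based task grouping, and SLERP fusion can be illustrated roughly as below; the greedy clustering rule and threshold `tau` are simplifying assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def core_mask(delta, p=0.01):
    """Mark the top-p fraction of parameters by |update magnitude|."""
    k = max(1, int(round(p * delta.size)))
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    return np.abs(delta) >= thresh

def jaccard(a, b):
    """Jaccard overlap between two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def group_tasks(masks, tau=0.5):
    """Greedy clustering of tasks whose core masks overlap by >= tau."""
    clusters = []
    for i, m in enumerate(masks):
        for c in clusters:
            if jaccard(masks[c[0]], m) >= tau:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def slerp(w0, w1, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors,
    used to fuse non-core parameter regions."""
    n0 = w0 / (np.linalg.norm(w0) + eps)
    n1 = w1 / (np.linalg.norm(w1) + eps)
    omega = np.arccos(np.clip(np.dot(n0, n1), -1.0, 1.0))
    if omega < eps:                         # nearly parallel: fall back to lerp
        return (1 - t) * w0 + t * w1
    return (np.sin((1 - t) * omega) * w0 + np.sin(t * omega) * w1) / np.sin(omega)
```

SLERP interpolates along the great circle between the two weight vectors rather than along the chord, which better preserves weight norm than plain averaging.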

Selective Concept Unlearning (TRUST), for text-to-image diffusion, dynamically locates cross-attention “concept neurons” via CLIP-based alignment gradients, forming binary masks that identify tunable parameters. At each step, these masks are updated to track representation drift, and a Hessian-based regularization penalizes sensitivity to selected parameters, ensuring sharp, local edits without utility loss elsewhere. The procedure rapidly and reliably unlearns individual, composite, or conditional concepts (60–100 steps, <2% ASR), with superior efficiency and generalization compared to full fine-tuning or static mask baselines (Mansi et al., 8 Feb 2026).
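A schematic of gradient-driven neuron masking with a curvature-aware penalty, in the spirit of TRUST; the diagonal-Hessian stand-in and the specific update rule here are simplifying assumptions, not the published algorithm:

```python
import numpy as np

def concept_mask(align_grads, top_frac=0.01):
    """Binary mask over cross-attention weights with the largest
    concept-alignment gradient magnitudes (assumed selection rule)."""
    flat = np.abs(align_grads).ravel()
    k = max(1, int(top_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.abs(align_grads) >= thresh

def masked_step(params, grads, mask, lr=1e-3, beta_csr=0.1, hess_diag=None):
    """Update only masked parameters; optionally add a diagonal-Hessian
    sensitivity penalty as a sketch of Hessian-based regularization."""
    g = grads.copy()
    if hess_diag is not None:
        g = g + beta_csr * hess_diag * params
    return params - lr * mask * g
```

Recomputing `concept_mask` at each step is what lets the mask track representation drift as unlearning proceeds.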

4. Selective Joint Fine-Tuning Across Domains and Clients

In cross-domain and federated adaptation, selective joint fine-tuning is formulated as a sparse update problem, where only a small, judiciously chosen subset of parameters or layers is fine-tuned, preserving generalization.

Joint Parameter Selection (JPS) (Pan et al., 23 Aug 2025) targets domain generalization by defining, for a pre-trained model h_{θ_0} and source domains S_1, …, S_N, a joint mask M ∈ {0, 1}^m selecting ρm parameters. The importance operator selects parameters in the top-k gradient magnitude for all domains (intersection), and the variance operator further restricts to those with low across-domain gradient variance, filtering for parameters with both importance and cross-domain alignment. Only this static, sparse subset (ρ ~ 10⁻⁴) is then fine-tuned, achieving state-of-the-art domain generalization (76.4% on DomainBed benchmarks), outperforming full and prompt/adapter-based fine-tuning.
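The two JPS operators (cross-domain importance intersection, then low-variance filtering) can be sketched as below; the fractions `k_frac` and `var_frac` are illustrative knobs, not the paper's values:

```python
import numpy as np

def jps_mask(domain_grads, k_frac=0.2, var_frac=1.0):
    """JPS sketch: keep parameters in the top-k |gradient| of *every*
    domain (importance intersection), then keep the var_frac fraction
    with lowest across-domain gradient variance (alignment)."""
    G = np.stack(domain_grads)              # (num_domains, num_params)
    n = G.shape[1]
    k = max(1, int(k_frac * n))
    important = np.ones(n, dtype=bool)
    for g in G:
        thresh = np.partition(np.abs(g), -k)[-k]
        important &= np.abs(g) >= thresh    # intersection over domains
    mask = np.zeros(n, dtype=bool)
    idx = np.where(important)[0]
    if idx.size:
        order = np.argsort(G.var(axis=0)[idx])
        keep = idx[order[: max(1, int(var_frac * idx.size))]]
        mask[keep] = True
    return mask
```

Because the mask is computed once and held fixed, only the selected coordinates receive gradient updates during fine-tuning.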

Selective Layer Fine-Tuning in Federated Learning (Sun et al., 2024) extends the principle to per-client layer selection. In each federated round, clients report per-layer gradient norms; the central server coordinates selection to maximize global improvement (prioritizing high-norm layers) while penalizing mask divergence (weighted by a consensus parameter λ) to balance layer-level heterogeneity. Theoretical analysis decomposes convergence error into layer-omission (ε_{t,1}) and selection-inconsistency (ε_{t,2}) terms, showing empirically that dynamic gradient-driven selection matches or exceeds full fine-tuning on both image and text benchmarks at up to 8-fold lower computation/communication cost.
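A server-side sketch of this coordination step; the concrete scoring rule (summed client norms minus a λ-weighted cross-client variance penalty) is an assumption consistent with the priorities described above, not the paper's exact objective:

```python
import numpy as np

def select_layers(client_norms, budget, lam=0.1):
    """Score each layer by the summed per-client gradient norm, minus a
    lam-weighted cross-client variance penalty (discouraging layers the
    clients disagree on), then keep the top `budget` layers."""
    N = np.stack(client_norms)              # (num_clients, num_layers)
    score = N.sum(axis=0) - lam * N.var(axis=0)
    return set(np.argsort(score)[-budget:].tolist())
```

Raising `lam` pushes the selection toward layers whose usefulness is consistent across clients, trading per-client optimality for consensus.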

5. Optimization, Hyperparameterization, and Theoretical Guarantees

Optimizing selective joint fine-tuning procedures varies with the granularity of selectivity and the application domain:

  • Losses and Schedules: All frameworks employ standard cross-entropy or task-specific proxy losses. Regularization weights (e.g., λ in (Ge et al., 2017), β_CSR in (Mansi et al., 8 Feb 2026)) manage the trade-off between adaptation and preservation.
  • Sparsity and Mask Selection: Sparsity hyperparameters (p for core parameters, ρ for parameter masks, K_0 neighbors, or client budgets R_i for layers) are chosen by empirical tuning or according to validation metrics; typical values are p = 1%, ρ = 10⁻³ to 10⁻⁵, and neighborhood sizes such that |D_s'| ≳ 200,000 for vision (Ge et al., 2017, Wang et al., 29 Aug 2025, Pan et al., 23 Aug 2025).
  • Gradient and Curvature Statistics: Selection drivers include per-task update magnitudes, gradient norms, and variance (for stability and generalization), and, in TRUST, local curvature via Hessian-based penalties to stabilize critical neuron edits (Mansi et al., 8 Feb 2026).
  • Theoretical Analysis: Sparse-update generalization guarantees relate the expected error to sparsity, gradient alignment, and inter-domain discrepancy (Pan et al., 23 Aug 2025). In federated settings, convergence bounds reveal that both omission of high-gradient layers and cross-client inconsistency can create non-vanishing error floors (Sun et al., 2024).

6. Empirical Results, Impact, and Practical Recommendations

Empirical evaluations consistently show that selective joint fine-tuning outperforms naive or static partial tuning and full fine-tuning, particularly under resource limitations, data scarcity, or heterogeneity.

| Area | Task/Benchmark | Selectivity Type | Key Gains | Source |
| --- | --- | --- | --- | --- |
| Visual Transfer | Stanford Dogs, Caltech-256 | Data/sample-based | +2–10% accuracy vs. FT; SOTA on 4 datasets | (Ge et al., 2017) |
| LLM Adaptation | GSM8K, Alpaca, UltraChat | Parameter/subspace | +0.63 avg-normed points; forgetting of 5.7 vs. 24.5 points | (Wang et al., 29 Aug 2025) |
| Domain Generalization | DomainBed (5 vision sets) | Parameter/sparse | 76.4% vs. 72.2–75.5% SOTA; 1000× fewer params tuned | (Pan et al., 23 Aug 2025) |
| Federated Learning | CIFAR-10, XGLUE, QA | Layer/client-based | Matches full-FT accuracy at 1/8 cost; stable convergence | (Sun et al., 2024) |
| Model Unlearning | Stable Diffusion (SD v1.5) | Neuron/dynamic | <2% ASR, <0.03 ΔFID, 1–2 orders fewer steps | (Mansi et al., 8 Feb 2026) |

Practical recommendations include prioritizing layers/parameters by gradient statistics, dynamically updating selection masks, leveraging adaptive hard-sample expansion, and calibrating consensus regularization across clients or domains. Across all domains, selective joint fine-tuning is robust to distribution shift, reduces catastrophic interference, and scales to complex composition/decomposition tasks.

7. Extensions, Limitations, and Future Directions

Selective joint fine-tuning continues to evolve. Prominent extensions include compositional and conditional concept editing (TRUST's seamless joint erasure of combinations (Mansi et al., 8 Feb 2026)), adaptive/epoch-wise dynamic mask updates, and the incorporation of curvature-aware or second-order selection criteria. Emerging directions include probing the interplay between selection granularity (layer, neuron, parameter) and task/data geometry, better theoretical characterization of mask stability, and the generalization of the paradigm to novel domains (beyond vision/language/generative models). A plausible implication is that efficient, robust, and targeted adaptation will increasingly rely on joint, adaptive, and principled selectivity—coherently integrating parameter, layer, data, and neuron-level criteria.
