
Constrained Fine-Tuning: Techniques & Insights

Updated 12 November 2025
  • Constrained fine-tuning is a method that limits model parameter changes to maintain safety, privacy, robustness, and generalization during adaptation.
  • It employs techniques such as architectural freezing, dynamic adapters, and bilevel optimization to balance performance with explicit constraints.
  • Empirical results include up to a 31.5% reduction in training time and improved safety metrics in specialized deep learning models.

Constrained fine-tuning (FT) encompasses a family of techniques that impose explicit restrictions on LLM or deep neural network adaptation, with the goal of preserving desirable properties (such as safety, privacy, generalization, or efficiency) during specialization to new data or tasks. These constraints may take the form of regularizations in parameter space, hard architectural freezing, privacy-preserving mechanisms, or bilevel objective search, and are motivated by both theoretical and empirical analysis of the vulnerabilities and instability inherent in unconstrained FT regimes.

1. Formal Taxonomy and Motivations

Constrained FT is defined by any FT protocol that limits parameter drift, restricts which model weights can be updated, tunes only a small subset of parameters, or introduces specific regularizers or objectives to enforce safety, robustness, privacy, or efficiency constraints. The motivations include:

  • Safety alignment: Preserving safety properties and minimizing harmful output drift during FT in LLMs (Yang et al., 10 Jun 2025).
  • Generalization and robustness: Maintaining performance under distribution shift and resisting poisoned or biased fine-tuning data (Choi et al., 18 Jan 2024, Zhu et al., 25 Aug 2025).
  • Privacy: Satisfying $(\varepsilon, \delta)$-differential privacy (DP) requirements by reducing the sensitivity of trainable parameters or optimizing the privacy budget allocation (Ke et al., 29 Feb 2024).
  • Efficiency: Reducing compute and memory overhead for on-device or mobile adaptation by fine-tuning only a small subset (e.g., adapters or side-networks) (Yang et al., 1 Jul 2025, Jo et al., 4 Sep 2024).
  • Mitigating catastrophic forgetting: Selective parameter freezing to maintain world knowledge across successive FT rounds (Hui et al., 29 Apr 2024).
  • Controlling memorization and fine-tuning capacity: Quantifying the number of samples a side-network or adapter can override while preserving the frozen backbone (Sohn et al., 1 Aug 2024).

This landscape motivates a wide range of approaches, including regularization-based methods, architectural parameter sharing, task-driven objective search, and privacy- or efficiency-aware updates.

2. Alignment-Preserving and Safety-Constrained FT

Anchoring fine-tuning within provably safe regions in parameter space is necessary to mitigate the risk of model “jailbreaks” and the unintentional loss of safety guarantees. The AsFT methodology (Yang et al., 10 Jun 2025) introduces such a constraint via the concept of the alignment direction:

  • Alignment direction: $u_\text{aligned}$ is defined as the normalized difference between an aligned (safe) model and its unaligned base: $u_\text{aligned} = (\theta_\text{aligned} - \theta_\text{unaligned}) / \|\theta_\text{aligned} - \theta_\text{unaligned}\|_2$.
  • Safety basin: Empirically, perturbations along $u_\text{aligned}$ preserve safety over a wide range ($\epsilon_1$), while orthogonal perturbations rapidly induce unsafe behavior ($\epsilon_2 \ll \epsilon_1$).
  • Regularization: At each optimization step, decompose the parameter update $\Delta\theta$ into its component along $u_\text{aligned}$ and its orthogonal complement, and penalize the orthogonal part:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{task} + \lambda \left\| \Delta\theta - (\Delta\theta \cdot u_\text{aligned})\, u_\text{aligned} \right\|_2^2$$

  • Experimental results: AsFT outperforms Safe LoRA and other baselines, reducing the harmful score by 7.60% and raising accuracy by 3.44% on benchmarks such as AGNEWS under poisoning protocols.

This formulation requires access to both an aligned and an unaligned model to compute $u_\text{aligned}$ directly. The method is robust under model poisoning, sophisticated jailbreak attacks, and across architectures, though it is currently limited to text-only LLMs.
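As a concrete illustration, below is a minimal PyTorch sketch of an alignment-direction regularizer of this form, assuming flattened 1-D parameter tensors; the function name and penalty weight are illustrative, not taken from the paper.

```python
import torch

def alignment_regularizer(delta_theta: torch.Tensor,
                          u_aligned: torch.Tensor,
                          lam: float = 0.1) -> torch.Tensor:
    """Penalize the component of the parameter update that leaves
    the alignment direction (the off-direction drift)."""
    u = u_aligned / u_aligned.norm()           # ensure unit norm
    parallel = torch.dot(delta_theta, u) * u   # projection onto u_aligned
    orthogonal = delta_theta - parallel        # component leaving the safety basin
    return lam * orthogonal.pow(2).sum()

# Hypothetical usage inside a training step:
# delta_theta = params_flat(model) - params_flat(reference_model)
# loss = task_loss + alignment_regularizer(delta_theta, u_aligned)
```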

3. Parameter-Efficiency, Freezing, and Side-Network Constraints

Constrained FT can be realized by limiting the set of parameters that are updated, either via architectural freezing, dynamic adapter insertion, or additive side-network tuning:

  • Half Fine-Tuning (HFT) (Hui et al., 29 Apr 2024): Freezes exactly half of the parameter categories in each transformer block, leaving the rest trainable. A parameter mask $M$ enforces $(I - M)(\theta - \theta^0) = 0$, equivalent to a hard regularizer forcing zero deviation in the frozen coordinates (see the sketch below). HFT prevents catastrophic forgetting, achieves improved or matched in-domain performance, and reduces training time by up to 31.5%.
  • Input-conditioned, parameter-efficient FT (iConFormer) (Jo et al., 4 Sep 2024): Adapts only 1.6–2.8% of backbone parameters by injecting dynamic, sample-wise adapters after each MLP block in a transformer. The adapters use an input-conditioned network (iCoN) to generate per-instance convolutional kernels dynamically, yielding state-of-the-art performance under tight parameter budgets.
  • Mobile/offloading side-tuning (PAE MobiLLM) (Yang et al., 1 Jul 2025): Adapts only a small additive adapter entirely on a server, with the mobile device transmitting only a masked delta to protect privacy and a single-token activation to minimize communication. The server cannot recover the original labels or the fully adapted model due to a device-private random nonce.

Parameter-constrained FT thus enables not only hardware efficiency, but also privacy guarantees and adaptive compute allocation.
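The HFT-style mask can be realized simply by toggling gradient flags. Below is a minimal PyTorch sketch, assuming block indices appear as bare integers in parameter names (as in most transformer implementations); the even/odd split is an illustrative stand-in for the paper's per-category selection.

```python
import torch.nn as nn

def half_freeze(model: nn.Module, freeze_even_blocks: bool = True) -> None:
    """Illustrative HFT-style mask: freeze the parameters of every
    other transformer block (a stand-in for the per-category mask M)."""
    for name, param in model.named_parameters():
        # Assume the block index appears as a bare integer in the
        # parameter name, e.g. "model.layers.7.mlp.down_proj.weight".
        idx = next((int(p) for p in name.split(".") if p.isdigit()), None)
        if idx is not None and (idx % 2 == 0) == freeze_even_blocks:
            param.requires_grad = False  # frozen coordinates stay at theta_0
```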

4. Fine-Tuning Under Bi-Level, Proximal, and Data-Driven Constraints

Modern approaches frequently cast constrained FT as an optimization problem over both objective function space and hyperparameters, rather than restricting to hand-designed regularization forms:

  • AutoFT: Bi-level Learning for Robustness (Choi et al., 18 Jan 2024): FT is formulated as a bi-level problem, with an inner minimization (a weighted sum of candidate losses and regularizers on the task) and an outer maximization of OOD-validation performance. Black-box HPO (e.g., TPE) discovers the optimal balance of terms and hyperparameters (a bilevel search sketch appears at the end of this section). AutoFT achieves a +6.0% gain on WILDS iWildCam (OOD macro-F1) over the prior SOTA and consistently improves OOD robustness with minimal compute overhead.
  • Proximal Supervised Fine-Tuning (PSFT) (Zhu et al., 25 Aug 2025): Imposes a trust-region constraint inspired by PPO/TRPO, clipping the per-token policy update ratio to $[1-\epsilon, 1+\epsilon]$:

$$\mathcal{L}_\text{PSFT}(\theta) = -\,\mathbb{E}_{(s,a)\sim D}\left[\min\left\{\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)},\ \mathrm{clip}\!\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)},\, 1-\epsilon,\, 1+\epsilon\right)\right\}\right]$$

PSFT matches or exceeds SFT in-domain, markedly improves OOD generalization, and preserves policy entropy over long training runs, with a recommended $\epsilon$ between 0.2 and 0.28.
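A minimal PyTorch sketch of this objective on per-token log-probabilities follows; `logp_new` and `logp_old` are assumed to hold log-probs of the demonstration tokens under the current and pre-update policies, and with the implicit advantage fixed at 1 the PPO-style surrogate reduces to the expression below.

```python
import torch

def psft_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              eps: float = 0.2) -> torch.Tensor:
    """Clipped-ratio SFT surrogate (implicit advantage = 1)."""
    ratio = torch.exp(logp_new - logp_old.detach())     # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # trust region
    return -torch.min(ratio, clipped).mean()            # pessimistic surrogate
```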

These approaches demonstrate that effective constraint learning can be data-driven, dynamically adaptive, and practically feasible across a range of tasks and model sizes.
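For the bilevel search referenced above, the following is a hedged sketch using Optuna's TPE sampler as one plausible black-box HPO engine; `train_inner` and `evaluate_ood` are hypothetical stand-ins for the inner fine-tuning loop and the OOD-validation metric, and the particular searched weights are illustrative.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Outer level: propose a weighting of candidate losses/regularizers.
    w_ce = trial.suggest_float("w_ce", 0.0, 1.0)                   # task cross-entropy
    w_drift = trial.suggest_float("w_drift", 1e-4, 1.0, log=True)  # parameter-drift penalty
    lr = trial.suggest_float("lr", 1e-6, 1e-3, log=True)

    # Inner level: fine-tune under the weighted objective
    # (train_inner is a hypothetical helper wrapping the FT loop).
    model = train_inner(loss_weights={"ce": w_ce, "drift": w_drift}, lr=lr)

    # Outer objective: OOD validation performance (hypothetical helper).
    return evaluate_ood(model)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
```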

5. Linearly Constrained and Differentially Private FT

Specialized constraints arise in privacy-sensitive or high-stakes settings where minimizing update magnitude or parameter drift is essential:

  • Linear Probing then Fine-Tuning (LP-FT) (2405.16747, Ke et al., 29 Feb 2024): LP-FT proceeds in two phases: first, freezing the backbone and fitting only the linear head (LP); then unfreezing all parameters (FT), starting from the optimized head. NTK analysis reveals that the large head norm produced by LP constrains subsequent feature change, yielding robust performance. Temperature scaling corrects the sharper output calibration induced by LP's large head norm.

In DP fine-tuning (Ke et al., 29 Feb 2024), the LP-FT ordering is critical: LP incurs far less DP noise, while full-parameter FT is heavily penalized under the privacy budget. For small $\varepsilon$ (tight privacy), LP alone is optimal; for large $\varepsilon$, full FT is best; in between, a two-stage LP-FT schedule, with the privacy split tuned by grid search or an analytic utility curve, yields maximal accuracy.
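A minimal PyTorch sketch of the two-stage schedule, assuming a model exposing `backbone` and `head` submodules; `train_fn`, the epoch counts, and the learning rates are illustrative placeholders rather than values from the papers.

```python
import torch

def lp_ft(model, train_fn, lp_epochs: int = 5, ft_epochs: int = 3) -> None:
    """Stage 1 (LP): freeze the backbone and fit only the linear head.
    Stage 2 (FT): unfreeze everything and continue from the LP head."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    head_opt = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
    train_fn(model, head_opt, epochs=lp_epochs)    # linear probing

    for p in model.backbone.parameters():
        p.requires_grad = True
    full_opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    train_fn(model, full_opt, epochs=ft_epochs)    # full fine-tuning
```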

6. Capacity, Expressivity, and Theoretical Foundations

The theoretical underpinnings for capacity and expressivity in constrained FT have been formalized for additive side-network architectures:

  • Fine-Tuning Capacity (FTC) (Sohn et al., 1 Aug 2024): For a pre-trained (frozen) network with an additive side-network $g_\theta$, the FTC gives the maximal number of sample labels $N$ that can be overridden given the number of hidden units $m$. For 2-layer ReLU adapters, $m^\star_\text{FTC}(N,K) \in \Theta(N)$; for 3-layer adapters, $m^\star_\text{FTC}(N,K) = \Theta(\sqrt{N})$ (tight). For small $N$, a shallow 3-layer network of width $O(\sqrt{N})$ suffices, whereas a 2-layer network requires width $O(N)$.

This result guides adapter or side-network design in parameter-efficient FT under strict capacity constraints.
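As a worked illustration of these bounds (treating the constants hidden in the $\Theta(\cdot)$ notation as order one, which is an assumption):

```latex
% Worked example: overriding N = 10^4 labels with an additive side-network
N = 10^4:\quad
m^\star_{\text{3-layer}} = \Theta(\sqrt{N}) \approx 10^2 \ \text{hidden units},
\qquad
m^\star_{\text{2-layer}} = \Theta(N) \approx 10^4 \ \text{hidden units}.
```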

7. Vulnerabilities and Black-Box Constraints

Constrained FT does not guarantee robustness to adversarial data, especially under provider-enforced black-box or safety-constrained interfaces:

  • Black-box FT and Jailbreaks (Li et al., 1 Oct 2025): In a setting where the provider enforces input filters, defensive FT, and post-training audits, adversaries may use data- and model-blind strategies (e.g., safety-styled wrappers, lexical encoding, backdoor triggers) to inject covert harmful behavior. Such attacks achieve attack success rates above 97% on commercial LLMs with negligible utility drop, demonstrating that constraint mechanisms relying solely on superficial tokens or surface-level patterns are insufficient. Defense requires semantic-level filtering, diversified audits, and rigorous separation of defense phases.

Summary Table: Core Mechanisms in Constrained FT

Method              Core Constraint Type                  Context/Goal
AsFT                Alignment-direction regularization    Safety alignment
HFT                 Partial freezing (mask)               Forgetting mitigation, efficiency
iConFormer          Small dynamic adapters                Efficiency, instance specificity
PAE MobiLLM         Additive, server-side-only adapter    Privacy and on-device FT
AutoFT              Bilevel, searched objective           OOD robustness
PSFT                Trust-region clipping                 Stability, OOD retention
LP-FT (DP/NTK)      Linear head, staged FT                Privacy, feature preservation
FTC                 Adapter width-depth bounds            Capacity, efficiency
Black-box defenses  Provider-stage filtering              Robustness, attack resistance

Future Directions

Key open research issues include:

  • Extending alignment-anchoring methods to multimodal and continual-learning settings (Yang et al., 10 Jun 2025);
  • Automating dynamic constraint learning (e.g., updating alignment subspaces online, multi-subspace anchoring, combinatorial bilevel search);
  • Designing semantic and non-surface-level safety audits for black-box constraints (Li et al., 1 Oct 2025);
  • Tightening theoretical characterization of memorization and generalization under architectural and regularization constraints (Sohn et al., 1 Aug 2024);
  • Provably efficient and robust on-device FT with composable (privacy, compute, safety) guarantees (Yang et al., 1 Jul 2025);
  • Integration of data- and parameter-space constraints for synergistic defense.

Constrained fine-tuning thus integrates theoretical analysis, principled regularization, and architectural and systems innovation to enable robust, safe, efficient, and private adaptation of foundation models.

