
Safety-Preserving Fine-Tuning (SPF)

Updated 4 July 2026
  • Safety-Preserving Fine-Tuning (SPF) is a framework that adapts pre-aligned language models to new tasks without degrading built-in safety mechanisms.
  • It encompasses a range of interventions including pre-fine-tuning hardening, fine-tuning adjustments, and post-hoc repair to balance utility and safety.
  • Empirical studies show that conventional fine-tuning can increase harmful outputs, while SPF methods use strategies like KL regularization and subspace projections to maintain safety.

Safety-preserving fine-tuning (SPF) is the problem of adapting an already safety-aligned LLM to downstream tasks without eroding refusal behavior, harmful-content robustness, or related alignment properties. In the recent literature, SPF is treated not as a single algorithm but as a family of interventions spanning pre-fine-tuning hardening, modifications to the fine-tuning process itself, and post-hoc repair or realignment after adaptation. Across these works, the central empirical observation is consistent: benign supervised fine-tuning, parameter-efficient adaptation, and small harmful fine-tuning sets can all degrade safety in models that were previously aligned, including chat and instruct variants of Llama, Qwen, Gemma, Mistral, and MoE architectures (Djuhera et al., 21 Mar 2025, Perin et al., 18 Jun 2025, Zhang et al., 16 Oct 2025).

1. Scope, problem setting, and taxonomic structure

SPF assumes a starting point that is already aligned, typically an instruction-tuned or RLHF-tuned model such as Llama-2-7B-Chat, Qwen-2-7B-Instruct, Llama-3.1-8B-Instruct, or Gemma-2-9B-it. The downstream objective is standard adaptation: math reasoning, biomedical QA, summarization, instruction following, classification, or domain specialization. The constraint is that adaptation should not substantially increase harmful outputs on fixed safety benchmarks, reduce refusal rates, or distort the internal structures that support safe behavior (Djuhera et al., 21 Mar 2025, Liu et al., 24 Mar 2025, Goel et al., 19 Feb 2026).

The surveyed literature organizes SPF interventions at three stages. Some methods harden an aligned checkpoint before downstream training; some alter the fine-tuning process so that optimization remains close to a safe reference or preserves specific internal structures; others repair a task-fine-tuned model afterward through merging, low-rank fusion, delta editing, or second-order restoration (Djuhera et al., 21 Mar 2025, Perin et al., 18 Jun 2025, Lu et al., 17 May 2025).

| Method | Stage | Core mechanism |
| --- | --- | --- |
| LoX | Pre-fine-tuning | Low-rank extrapolation of alignment directions |
| LookAhead Tuning | Fine-tuning-stage | Prefix previews to reduce early-token drift |
| Adaptive Regularization | Fine-tuning-stage | Risk-conditioned KL to a safe reference |
| GuardSpace | Fine-tuning-stage | Safety-sensitive subspace freezing and harmful-resistant null-space projection |
| PACT | Fine-tuning-stage | KL regularization on safety-token confidence |
| SafeMoE | Fine-tuning-stage | Routing-alignment regularization for harmful inputs in MoE layers |
| SafeMERGE | Post-fine-tuning | Selective layer-wise merging with a safety adapter |
| Safe Delta | Post-fine-tuning | Delta selection under a safety budget plus compensation vector |
| LSSF | Post-fine-tuning | Low-rank safety-subspace fusion |
| Curvature-Aware Safety Restoration | Post-fine-tuning | Second-order restoration on harmful vs retain sets |
| RefusalGuard | Fine-tuning-stage | Geometry-preserving representation-level constraints |
RefusalGuard Fine-tuning-stage Geometry-preserving representation-level constraints

This taxonomy is not merely organizational. It reflects different assumptions about deployment. If the provider controls the training stack, fine-tuning-stage regularization is feasible. If only the final checkpoint is available, post-hoc restoration becomes the relevant SPF modality. If the objective is to release a model that remains robust under arbitrary future fine-tuning, pre-fine-tuning hardening becomes attractive (Perin et al., 18 Jun 2025, Djuhera et al., 21 Mar 2025).

2. Mechanistic accounts of safety degradation

A major theme in SPF research is that safety degradation is structured rather than arbitrary. Several papers argue that safety is encoded in specific subspaces, token distributions, routing patterns, or representation geometries, and that standard fine-tuning perturbs precisely those structures.

SafeMERGE starts from the observation that benign LoRA fine-tuning can substantially raise harmful output rates even on non-harmful task data. For Llama-2-7B-Chat fine-tuned on GSM8K, DirectHarm rises from 5.0% to 27.8% and HexPhi from 2.0% to 16.4%; for Qwen-2-7B-Instruct on GSM8K, DirectHarm rises from 18.2% to 25.3% and HexPhi from 11.5% to 16.8% (Djuhera et al., 21 Mar 2025). LookAhead Tuning gives a complementary account: safety degradation correlates with increased KL divergence on the first few output tokens, suggesting that early-token distributions carry much of the model’s refusal behavior (Liu et al., 24 Mar 2025).

LoX frames the same phenomenon in low-rank parameter geometry. It decomposes the alignment update $\Delta W_{\text{align}}^i$ with SVD and shows that a small number of top singular directions suffice to recover near-zero ASR; after benign fine-tuning, the relative prominence of this low-rank safety subspace shrinks, and higher $R_{\text{ft}}/R_{\text{align}}$ correlates with lower post-fine-tuning ASR (Perin et al., 18 Jun 2025). A mechanistic study of safety fine-tuning reaches a related but more activation-centric conclusion: supervised safety fine-tuning, DPO, and unlearning minimally transform late-layer MLP weights so that unsafe inputs are aligned into the weights’ null space, while jailbreaks succeed by shifting activations toward the cluster associated with safe samples (Jain et al., 2024).

Architecture-specific failure modes also appear. In MoE models, SafeMoE identifies safety routing drift: harmful inputs in safety-aligned MoE LLMs are normally routed to safety-critical experts, but fine-tuning changes those routing weights, and the KL divergence between pre- and post-fine-tuning routing distributions on harmful prompts correlates strongly with harmfulness, with reported Pearson coefficients of $r \approx 0.88$ to $0.98$ (Kim et al., 26 Sep 2025). RefusalGuard advances a representation-geometry account: standard fine-tuning reduces alignment with refusal subspaces, lowers projected magnitude in those subspaces, increases subspace drift and task-safety interference, and distorts refusal-cone geometry; these changes track rising ASR under harmful adaptation (Asif et al., 3 May 2026).

An optimization-centered perspective complicates any simple “inevitable trade-off” story. The optimization study on Llama families argues that much of the reported safety loss under benign fine-tuning is driven by unstable hyperparameters rather than an unavoidable utility–safety conflict. On Llama-2-Chat-7B with Dolly, naive fine-tuning yields 15.96% ASR, best-tuned training yields 4.62%, and EMA reduces ASR further to 2.70%, while utility remains competitive (Kim et al., 17 Aug 2025). Taken together, these results suggest that safety degradation can arise through multiple interacting mechanisms: early-token drift, low-rank subspace erosion, routing drift, representation-geometry distortion, and optimization instability.

3. Fine-tuning-stage SPF methods

Fine-tuning-stage SPF methods intervene directly in training, but they differ sharply in what they constrain: data prefixes, token distributions, gradients, hidden representations, routing weights, or parameter subspaces.

LookAhead Tuning is a data-centric method that leaves the cross-entropy objective unchanged but rewrites the training pairs. In the “Real Answer” variant, the prompt is augmented with the true first $m$ answer tokens; in the “Virtual Answer” variant, the prompt and target are prefixed with a fixed phrase such as “Let’s solve this problem.” The aim is to minimize drift in the initial output tokens, which the paper links to preserved refusal behavior. On LLaMA2-7B-Chat, vanilla fine-tuning drops average Raw Safe Rate (RSR) from 99.39 to 82.88 and Jailbreak Safe Rate (JSR) from 90.30 to 38.79, whereas LookAhead Tuning (Virtual) reaches 98.03 RSR and 59.55 JSR with utility 46.24 versus 47.83 for vanilla fine-tuning (Liu et al., 24 Mar 2025).
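The two rewriting variants can be illustrated with a minimal sketch. It is not the authors' implementation: whitespace splitting stands in for the model tokenizer, and the function names and the way the preview is joined to the prompt are assumptions made for illustration.

```python
from typing import Tuple

def real_answer_preview(prompt: str, answer: str, m: int) -> Tuple[str, str]:
    """'Real Answer' style: append the first m answer tokens to the prompt.

    Assumption: tokens are whitespace-separated words, and the target is
    left unchanged.
    """
    preview = " ".join(answer.split()[:m])
    return f"{prompt} {preview}", answer

def virtual_answer_preview(
    prompt: str, answer: str, phrase: str = "Let's solve this problem."
) -> Tuple[str, str]:
    """'Virtual Answer' style: attach a fixed phrase to both prompt and target."""
    return f"{prompt} {phrase}", f"{phrase} {answer}"

# Rewrite a single (prompt, answer) training pair both ways.
real_pair = real_answer_preview("What is 2+3?", "The answer is 5.", m=2)
virtual_pair = virtual_answer_preview("What is 2+3?", "5")
```

In both cases the loss function is untouched; only the training pairs change, which is what makes the method data-centric.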

Adaptive regularization methods keep a frozen aligned reference policy and scale the KL penalty by a safety-risk signal. “Learning to Stay Safe” defines

$$\mathcal{L}_{\text{tot}}^{(t)} = \alpha_t \mathcal{L}_{\text{NLL}}^{(t)} + \beta_t \mathcal{L}_{\text{KL}}^{(t)},$$

with $\beta_t$ interpolated between $\beta_{\min}=0.1$ and $\beta_{\max}=0.9$ using either a judge-based Safety Critic or an activation-based risk predictor. Under harmful fine-tuning on HEx-PHI, standard SFT pushes ASR to roughly 96–97% across multiple model families, whereas adaptive regularization keeps ASR around 1–9%, often near the initial aligned model, while maintaining GSM8K and Alpaca utility (Goel et al., 19 Feb 2026).
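A small sketch of the risk-conditioned weighting, under stated assumptions: the interpolation is taken to be linear in a risk score in [0, 1], the NLL weight is set to 1 minus the KL weight, and the KL term is computed from the frozen reference to the current policy. None of these three choices is specified in the text above, and all function names are hypothetical.

```python
import math
from typing import Sequence

BETA_MIN, BETA_MAX = 0.1, 0.9  # bounds quoted in the text

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def adaptive_beta(risk: float) -> float:
    """Assumption: linear interpolation between BETA_MIN and BETA_MAX."""
    risk = min(max(risk, 0.0), 1.0)
    return BETA_MIN + risk * (BETA_MAX - BETA_MIN)

def total_loss(nll: float, ref_probs: Sequence[float],
               policy_probs: Sequence[float], risk: float) -> float:
    """Weighted sum of task NLL and KL to the frozen aligned reference."""
    beta = adaptive_beta(risk)
    alpha = 1.0 - beta  # assumption: the two weights sum to one
    return alpha * nll + beta * kl_divergence(ref_probs, policy_probs)

# A high-risk batch puts most of the weight on staying close to the reference.
loss_high_risk = total_loss(2.0, [0.7, 0.2, 0.1], [0.2, 0.3, 0.5], risk=0.95)
loss_low_risk = total_loss(2.0, [0.7, 0.2, 0.1], [0.2, 0.3, 0.5], risk=0.05)
```

The risk score would come from the Safety Critic or the activation-based predictor; here it is simply passed in as a number.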

GuardSpace constrains LoRA at the level of linear-algebraic structure. It decomposes each linear layer with covariance-preconditioned SVD into safety-relevant and safety-irrelevant components using harmful activations, initializes LoRA adapters from the safety-irrelevant components, and multiplies the adapter update by a harmful-resistant null-space projector $\mathbf{P}$. The effective weight is $\mathbf{W}_{\text{eff}} = \mathbf{W}' + \mathbf{B}\mathbf{A}\mathbf{P}$, with the design property that adapter updates vanish on harmful activations. On Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace reduces Harmful Score from 14.4% to 3.6% relative to AsFT while improving accuracy from 26.0% to 28.0% (Zhang et al., 16 Oct 2025).
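The "adapter updates vanish on harmful activations" property can be checked numerically with a toy projector. This sketch only illustrates that property: it builds the projector from a plain SVD of a few random "harmful" activation vectors rather than the covariance-preconditioned decomposition the paper describes, and the function name, dimensions, and random matrices are illustrative assumptions.

```python
import numpy as np

def null_space_projector(harmful_acts: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Return P with P @ h ~ 0 for every h in the span of the given rows.

    harmful_acts has shape (n_samples, d); each row is one activation vector.
    """
    _, s, vt = np.linalg.svd(harmful_acts, full_matrices=False)
    basis = vt[s > tol * s.max()]  # orthonormal rows spanning the harmful activations
    return np.eye(harmful_acts.shape[1]) - basis.T @ basis

rng = np.random.default_rng(0)
d, rank = 8, 2
H = rng.normal(size=(3, d))        # toy harmful activations (assumption)
W = rng.normal(size=(d, d))        # stands in for the frozen weight W'
B = rng.normal(size=(d, rank))     # LoRA factors
A = rng.normal(size=(rank, d))

P = null_space_projector(H)
W_eff = W + B @ A @ P

# On harmful activations the adapter contributes nothing.
harmful_gap = np.abs(W_eff @ H.T - W @ H.T).max()
```

On inputs outside the harmful span the adapter term is generally nonzero, so task learning can still proceed there.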

PACT shifts the intervention locus from parameters to tokens. It first identifies a small set of “safety tokens” whose probabilities are much higher in the aligned model than in the corresponding base model on harmful prompts, then regularizes the fine-tuned model to match the aligned reference only on those token probabilities. Its final objective is

[equation missing in source: PACT training objective]

On Qwen2.5-7B-Instruct fine-tuned on GSM8K with 10% harmful data, vanilla SFT attains 81.65% accuracy but 94.50% HarmBench ASR, whereas PACT keeps accuracy at 80.89% and reduces HarmBench ASR to 29.50%; on AGNews with 10% harmful data, PACT achieves 89.10% accuracy with HarmBench ASR 13.50% (Wang et al., 8 Mar 2026).
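The two PACT steps described above (selecting safety tokens, then constraining only their probabilities) can be sketched as follows. This is a hedged illustration, not the paper's objective: the selection threshold `min_gap`, the use of a per-token binary KL as the penalty, and all function names are assumptions.

```python
import math
from typing import Dict, List

def select_safety_tokens(aligned: Dict[str, float], base: Dict[str, float],
                         min_gap: float = 0.2) -> List[str]:
    """Tokens whose probability on harmful prompts is much higher in the
    aligned model than in the base model (threshold is an assumption)."""
    return [tok for tok, p in aligned.items() if p - base.get(tok, 0.0) >= min_gap]

def binary_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def safety_token_penalty(aligned: Dict[str, float], tuned: Dict[str, float],
                         safety_tokens: List[str]) -> float:
    """Penalize drift from the aligned reference on safety tokens only."""
    return sum(binary_kl(aligned[t], tuned.get(t, 0.0)) for t in safety_tokens)

# Toy first-token distributions on a harmful prompt (made-up numbers).
aligned = {"I": 0.60, "Sorry": 0.30, "Sure": 0.05, "The": 0.05}
base = {"I": 0.10, "Sorry": 0.02, "Sure": 0.60, "The": 0.28}
tuned = {"I": 0.20, "Sorry": 0.05, "Sure": 0.55, "The": 0.20}

tokens = select_safety_tokens(aligned, base)
penalty = safety_token_penalty(aligned, tuned, tokens)
```

Because the penalty touches only the selected tokens, the rest of the output distribution stays free to adapt to the downstream task.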

SAP targets gradients rather than outputs. It introduces hidden-state probes [symbol missing in source] into selected layers and defines a “safe-useful” loss

[equation missing in source]

where [symbol missing in source] is derived from a contrastive safety loss on harmful prompts with safe and harmful targets. Maximizing [symbol missing in source] with respect to [symbol missing in source] makes harmful directions less useful for task optimization; the paper gives a first-order relation [equation missing in source]. On Alpaca across Llama2-7B, Vicuna-7B, and Qwen2.5-7B, SAP reaches average BLEURT 0.519 and CL 5.54, essentially matching standard SFT, while lowering average Harmful Score to 23.07% versus 37.43% for SFT (Wu et al., 22 May 2025).

SafeMoE is explicitly architecture-aware. It regularizes the KL divergence between routing distributions of the aligned and fine-tuned MoE model on harmful prompts, thereby preserving routing of harmful inputs to safety-critical experts. On OLMoE fine-tuned for SAMSum, vanilla fine-tuning reaches harmfulness score 62.0, whereas SafeMoE lowers it to 5.0 with only 0.4 points of FA loss; on larger MoE models such as gpt-oss, Qwen3 MoE, Phi-3.5-MoE, Llama 4, and Mixtral, SafeMoE keeps harmfulness near aligned-model levels while largely recovering MMLU under strong harmful fine-tuning (Kim et al., 26 Sep 2025).
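A routing-drift term of this kind can be sketched in a few lines. The sketch assumes access to per-layer router logits for a harmful prompt from both the aligned and the fine-tuned model, and it averages the KL over layers; the averaging, the KL direction, and the function names are assumptions rather than details taken from the paper.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over expert logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two routing distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def routing_drift(ref_logits: Sequence[Sequence[float]],
                  tuned_logits: Sequence[Sequence[float]]) -> float:
    """Mean over MoE layers of KL(aligned routing || fine-tuned routing)."""
    per_layer = [kl(softmax(r), softmax(t)) for r, t in zip(ref_logits, tuned_logits)]
    return sum(per_layer) / len(per_layer)

# Two MoE layers with three experts each (toy logits).
ref = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
tuned = [[-1.0, 0.0, 2.0], [0.5, 1.0, 0.5]]
drift = routing_drift(ref, tuned)
```

Added to the task loss with some weight, a term like this discourages fine-tuning from moving harmful prompts away from the experts they were originally routed to.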

RefusalGuard also operates during adaptation but at the level of representation geometry. It freezes the base model, inserts low-rank intervention modules into hidden states, decomposes each targeted residual stream into a refusal subspace and its complement, and penalizes intervention components projected into the refusal subspace through a geometry-preservation term. Under a harmful-adaptation stress test with only 10 synthetic harmful examples, LLaMA-3.1-8B-Instruct with RefusalGuard reaches ASR 0.0085 on AdvBench, 0.0200 on DirectHarm4, and 0.0180 on JailbreakBench, compared with 0.7050, 0.6900, and 0.7500 for LoRA (Asif et al., 3 May 2026).
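The decomposition into a refusal subspace and its complement can be illustrated with a short sketch. It assumes the refusal directions are already given as linearly independent rows and uses the squared norm of the in-subspace component as the penalty; the exact geometry-preservation term is not given above, so this form and the function names are assumptions.

```python
import numpy as np

def split_by_refusal_subspace(delta: np.ndarray, refusal_dirs: np.ndarray):
    """Split an intervention vector into its refusal-subspace part and the rest.

    delta: (d,) change the intervention module adds to a hidden state.
    refusal_dirs: (k, d) linearly independent refusal directions.
    """
    q, _ = np.linalg.qr(refusal_dirs.T)   # (d, k) orthonormal basis of the subspace
    inside = q @ (q.T @ delta)
    return inside, delta - inside

def geometry_penalty(delta: np.ndarray, refusal_dirs: np.ndarray) -> float:
    """Squared norm of the component that would disturb the refusal subspace."""
    inside, _ = split_by_refusal_subspace(delta, refusal_dirs)
    return float(inside @ inside)

refusal_dirs = np.array([[1.0, 0.0, 0.0, 0.0]])   # toy one-dimensional subspace
delta = np.array([3.0, 4.0, 0.0, 0.0])
inside, outside = split_by_refusal_subspace(delta, refusal_dirs)
penalty = geometry_penalty(delta, refusal_dirs)
```

Only the `inside` component is penalized, so the intervention remains free to change the hidden state in the complementary directions.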

A simpler optimization-centric line of SPF argues that specialized safety objectives are not always necessary if optimization remains inside the model’s “safety basin.” On Llama-2-Chat-7B, tuning the learning rate, batch size, and accumulation steps reduces Dolly ASR from 15.96% to 4.62% while slightly improving MT-Bench utility, and an EMA of model parameters lowers ASR further to 2.70% (Kim et al., 17 Aug 2025). This does not eliminate the broader SPF problem, but it shows that optimization design itself can function as a safety-preserving intervention.
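Parameter EMA itself is a one-line update. The sketch below uses a dictionary of floats in place of real tensors, and the decay value is an illustrative assumption rather than the setting used in the cited study.

```python
from typing import Dict

def ema_update(ema_params: Dict[str, float], params: Dict[str, float],
               decay: float = 0.99) -> Dict[str, float]:
    """Blend the running average with the latest parameters."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}

# Start the average at the aligned checkpoint, then track training steps.
ema = {"w": 1.0}
for step_params in ({"w": 0.0}, {"w": 0.0}):
    ema = ema_update(ema, step_params, decay=0.9)
```

Because the average starts at the aligned checkpoint and moves slowly, the evaluated weights lag behind the raw fine-tuned weights, which is one intuition for why EMA can keep the model closer to its original safe region.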

4. Pre-fine-tuning hardening and post-hoc restoration

A second major branch of SPF avoids modifying the main task-training loop and instead hardens the aligned model in advance or repairs it after downstream fine-tuning.

LoX is a pre-emptive, training-free method. It computes the alignment delta between a base and aligned model, extracts a low-rank safety subspace from the top singular directions of that delta by SVD, and extrapolates along that subspace: [equation missing in source]. The idea is to move the aligned model deeper into a flatter, low-ASR basin before any user fine-tuning. On LLaMA-2-7B aligned with 65.6k HH-RLHF examples, LoX reduces post-fine-tuning ASR from 11% to 0% on GSM8K, from 52% to 7% on Dolly, from 32% to 9% on Alpaca, from 84.3% to 42.3% under Identity Shifting, and from 63% to 9% under Pure Bad, with utility costs described as small (Perin et al., 18 Jun 2025).
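One plausible reading of low-rank extrapolation can be sketched with NumPy. The update form used here (aligned weights plus a scaled rank-k approximation of the alignment delta) is an assumption consistent with the description above, not a formula quoted from the paper; `k`, `alpha`, and the function name are hypothetical.

```python
import numpy as np

def lox_extrapolate(w_base: np.ndarray, w_align: np.ndarray,
                    k: int, alpha: float) -> np.ndarray:
    """Push aligned weights further along the top-k directions of the alignment delta.

    Assumed form: W_align + alpha * rank_k(W_align - W_base).
    """
    delta = w_align - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    low_rank = (u[:, :k] * s[:k]) @ vt[:k]   # best rank-k approximation of delta
    return w_align + alpha * low_rank

# Toy layer whose alignment delta is exactly rank one.
w_base = np.zeros((2, 3))
w_align = np.outer([1.0, 2.0], [3.0, 0.0, 1.0])
w_hardened = lox_extrapolate(w_base, w_align, k=1, alpha=0.5)
```

In this toy case the delta lies entirely in the top direction, so the hardened weights are simply 1.5 times the aligned weights; with real checkpoints only the dominant directions of the delta would be amplified.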

SafeMERGE is a post-fine-tuning adapter-merging framework. It trains a utility LoRA and a safety LoRA from the same aligned base, builds a per-layer safety-aligned subspace from the difference between aligned and unaligned weights, computes a cosine similarity

[equation missing in source: per-layer cosine similarity]

and merges only layers whose task update deviates too far from that subspace. On Qwen-2-7B-Instruct fine-tuned on GSM8K, SafeMERGE improves utility from 70.13% to 72.90% while lowering DirectHarm from 25.3% to 8.2% and HexPhi from 16.8% to 7.5%; on Qwen-2-7B-Instruct fine-tuned on PubMedQA, it raises accuracy from 79.6% to 80.3% while reducing DirectHarm from 26.0% to 8.5% and HexPhi from 13.2% to 5.9% (Djuhera et al., 21 Mar 2025).
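The selective-merge logic can be sketched as follows. This is a simplification under stated assumptions: each layer's safety subspace is represented by a single direction matrix, the similarity is a Frobenius cosine between the task update and that direction, and the threshold `tau` and mixing weight `lam` are hypothetical parameters. The paper's exact similarity definition is not reproduced here.

```python
import numpy as np
from typing import Dict

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two matrices viewed as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def selective_merge(task_updates: Dict[str, np.ndarray],
                    safety_updates: Dict[str, np.ndarray],
                    safety_dirs: Dict[str, np.ndarray],
                    tau: float = 0.5, lam: float = 0.5) -> Dict[str, np.ndarray]:
    """Blend in the safety adapter only for layers that drift from the safety direction."""
    merged = {}
    for name, d_task in task_updates.items():
        if cosine(d_task, safety_dirs[name]) < tau:
            merged[name] = lam * d_task + (1.0 - lam) * safety_updates[name]
        else:
            merged[name] = d_task   # layer already close to the safety direction
    return merged

task = {"layer0": np.array([[1.0, 0.0]]), "layer1": np.array([[0.0, 1.0]])}
safety = {"layer0": np.array([[2.0, 0.0]]), "layer1": np.array([[2.0, 0.0]])}
dirs = {"layer0": np.array([[1.0, 0.0]]), "layer1": np.array([[1.0, 0.0]])}
merged = selective_merge(task, safety, dirs)
```

Here `layer0` is left as the pure task update, while `layer1` is blended with the safety adapter because its update is orthogonal to the safety direction.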

Safe Delta is also post-hoc but works at the parameter-delta level. It writes the defended model as

[equation missing in source: Safe Delta defended-model expression]

where [symbol missing in source] is a mask chosen under a safety budget derived from a Hessian on a safety dataset, and [symbol missing in source] is a safety compensation vector. On Dirty Summary fine-tuning of Llama-2-7B-Chat, vanilla fine-tuning reaches utility 0.491, ASR 63.94%, and Harmfulness Score 3.36; Safe Delta preserves utility at 0.489 while lowering ASR to 5.15% and Harmfulness Score to 1.19. On the Math setting, it preserves utility at 0.334 versus 0.337 for vanilla fine-tuning while reducing ASR from 11.52% to 3.33% (Lu et al., 17 May 2025).

LSSF treats safety as a reusable low-rank module. It constructs a safety vector [symbol missing in source], extracts low-rank principal components through layerwise SVD and a safety singular value entropy criterion, and fuses only those components back into downstream-fine-tuned models. On AG’s News with Qwen2.5-7B, LSSF reaches accuracy 0.92 while achieving refusal rates 1.00 on AdvBench, 0.98 on HarmfulQA, and 0.93 on CATQA; on Llama3.1-8B, AG’s News accuracy is 0.85 with refusal rates 1.00 on all three benchmarks (Zhou et al., 19 Jan 2026).

Curvature-Aware Safety Restoration assumes that harmful-data loss geometry is largely preserved after task fine-tuning and uses an influence-style update

[equation missing in source: influence-style restoration update]

over LoRA parameters to raise loss on harmful examples while constraining retain loss on benign data. On LLaMA-3.1-8B Instruct, vanilla LoRA on Dolly yields 25.5% Harmful Response Rate on AdvBench, whereas curvature-aware restoration reduces HRR to 3.0% while keeping Eval loss near the LoRA model and improving some utility benchmarks such as GSM8K and TruthfulQA relative to competing safety-preserving baselines (Bach et al., 22 Nov 2025).

These post-hoc methods collectively show that SPF need not be synonymous with changing the training objective. A plausible implication is that safety-relevant structure often survives fine-tuning in weakened, rotated, or partially suppressed form, so carefully designed post-training edits can reactivate it without discarding downstream capability.

5. Evaluation regimes and recurring empirical patterns

The SPF literature does not rely on a single benchmark family or judge. Safety is operationalized through harmful output rates, refusal rates, attack success rates, Raw Safe Rate, Jailbreak Safe Rate, Harmful Score, and Harmfulness Rate. Common benchmark suites include DirectHarm and HexPhi for harmful prompt evaluation; AdvBench, HEx-PHI, HarmfulQA, CATQA, HarmBench, StrongReject, and JailbreakBench; and MoE-specific evaluations built on JailbreakBench and HEx-PHI (Djuhera et al., 21 Mar 2025, Liu et al., 24 Mar 2025, Goel et al., 19 Feb 2026, Kim et al., 26 Sep 2025, Wang et al., 8 Mar 2026). Judging is similarly heterogeneous: Llama-Guard-3-8B, Llama-Guard-4-12B, GPT-4, GPT-4o mini, gpt-oss-20B, keyword-based refusal detectors, and BeaverTails moderation all appear in the surveyed papers (Djuhera et al., 21 Mar 2025, Goel et al., 19 Feb 2026, Kim et al., 17 Aug 2025, Wu et al., 22 May 2025).

Utility metrics are likewise task-dependent. GSM8K and PubMedQA use accuracy or exact match; SAMSum uses ROUGE-1; Dirty Summary uses ROUGE-1 F1; instruction-following studies report BLEURT, MT-Bench, Alpaca-Eval, or OpenOrca metrics; broader knowledge evaluations include MMLU, ARC-Challenge, BBH, TruthfulQA, and IFEval (Djuhera et al., 21 Mar 2025, Lu et al., 17 May 2025, Kim et al., 17 Aug 2025, Zhou et al., 19 Jan 2026, Asif et al., 3 May 2026). This heterogeneity matters because two SPF methods may both “improve safety” while optimizing very different notions of failure.

Despite that metric diversity, a few empirical regularities recur. First, benign fine-tuning often degrades safety even without harmful data. SafeMERGE documents large DirectHarm and HexPhi increases after GSM8K fine-tuning on Llama-2-7B-Chat and Qwen-2-7B-Instruct (Djuhera et al., 21 Mar 2025); LookAhead Tuning records severe RSR and JSR drops under vanilla fine-tuning on GSM8K and SAMSum (Liu et al., 24 Mar 2025); adaptive regularization, PACT, and RefusalGuard all report safety erosion after benign training unless additional constraints are applied (Goel et al., 19 Feb 2026, Wang et al., 8 Mar 2026, Asif et al., 3 May 2026). Second, several methods recover safety to near-original levels while preserving most or all of the downstream gain. SafeMERGE sometimes improves task performance beyond the vanilla fine-tuned model (Djuhera et al., 21 Mar 2025), LoX preserves or improves helpfulness while reducing ASR (Perin et al., 18 Jun 2025), Safe Delta keeps ROUGE or GSM8K utility essentially unchanged while restoring safety (Lu et al., 17 May 2025), and PACT preserves accuracy within a narrow margin of SFT while substantially reducing ASR (Wang et al., 8 Mar 2026). Third, architecture-aware defenses matter. SafeMoE shows that dense-model defenses are less effective on MoE architectures because they do not explicitly preserve routing to safety-critical experts (Kim et al., 26 Sep 2025).

One further pattern is that safety restoration can outperform simpler global constraints by exploiting finer structure. Projection onto a safety subspace, uniform KL regularization, or full-model averaging often imposes a larger utility tax than selective layer merging, token-local constraints, routing-aligned regularization, or representation-geometry preservation. This suggests that SPF is often easier when the intervention matches the granularity of the failure mode.

6. Limitations, misconceptions, and open directions

A recurring limitation across SPF methods is dependence on artifacts that may be unavailable in some deployments. Many methods require a matched base/aligned checkpoint pair, as in LoX, SafeMERGE, and LSSF; others need a dedicated safety dataset or harmful prompt set, as in Safe Delta, GuardSpace, SafeMoE, SAP, and adaptive regularization; others still assume access to hidden states, routing logits, or the ability to insert representation-level modules (Perin et al., 18 Jun 2025, Djuhera et al., 21 Mar 2025, Zhou et al., 19 Jan 2026, Zhang et al., 16 Oct 2025, Kim et al., 26 Sep 2025). Closed API models, proprietary alignment pipelines, or nonstandard architectures can therefore block direct application.

Another limitation is that most methods use simplified structural assumptions. SafeMERGE’s layerwise safety-aligned subspace is rank-1 per layer (Djuhera et al., 21 Mar 2025); LoX assumes that safety alignment is concentrated in a handful of singular directions (Perin et al., 18 Jun 2025); LSSF assumes a low-rank safety subspace that is stable and separable from general capability (Zhou et al., 19 Jan 2026); GuardSpace assumes that harmful activations admit a useful null-space decomposition (Zhang et al., 16 Oct 2025). These assumptions are empirically productive, but they remain approximations.

Evaluation coverage is also incomplete. Several papers note that their safety metrics are limited to specific harmful-prompt suites and judge models, leaving broader issues such as subtle persuasion, misinformation, bias, or over-refusal underexplored (Djuhera et al., 21 Mar 2025, Goel et al., 19 Feb 2026, Bach et al., 22 Nov 2025, Asif et al., 3 May 2026). PACT, for example, stabilizes refusal-token confidence, but a plausible implication is that a token-local refusal mechanism may not exhaust all forms of safe behavior (Wang et al., 8 Mar 2026). RefusalGuard and mechanistic studies centered on refusal geometry are explicit about focusing on refusal rather than the full space of safety properties (Asif et al., 3 May 2026, Jain et al., 2024).

A further misconception in the literature is that benign fine-tuning necessarily destroys safety through an irreducible utility–safety trade-off. Multiple papers do document substantial benign-fine-tuning degradation, but the optimization study argues that poor hyperparameter choices, rather than an inherent conflict, often account for much of the effect; conservative optimization and parameter EMA reduce ASR markedly while maintaining utility (Kim et al., 17 Aug 2025). The most defensible synthesis is therefore narrower: safety degradation under adaptation is common and often severe, but it is neither uniform across optimization regimes nor resistant to targeted intervention.

Open directions are correspondingly broad. The surveyed papers point toward richer safety subspaces, multi-dimensional safety geometry, cross-model transfer of safety modules, online monitoring of safety drift during training, multimodal and MoE extensions, and combinations of pre-fine-tuning hardening, in-training constraints, and post-hoc repair (Perin et al., 18 Jun 2025, Zhou et al., 19 Jan 2026, Kim et al., 26 Sep 2025, Asif et al., 3 May 2026). Taken together, this suggests that SPF is evolving from a narrow alignment-retention problem into a more general study of how safety is represented, perturbed, and recoverable across optimization, token generation, routing, and activation geometry.
