Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
Abstract: LLMs often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA consistently improves the safety--utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT$\rightarrow$DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53\% to 3.03\% and IFEval from 51.94\% to 63.96\%. Our source code is available at \href{https://github.com/SunGL001/OGPSA}{OGPSA}.
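The projected update described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the function name `ogpsa_step` is hypothetical, and the capability subspace is estimated here via SVD of stacked reference gradients for brevity (the paper builds its basis with Gram–Schmidt):

```python
import numpy as np

def ogpsa_step(theta, safety_grad, ref_grads, rank, lr):
    """One illustrative OGPSA-style update (hypothetical sketch).

    theta:       flattened parameter vector, shape (d,)
    safety_grad: gradient of the safety objective, shape (d,)
    ref_grads:   list of gradients on a small general-capability
                 reference set, each of shape (d,)
    rank:        dimension M' of the protected capability subspace
    lr:          learning rate
    """
    # Estimate a low-rank capability subspace from reference gradients.
    G = np.stack(ref_grads, axis=1)                 # (d, n_ref)
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    U = U[:, :rank]                                 # top-M' directions

    # Project the safety gradient onto the orthogonal complement,
    # so the update is (to first order) neutral on reference losses.
    g_proj = safety_grad - U @ (U.T @ safety_grad)
    return theta - lr * g_proj
```

Because the applied update lies in the orthogonal complement of the protected subspace, its inner product with every reference gradient is zero up to floating-point error, which is the first-order sense of "minimally perturbing" general capabilities.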
Knowledge gaps, limitations, and open questions
Below is a consolidated list of the paper’s unresolved issues and open directions that future work could concretely address:
- Limited theoretical guarantees: OGPSA’s protection of capabilities is justified only via first-order Taylor approximations and a local “steepest feasible descent” result; there is no second-order, global, or cumulative guarantee on the alignment tax over long training horizons.
- Validity of the low-dimensional capability subspace assumption: The method relies on general capabilities being captured by a small, stable gradient subspace; this is not empirically validated across diverse tasks, models, and training stages (including whether intrinsic dimensionality remains low as optimization proceeds).
- Subspace construction methodology: The choice of reference datasets D_ref, their sizes, domain coverage (e.g., reasoning, coding, multilingual, math), and the loss functions used to compute g^(i) are ad hoc; a principled, task-coverage-aware protocol for building S_gen(θ) is missing.
- Sensitivity to hyperparameters: Key hyperparameters (rank M', refresh period K, Gram–Schmidt threshold δ) lack systematic sensitivity studies or guidance for robust selection across pipelines and models.
- Numerical stability in large-scale settings: Orthogonal basis construction via Gram–Schmidt on gradients in high-dimensional parameter spaces may be numerically fragile (especially under mixed precision); stability relative to alternative bases (e.g., QR/SVD, randomized projections) is not evaluated.
- Memory and compute overhead quantification: The paper claims negligible overhead but provides no detailed profiling of extra backward passes, storage of U ∈ ℝ^{d×M'}, and projection costs at 7B+, 70B+, or MoE scales, nor the trade-off between M' and runtime.
- Optimizer interaction: Compatibility with common optimizers (AdamW, momentum, weight decay), optimizer states, gradient clipping, and mixed precision is not analyzed; projection may interact nontrivially with momentum/EMA updates.
- RLHF integration beyond DPO: Despite “plug-and-play” claims, OGPSA is not demonstrated with PPO-style RLHF, KL-regularized objectives, or off-policy updates; how projection interacts with KL penalties and reward models remains unexplored.
- Safety-utility trade-offs under adversarial pressure: Robustness to jailbreaks, prompt attacks, and adversarial distribution shifts (e.g., SaTML-style tests) is not assessed; OGPSA’s effect on attack surface is unknown.
- Multiturn, long-context, and tool-use scenarios: Evaluation focuses on single-turn benchmarks; effects on dialog dynamics, multi-step reasoning, retrieval/tool-use pipelines, and planning are untested.
- Domain breadth and multilinguality: Benchmarks largely center on a narrow set (e.g., SimpleQA, IFEval, HHH, MMLU); generalization to code generation, math problem solving, non-English languages, and domain-specific tasks is unverified.
- Metrics reliability and interpretation: Extremely low SimpleQA percentages (e.g., 0.53–3.33%) raise questions about metric definition, dataset difficulty, or evaluation protocol; reproducible, calibrated measures of “truthfulness” and “helpfulness” need clarification.
- Capability–safety entanglement: Safety-relevant gradients may legitimately overlap with capability subspaces; hard orthogonal projection could suppress needed safety learning in shared directions. The paper does not quantify or mitigate this risk.
- Adaptive/partial projections: OGPSA enforces strict orthogonality; more nuanced schemes (e.g., soft constraints, per-dimension weighting, trust-region projections) to balance safety gain vs. capability retention are proposed but not developed or compared.
- Dynamic subspace tracking: While periodic refreshes are ablated, principled strategies for when and how to update S_gen(θ) (e.g., curvature- or drift-triggered updates) are absent, as is analysis of lag-induced misalignment during fast-changing optimization.
- Second-order or Fisher-based subspaces: Using Fisher information, Hessian eigenvectors, or influence functions might yield higher-fidelity capability manifolds; OGPSA does not compare against or integrate second-order information.
- Continual learning baselines: The paper does not empirically compare against CL methods adapted to LLM alignment (e.g., EWC/LwF/DER/GEM/TRGP, parameter-efficient CL), leaving unclear whether OGPSA is superior in heterogeneous-objective settings.
- Composition with PEFT and merging techniques: Interactions with adapters, prefix-tuning, IA3, or more advanced merging (e.g., different interpolation schedules) are undeveloped; potential synergies or conflicts are untested.
- Subspace poisoning risks: If D_ref is biased or adversarially manipulated, S_gen(θ) could protect undesirable behaviors or block safety learning; defenses against data poisoning in reference gradient estimation are not discussed.
- Deployment under drift: How OGPSA behaves with continuous post-deployment updates, online learning, or changing safety policies (organizational or jurisdictional) is unknown; protocols for maintaining S_gen(θ) over time are missing.
- Fairness and bias impacts: Although "Stereotype" is included, a comprehensive fairness audit (across demographics, languages, and domains) of OGPSA's effect is absent; potential trade-offs between safety and fairness are unexamined.
- Convergence behavior and training dynamics: Effects on safety loss convergence rates, stability, and potential oscillations due to projection are not measured; learning curves and variance across seeds are not reported.
- Reproducibility and implementation details: Precise training settings (optimizer, schedules, batch sizes, precision), code accessibility, and full hyperparameter grids are incomplete or unspecified, impeding independent validation.
- Upper and lower bounds on data efficiency: While small reference sets work in reported setups, minimum viable sample sizes and failure regimes (tasks where more data is required) are not mapped.
- Scaling beyond 7B and to MoE: Empirical results stop at 7B; behavior at frontier scales (70B+, MoE architectures) and in distributed training is unknown, especially w.r.t. communication overhead for projecting high-dimensional gradients.
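To make the Gram–Schmidt fragility concern above concrete, here is a minimal sketch of threshold-based classical Gram–Schmidt as one might implement the basis-construction step (the function name and the `delta` default are assumptions, not from the paper). On ill-conditioned gradient matrices, modified Gram–Schmidt or a Householder QR (`np.linalg.qr`) is generally more robust, which is exactly the unevaluated comparison the bullet points out:

```python
import numpy as np

def gram_schmidt(G, delta=1e-8):
    """Classical Gram-Schmidt with a drop threshold (illustrative).

    Columns of G are candidate gradient directions; a column whose
    residual norm after orthogonalization falls below `delta` is
    treated as linearly dependent and discarded.
    """
    basis = []
    for g in G.T:
        v = g.astype(float).copy()
        for u in basis:
            v -= (u @ v) * u          # remove components along the basis
        n = np.linalg.norm(v)
        if n > delta:                 # keep only non-degenerate directions
            basis.append(v / n)
    return np.stack(basis, axis=1)
```

On well-conditioned inputs this yields an orthonormal basis, but classical Gram–Schmidt loses orthogonality roughly in proportion to the condition number of G, a known failure mode that mixed-precision training would amplify.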
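As one illustration of the adaptive/partial-projection direction raised above, a soft projection interpolates between leaving the safety gradient untouched and the hard orthogonal projection. This sketch is an assumption for exposition (the function name and the scalar `lam` are not from the paper); it shows how a single knob could trade safety gain in shared directions against capability retention:

```python
import numpy as np

def soft_project(g, U, lam):
    """Soft orthogonal projection (illustrative sketch).

    g:   safety gradient, shape (d,)
    U:   orthonormal basis of the protected subspace, shape (d, M')
    lam: strength in [0, 1]; lam=1.0 recovers the hard OGPSA-style
         projection, lam=0.0 applies the raw safety gradient.
    """
    return g - lam * (U @ (U.T @ g))
```

The component of the update inside the protected subspace is scaled by (1 - lam), so lam could in principle be scheduled or set per-layer, e.g., relaxed when safety loss stops improving under the hard constraint.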