Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Published 8 Feb 2026 in cs.LG and cs.CL (arXiv:2602.07892v1)

Abstract: LLMs often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety–utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
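
The core mechanism lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the two steps the abstract describes: estimate a low-rank capability subspace from reference-set gradients, then project each safety gradient onto its orthogonal complement. The function names, the use of SVD for the subspace estimate, and the rank/shape conventions are our assumptions for illustration, not the authors' implementation (which is at the GitHub link above).

```python
import torch

def capability_subspace(ref_grads: list[torch.Tensor], rank: int) -> torch.Tensor:
    """Estimate a low-rank capability subspace from reference gradients.

    ref_grads: flattened gradients g^(i), each of shape (d,), computed on a
    small general-capability reference set D_ref.
    Returns U of shape (d, rank) with orthonormal columns.
    (Assumption: we keep the top singular directions via SVD; the paper
    describes a Gram-Schmidt construction with a drop threshold delta.)
    """
    G = torch.stack(ref_grads, dim=1)                   # (d, n_ref)
    U, _, _ = torch.linalg.svd(G, full_matrices=False)  # left singular vectors
    return U[:, :rank]

def project_out(safety_grad: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Project the safety gradient onto the orthogonal complement of span(U):
    g' = g - U (U^T g), so <u, g'> = 0 for every protected direction u."""
    return safety_grad - U @ (U.T @ safety_grad)
```

In a training loop, U would be refreshed every K steps and project_out applied to the flattened safety gradient just before the optimizer update; to first order, the update then leaves the reference loss unchanged.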

Knowledge Gaps

Below is a consolidated list of the paper’s unresolved issues and open directions that future work could concretely address:

  • Limited theoretical guarantees: OGPSA’s protection of capabilities is justified only via first-order Taylor approximations and a local “steepest feasible descent” result; there is no second-order, global, or cumulative guarantee on the alignment tax A_tax over long training horizons (the first-order argument is written out after this list).
  • Validity of the low-dimensional capability subspace assumption: The method relies on general capabilities being captured by a small, stable gradient subspace; this is not empirically validated across diverse tasks, models, and training stages (including whether intrinsic dimensionality remains low as optimization proceeds).
  • Subspace construction methodology: The choice of reference datasets D_ref, their sizes, domain coverage (e.g., reasoning, coding, multilingual, math), and the loss functions used to compute g^(i) are ad hoc; a principled, task-coverage-aware protocol for building S_gen(θ) is missing.
  • Sensitivity to hyperparameters: Key hyperparameters (rank M', refresh period K, Gram–Schmidt threshold δ) lack systematic sensitivity studies or guidance for robust selection across pipelines and models.
  • Numerical stability in large-scale settings: Orthogonal basis construction via Gram–Schmidt on gradients in d ≫ 10^9 dimensions may be numerically fragile (especially under mixed precision); stability versus alternative bases (e.g., QR/SVD, randomized projections) is not evaluated (a small comparison sketch follows this list).
  • Memory and compute overhead quantification: The paper claims negligible overhead but provides no detailed profiling of extra backward passes, storage of U∈ℝ^{d×M'}, and projection costs at 7B+, 70B+, or MoE scales, nor the trade-off between M' and runtime.
  • Optimizer interaction: Compatibility with common optimizers (AdamW, momentum, weight decay), optimizer states, gradient clipping, and mixed precision is not analyzed; projection may interact nontrivially with momentum/EMA updates (see the hook-point sketch after this list).
  • RLHF integration beyond DPO: Despite “plug-and-play” claims, OGPSA is not demonstrated with PPO-style RLHF, KL-regularized objectives, or off-policy updates; how projection interacts with KL penalties and reward models remains unexplored.
  • Safety-utility trade-offs under adversarial pressure: Robustness to jailbreaks, prompt attacks, and adversarial distribution shifts (e.g., SaTML-style tests) is not assessed; OGPSA’s effect on attack surface is unknown.
  • Multiturn, long-context, and tool-use scenarios: Evaluation focuses on single-turn benchmarks; effects on dialog dynamics, multi-step reasoning, retrieval/tool-use pipelines, and planning are untested.
  • Domain breadth and multilinguality: Benchmarks largely center on a narrow set (e.g., SimpleQA, IFEval, HHH, MMLU); generalization to code generation, math problem solving, non-English languages, and domain-specific tasks is unverified.
  • Metrics reliability and interpretation: Extremely low SimpleQA percentages (e.g., 0.53–3.33%) raise questions about metric definition, dataset difficulty, or evaluation protocol; reproducible, calibrated measures of “truthfulness” and “helpfulness” need clarification.
  • Capability–safety entanglement: Safety-relevant gradients may legitimately overlap with capability subspaces; hard orthogonal projection could suppress needed safety learning in shared directions. The paper does not quantify or mitigate this risk.
  • Adaptive/partial projections: OGPSA enforces strict orthogonality; more nuanced schemes (e.g., soft constraints, per-dimension weighting, trust-region projections) that would trade safety gains against capability retention are suggested but not developed or compared (a minimal soft-projection sketch follows this list).
  • Dynamic subspace tracking: While periodic refreshes are ablated, principled strategies for when/how to update S_gen(θ) (e.g., curvature- or drift-triggered updates) are absent, as is analysis of lag-induced misalignment during fast-changing optimization.
  • Second-order or Fisher-based subspaces: Using Fisher information, Hessian eigenvectors, or influence functions might yield higher-fidelity capability manifolds; OGPSA does not compare against or integrate second-order information.
  • Continual learning baselines: The paper does not empirically compare against CL methods adapted to LLM alignment (e.g., EWC/LwF/DER/GEM/TRGP, parameter-efficient CL), leaving unclear whether OGPSA is superior in heterogeneous-objective settings.
  • Composition with PEFT and merging techniques: Interactions with adapters, prefix-tuning, IA3, or more advanced merging (e.g., different interpolation schedules) are undeveloped; potential synergies or conflicts are untested.
  • Subspace poisoning risks: If D_ref is biased or adversarially manipulated, S_gen(θ) could protect undesirable behaviors or block safety learning; defenses against data poisoning in reference gradient estimation are not discussed.
  • Deployment under drift: How OGPSA behaves with continuous post-deployment updates, online learning, or changing safety policies (organizational or jurisdictional) is unknown; protocols for maintaining S_gen(θ) over time are missing.
  • Fairness and bias impacts: Although “Stereotype” is included, a comprehensive fairness audit (across demographics, languages, domains) of OGPSA’s effect is absent; potential trade-offs between safety and fairness are unexamined.
  • Convergence behavior and training dynamics: Effects on safety loss convergence rates, stability, and potential oscillations due to projection are not measured; learning curves and variance across seeds are not reported.
  • Reproducibility and implementation details: Precise training settings (optimizer, schedules, batch sizes, precision), code accessibility, and full hyperparameter grids are incomplete or unspecified, impeding independent validation.
  • Upper and lower bounds on data efficiency: While small reference sets work in reported setups, minimum viable sample sizes and failure regimes (tasks where more data is required) are not mapped.
  • Scaling beyond 7B and to MoE: Empirical results stop at 7B; behavior at frontier scales (70B+, MoE architectures) and in distributed training is unknown, especially w.r.t. communication overhead for projecting high-dimensional gradients.
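
For reference, the first-order argument alluded to in the “limited theoretical guarantees” item can be written out in a few lines (our paraphrase of the standard Taylor-expansion reasoning, using the notation above; not copied from the paper). It also makes the limitation visible: the cancellation holds only for gradients inside span(U) and only to first order.

```latex
% First-order effect of a projected safety update on the capability loss.
% Assume the capability gradient lies in the protected subspace: g_gen = U c.
\begin{aligned}
g' &= (I - UU^{\top})\, g_{\mathrm{safe}} \\
L_{\mathrm{gen}}(\theta - \eta g')
  &\approx L_{\mathrm{gen}}(\theta) - \eta\,\langle g_{\mathrm{gen}},\, g' \rangle \\
  &= L_{\mathrm{gen}}(\theta) - \eta\, c^{\top} U^{\top} (I - UU^{\top})\, g_{\mathrm{safe}}
   = L_{\mathrm{gen}}(\theta),
\end{aligned}
% since U^T (I - U U^T) = 0. Nothing here bounds second-order terms or the
% accumulated drift A_tax over many updates and subspace refreshes.
```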
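
The numerical-stability concern is cheap to probe empirically. The following self-contained snippet (a hypothetical test harness, not from the paper) compares classical Gram–Schmidt against Householder QR (torch.linalg.qr) on nearly collinear vectors, a regime reference gradients can fall into, by measuring the orthogonality defect of the resulting basis.

```python
import torch

def classical_gram_schmidt(G: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Orthonormalize the columns of G by classical Gram-Schmidt, dropping
    columns whose residual norm falls below eps (the paper's delta)."""
    basis = []
    for j in range(G.shape[1]):
        v = G[:, j].clone()
        for u in basis:
            v = v - (u @ v) * u      # remove component along earlier vectors
        if v.norm() > eps:
            basis.append(v / v.norm())
    return torch.stack(basis, dim=1)

torch.manual_seed(0)
d, n = 4096, 16
shared = torch.randn(d, 1)                 # one dominant shared direction
G = shared + 1e-4 * torch.randn(d, n)      # nearly collinear columns

U_gs = classical_gram_schmidt(G)
U_qr, _ = torch.linalg.qr(G)               # Householder QR, backward stable

def ortho_defect(U: torch.Tensor) -> float:
    return (U.T @ U - torch.eye(U.shape[1])).abs().max().item()

# Classical Gram-Schmidt typically loses orthogonality as conditioning
# worsens; Householder QR stays near machine precision.
print(f"Gram-Schmidt defect: {ortho_defect(U_gs):.2e}")
print(f"QR defect:           {ortho_defect(U_qr):.2e}")
```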
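
On the optimizer-interaction item, the natural hook point is to project the raw gradient before the optimizer consumes it, as in the sketch below (hypothetical glue code, not the authors'). Even then, AdamW's first/second-moment state is accumulated from gradients projected against earlier bases, so after a subspace refresh the realized parameter update is not guaranteed to stay orthogonal to the current span(U); gradient clipping and weight decay add further interactions.

```python
import torch

def ogpsa_step(model: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               loss: torch.Tensor,
               U: torch.Tensor) -> None:
    """One safety-training step with the gradient projected before the
    optimizer update (assumed hook point; see caveats in the text)."""
    optimizer.zero_grad()
    loss.backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])   # flatten to one vector
    flat = flat - U @ (U.T @ flat)                     # hard orthogonal projection
    offset = 0
    for g in grads:                                    # scatter back in place
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n
    optimizer.step()   # AdamW/momentum then mixes in unprojected history
```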
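
Finally, the adaptive/partial-projection direction has an obvious starting point: interpolate between the raw and fully projected gradient. The parameter lam and the name soft_project below are illustrative choices of ours, not from the paper.

```python
import torch

def soft_project(g: torch.Tensor, U: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Soft orthogonal projection: g' = g - lam * U (U^T g).
    lam = 1.0 recovers the hard projection described in the paper;
    lam = 0.0 leaves the update unconstrained; intermediate values trade
    capability protection against safety learning in shared directions."""
    return g - lam * (U @ (U.T @ g))
```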
