Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
Abstract: LLMs often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA consistently improves the safety--utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT$\rightarrow$DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53\% to 3.03\% and IFEval from 51.94\% to 63.96\%. Our source code is available at \href{https://github.com/SunGL001/OGPSA}{OGPSA}.
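The projected update described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the function name `ogpsa_step` is hypothetical, and the capability subspace is estimated here via SVD of stacked reference gradients for brevity (the paper builds its basis with Gram–Schmidt):

```python
import numpy as np

def ogpsa_step(theta, safety_grad, ref_grads, rank, lr):
    """One illustrative OGPSA-style update (hypothetical sketch).

    theta:       flattened parameter vector, shape (d,)
    safety_grad: gradient of the safety objective, shape (d,)
    ref_grads:   list of gradients on a small general-capability
                 reference set, each of shape (d,)
    rank:        dimension M' of the protected capability subspace
    lr:          learning rate
    """
    # Estimate a low-rank capability subspace from reference gradients.
    G = np.stack(ref_grads, axis=1)                 # (d, n_ref)
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    U = U[:, :rank]                                 # top-M' directions

    # Project the safety gradient onto the orthogonal complement,
    # so the update is (to first order) neutral on reference losses.
    g_proj = safety_grad - U @ (U.T @ safety_grad)
    return theta - lr * g_proj
```

Because the applied update lies in the orthogonal complement of the protected subspace, its inner product with every reference gradient is zero up to floating-point error, which is the first-order sense of "minimally perturbing" general capabilities.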
Knowledge gaps, limitations, and open questions
Below is a consolidated list of the paper’s unresolved issues and open directions that future work could concretely address:
- Limited theoretical guarantees: OGPSA’s protection of capabilities is justified only via first-order Taylor approximations and a local “steepest feasible descent” result; there is no second-order, global, or cumulative guarantee on the alignment tax over long training horizons.
- Validity of the low-dimensional capability subspace assumption: The method relies on general capabilities being captured by a small, stable gradient subspace; this is not empirically validated across diverse tasks, models, and training stages (including whether intrinsic dimensionality remains low as optimization proceeds).
- Subspace construction methodology: The choice of reference datasets D_ref, their sizes, domain coverage (e.g., reasoning, coding, multilingual, math), and the loss functions used to compute g^(i) are ad hoc; a principled, task-coverage-aware protocol for building S_gen(θ) is missing.
- Sensitivity to hyperparameters: Key hyperparameters (rank M', refresh period K, Gram–Schmidt threshold δ) lack systematic sensitivity studies or guidance for robust selection across pipelines and models.
- Numerical stability in large-scale settings: Orthogonal basis construction via Gram–Schmidt on gradients in high-dimensional parameter spaces may be numerically fragile (especially under mixed precision); stability relative to alternative bases (e.g., QR/SVD, randomized projections) is not evaluated.
- Memory and compute overhead quantification: The paper claims negligible overhead but provides no detailed profiling of extra backward passes, storage of U ∈ ℝ^{d×M'}, and projection costs at 7B+, 70B+, or MoE scales, nor the trade-off between M' and runtime.
- Optimizer interaction: Compatibility with common optimizers (AdamW, momentum, weight decay), optimizer states, gradient clipping, and mixed precision is not analyzed; projection may interact nontrivially with momentum/EMA updates.
- RLHF integration beyond DPO: Despite “plug-and-play” claims, OGPSA is not demonstrated with PPO-style RLHF, KL-regularized objectives, or off-policy updates; how projection interacts with KL penalties and reward models remains unexplored.
- Safety-utility trade-offs under adversarial pressure: Robustness to jailbreaks, prompt attacks, and adversarial distribution shifts (e.g., SaTML-style tests) is not assessed; OGPSA’s effect on attack surface is unknown.
- Multiturn, long-context, and tool-use scenarios: Evaluation focuses on single-turn benchmarks; effects on dialog dynamics, multi-step reasoning, retrieval/tool-use pipelines, and planning are untested.
- Domain breadth and multilinguality: Benchmarks largely center on a narrow set (e.g., SimpleQA, IFEval, HHH, MMLU); generalization to code generation, math problem solving, non-English languages, and domain-specific tasks is unverified.
- Metrics reliability and interpretation: Extremely low SimpleQA percentages (e.g., 0.53–3.33%) raise questions about metric definition, dataset difficulty, or evaluation protocol; reproducible, calibrated measures of “truthfulness” and “helpfulness” need clarification.
- Capability–safety entanglement: Safety-relevant gradients may legitimately overlap with capability subspaces; hard orthogonal projection could suppress needed safety learning in shared directions. The paper does not quantify or mitigate this risk.
- Adaptive/partial projections: OGPSA enforces strict orthogonality; more nuanced schemes (e.g., soft constraints, per-dimension weighting, trust-region projections) to balance safety gain vs. capability retention are proposed but not developed or compared.
- Dynamic subspace tracking: While periodic refreshes are ablated, principled strategies for when and how to update S_gen(θ) (e.g., curvature- or drift-triggered updates) are absent, as is analysis of lag-induced misalignment during fast-changing optimization.
- Second-order or Fisher-based subspaces: Using Fisher information, Hessian eigenvectors, or influence functions might yield higher-fidelity capability manifolds; OGPSA does not compare against or integrate second-order information.
- Continual learning baselines: The paper does not empirically compare against CL methods adapted to LLM alignment (e.g., EWC/LwF/DER/GEM/TRGP, parameter-efficient CL), leaving unclear whether OGPSA is superior in heterogeneous-objective settings.
- Composition with PEFT and merging techniques: Interactions with adapters, prefix-tuning, IA3, or more advanced merging (e.g., different interpolation schedules) are undeveloped; potential synergies or conflicts are untested.
- Subspace poisoning risks: If D_ref is biased or adversarially manipulated, S_gen(θ) could protect undesirable behaviors or block safety learning; defenses against data poisoning in reference gradient estimation are not discussed.
- Deployment under drift: How OGPSA behaves with continuous post-deployment updates, online learning, or changing safety policies (organizational or jurisdictional) is unknown; protocols for maintaining S_gen(θ) over time are missing.
- Fairness and bias impacts: Although "Stereotype" is included, a comprehensive fairness audit (across demographics, languages, and domains) of OGPSA's effect is absent; potential trade-offs between safety and fairness are unexamined.
- Convergence behavior and training dynamics: Effects on safety loss convergence rates, stability, and potential oscillations due to projection are not measured; learning curves and variance across seeds are not reported.
- Reproducibility and implementation details: Precise training settings (optimizer, schedules, batch sizes, precision), code accessibility, and full hyperparameter grids are incomplete or unspecified, impeding independent validation.
- Upper and lower bounds on data efficiency: While small reference sets work in reported setups, minimum viable sample sizes and failure regimes (tasks where more data is required) are not mapped.
- Scaling beyond 7B and to MoE: Empirical results stop at 7B; behavior at frontier scales (70B+, MoE architectures) and in distributed training is unknown, especially w.r.t. communication overhead for projecting high-dimensional gradients.
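To make the Gram–Schmidt fragility concern above concrete, here is a minimal sketch of threshold-based classical Gram–Schmidt as one might implement the basis-construction step (the function name and the `delta` default are assumptions, not from the paper). On ill-conditioned gradient matrices, modified Gram–Schmidt or a Householder QR (`np.linalg.qr`) is generally more robust, which is exactly the unevaluated comparison the bullet points out:

```python
import numpy as np

def gram_schmidt(G, delta=1e-8):
    """Classical Gram-Schmidt with a drop threshold (illustrative).

    Columns of G are candidate gradient directions; a column whose
    residual norm after orthogonalization falls below `delta` is
    treated as linearly dependent and discarded.
    """
    basis = []
    for g in G.T:
        v = g.astype(float).copy()
        for u in basis:
            v -= (u @ v) * u          # remove components along the basis
        n = np.linalg.norm(v)
        if n > delta:                 # keep only non-degenerate directions
            basis.append(v / n)
    return np.stack(basis, axis=1)
```

On well-conditioned inputs this yields an orthonormal basis, but classical Gram–Schmidt loses orthogonality roughly in proportion to the condition number of G, a known failure mode that mixed-precision training would amplify.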
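As one illustration of the adaptive/partial-projection direction raised above, a soft projection interpolates between leaving the safety gradient untouched and the hard orthogonal projection. This sketch is an assumption for exposition (the function name and the scalar `lam` are not from the paper); it shows how a single knob could trade safety gain in shared directions against capability retention:

```python
import numpy as np

def soft_project(g, U, lam):
    """Soft orthogonal projection (illustrative sketch).

    g:   safety gradient, shape (d,)
    U:   orthonormal basis of the protected subspace, shape (d, M')
    lam: strength in [0, 1]; lam=1.0 recovers the hard OGPSA-style
         projection, lam=0.0 applies the raw safety gradient.
    """
    return g - lam * (U @ (U.T @ g))
```

The component of the update inside the protected subspace is scaled by (1 - lam), so lam could in principle be scheduled or set per-layer, e.g., relaxed when safety loss stops improving under the hard constraint.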