The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Published 17 Feb 2026 in cs.LG and cs.AI | (2602.15799v1)

Abstract: Fine-tuning aligned LLMs on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods, and we hope will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.

Summary

The paper introduces the Alignment Instability Condition (AIC), showing that fine-tuning leads to safety degradation via curvature-induced drift into alignment-sensitive subspaces.
Empirical analyses reveal that overlap between fine-tuning updates and low-rank FIM eigenvectors correlates strongly with increased harmfulness scores across diverse datasets.
The study provides actionable insights, recommending curvature-aware fine-tuning and continuous monitoring of projection dynamics to mitigate alignment collapse risks.

Alignment Collapse Geometry in Fine-Tuned LLMs

Overview and Motivation

The paper "The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety" (2602.15799) investigates the geometric underpinnings of safety degradation in LLMs following task-specific fine-tuning. The prevailing intuition holds that benign task adaptation should not undermine unrelated alignment guardrails, yet empirical evidence repeatedly shows severe safety failures even when fine-tuning datasets are non-adversarial. The authors resolve this paradox by showing that, in high-dimensional parameter spaces, orthogonality between update directions and alignment-sensitive subspaces is unstable under gradient descent dynamics. Specifically, the second-order curvature of the fine-tuning loss landscape pushes the trajectory into alignment-critical directions, so safety degrades even though the initial updates avoid them.

Formalizing the Alignment Instability Condition

The central theoretical contribution is the Alignment Instability Condition (AIC), which encapsulates three geometric properties:

Low-Rank Sensitivity: Alignment for a given skill is concentrated in a small, high-curvature subspace, defined by the leading eigenvectors of the skill-specific Fisher Information Matrix (FIM).
Initial Orthogonality: Fine-tuning gradients for unrelated tasks possess negligible first-order projection onto the alignment-sensitive subspace, seemingly ensuring safety.
Curvature Coupling: The second-order terms in the fine-tuning trajectory—quantified by the directional derivative of the gradient—introduce nontrivial acceleration into the sensitive subspace, instigating alignment drift regardless of initial orthogonality.
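
The difference between the second and third properties can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the paper's construction: the two-parameter toy loss `L(x, y) = -g*x - c*x*y`, the helper names `grad_toy` and `acceleration`, and the choice of `y` as the alignment-sensitive direction. Under gradient flow the velocity is the negative gradient and the acceleration is the Hessian applied to the gradient, which the sketch approximates by a central finite difference of the gradient along its own direction.

```python
def grad_toy(theta, g=1.0, c=0.5):
    """Gradient of the hypothetical toy loss L(x, y) = -g*x - c*x*y."""
    x, y = theta
    return [-g - c * y, -c * x]

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def acceleration(grad_fn, theta, eps=1e-4):
    """Finite-difference estimate of H(theta) @ grad(theta).

    Under gradient flow (d theta / dt = -grad) this product is the
    second time derivative of theta, i.e. the trajectory's acceleration.
    """
    g0 = grad_fn(theta)
    plus = grad_fn([t + eps * g_i for t, g_i in zip(theta, g0)])
    minus = grad_fn([t - eps * g_i for t, g_i in zip(theta, g0)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

theta0 = [0.0, 0.0]
sensitive_direction = [0.0, 1.0]  # assumed alignment-sensitive unit vector

velocity = [-g_i for g_i in grad_toy(theta0)]
first_order = dot(sensitive_direction, velocity)
second_order = dot(sensitive_direction, acceleration(grad_toy, theta0))
print("first-order projection:", first_order)
print("second-order projection:", second_order)
```

In this toy setting the first-order projection vanishes while the second-order projection equals `c * g`, so a check that inspects only the initial gradient would report no risk.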

Under the AIC, the authors prove that fine-tuning induces quadratic growth of the projection onto the alignment-sensitive subspace, which in turn yields a quartic scaling law for alignment degradation: the skill utility loss $\Delta u_i$ scales as $\Omega(\lambda\gamma^2 t^4)$ in early training, where $\lambda$ is the curvature of the sensitive subspace (the corresponding FIM eigenvalue), $\gamma$ quantifies the curvature coupling, and $t$ is training time.
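
The link between quadratic projection growth and quartic loss growth can be checked numerically on a toy problem. The sketch below is not the paper's proof or experimental setup. It assumes a hypothetical two-parameter fine-tuning loss `L(x, y) = -g*x - c*x*y`, treats `y` as the alignment-sensitive direction, and models the alignment penalty as `0.5 * lam * y**2`. In this toy the projection grows roughly as `c*g*t**2 / 2`, so the penalty grows roughly as `lam * (c*g)**2 * t**4 / 8`, which has the same `t^4` shape as the stated scaling law.

```python
import math

def simulate(g=1.0, c=0.5, lam=10.0, dt=1e-3, steps=200):
    """Euler-integrate gradient flow on the toy loss L(x, y) = -g*x - c*x*y.

    x is the task direction and y the assumed alignment-sensitive direction;
    the alignment penalty is modeled as 0.5 * lam * y**2.
    """
    x, y = 0.0, 0.0
    history = []  # entries are (time, projection onto y, alignment penalty)
    for k in range(1, steps + 1):
        grad_x = -g - c * y  # dL/dx
        grad_y = -c * x      # dL/dy, zero at the origin (initial orthogonality)
        x, y = x - dt * grad_x, y - dt * grad_y
        history.append((k * dt, y, 0.5 * lam * y * y))
    return history

def loglog_slope(history, i, j, column):
    """Slope of log(value) against log(time) between two recorded steps."""
    t0, t1 = history[i][0], history[j][0]
    v0, v1 = history[i][column], history[j][column]
    return math.log(v1 / v0) / math.log(t1 / t0)

history = simulate()
projection_slope = loglog_slope(history, 99, 199, column=1)  # close to 2
penalty_slope = loglog_slope(history, 99, 199, column=2)     # close to 4
print(f"projection exponent: {projection_slope:.2f}")
print(f"penalty exponent: {penalty_slope:.2f}")
```

The exponents are read off a log-log slope between two early time points, so they are only approximately 2 and 4; at later times higher-order terms in the toy dynamics take over.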

Empirical Validation: Low-Rank Structure and Overlap Dynamics

To validate the theoretical premises, the paper presents an extensive empirical analysis of the FIM and fine-tuning trajectories in weight space. The FIM, computed block-wise for transformer modules, exhibits sharply decaying eigenvalue spectra, confirming that alignment-sensitive structure is inherently low-rank.

Figure 1: Top eigenvalues of the FIM, approximated over 100 random samples from the safe subset of BeaverTails; the low-rank structure is consistent across multiple layers.
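
A rough sketch of how such a spectrum could be estimated is given below. It assumes the empirical Fisher, the average outer product of per-sample gradients, as the FIM approximation. The gradients are synthetic stand-ins for real per-sample log-likelihood gradients, and the helpers `empirical_fisher` and `top_eigenvalues` are hypothetical names. The paper's block-wise computation over transformer modules is not reproduced here.

```python
import random

def empirical_fisher(grads):
    """Average outer product of per-sample gradients (empirical Fisher)."""
    n = len(grads[0])
    fisher = [[0.0] * n for _ in range(n)]
    for g in grads:
        for i in range(n):
            for j in range(n):
                fisher[i][j] += g[i] * g[j] / len(grads)
    return fisher

def top_eigenvalues(matrix, k, iters=200, seed=0):
    """Top-k eigenvalues of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    n = len(matrix)
    work = [row[:] for row in matrix]
    eigenvalues = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(a * b for a, b in zip(row, v)) for row in work]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        mv = [sum(a * b for a, b in zip(row, v)) for row in work]
        lam = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
        eigenvalues.append(lam)
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return eigenvalues

# Synthetic gradients: two dominant directions plus small isotropic noise.
rng = random.Random(0)
scales = [3.0, 1.0, 0.05, 0.05, 0.05, 0.05]
grads = [[s * rng.gauss(0.0, 1.0) for s in scales] for _ in range(100)]

fisher = empirical_fisher(grads)
eigs = top_eigenvalues(fisher, k=3)
trace = sum(fisher[i][i] for i in range(len(fisher)))
top2_share = (eigs[0] + eigs[1]) / trace
print("top eigenvalues:", [round(e, 4) for e in eigs])
print("share of trace in top 2:", round(top2_share, 4))
```

For real models the matrix is far too large to form explicitly, so in practice one would rely on matrix-free products or per-block approximations; the dense version here only illustrates what a sharply decaying spectrum looks like.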

Next, the authors introduce the Overlap Score (OS), a metric capturing the geometric projection of fine-tuning-induced weight changes onto the alignment-sensitive subspace. OS quantifies the risk of safety degradation from task adaptation by measuring the coupling between update trajectories and the skill-specific FIM.
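
This summary does not state the exact formula for OS, so the following Python sketch uses an assumed definition: the fraction of a weight update's squared norm that lies in the span of an orthonormal basis for the sensitive subspace (for example, leading FIM eigenvectors). The function name `overlap_score` and the example vectors are hypothetical.

```python
def overlap_score(delta, basis):
    """Fraction of ||delta||^2 captured by the span of orthonormal `basis` vectors.

    Assumed definition for illustration; returns a value in [0, 1].
    """
    total = sum(d * d for d in delta)
    if total == 0.0:
        return 0.0  # no update, no overlap
    captured = sum(
        sum(u_i * d_i for u_i, d_i in zip(u, delta)) ** 2 for u in basis
    )
    return captured / total

# Assumed sensitive subspace: the first two coordinate axes of a 4-d weight vector.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

orthogonal_update = [0.0, 0.0, 0.3, -0.4]  # stays outside the subspace
entangled_update = [0.3, 0.0, 0.0, 0.4]    # partly inside the subspace

print(overlap_score(orthogonal_update, basis))  # 0.0
print(overlap_score(entangled_update, basis))   # approximately 0.36
```

Normalizing by the update's squared norm makes the score comparable across datasets whose updates differ in overall magnitude; whether the paper normalizes this way is not stated here.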

Figure 2: Average Overlap Score per Transformer Block for seven fine-tuning datasets; datasets with high overlap induce substantial alignment degradation.

Granular per-module analysis further reveals that certain components (e.g., query and key matrices) exhibit greater entanglement with alignment-sensitive directions.

Figure 3: Per-module Overlap Score per Transformer Block for seven fine-tuning datasets, highlighting module-specific entanglement.

Numerical and Experimental Results

The experimental setup evaluates both LoRA and full fine-tuning on Qwen3-1.7B and LLaMA-3.2-3B using diverse datasets: strictly benign, seemingly benign (adversarially selected but semantically safe), and explicitly harmful. Evaluation via Gemini-2.5-Flash on AdvBench queries demonstrates the following:

All fine-tuning increases harmfulness scores relative to the base model.
Seemingly benign datasets (e.g., Alpaca Top 100, Risky Financial Advice) exhibit harmfulness scores comparable to explicitly harmful datasets.
The magnitude of safety degradation correlates strongly with geometric overlap (OS) in full fine-tuning; this correlation is less consistent for LoRA, presumably due to shifts induced by low-rank adaptation structure ("intruder dimensions").

Theoretical Implications and Failure Modes of Existing Defenses

The quartic scaling law derived from the AIC provides a principled explanation for empirically observed rapid and thresholded safety collapse. First-order defenses (e.g., null-space projection, gradient constraints) are insufficient because they ignore the non-Euclidean, dynamic curvature of parameter space. Even small curvature coupling accelerates drift into alignment-sensitive directions, invalidating static geometric intuitions.

A critical insight is that alignment fragility is not a product of adversarial intent or dataset overlap but an intrinsic geometric property of high-dimensional optimization. The failure of traditional meta-learning guarantees, shared representation bounds, and null-space methods stems from their neglect of curvature-induced rotational dynamics.

Practical Recommendations and Prospective Risk Assessment

The paper's geometric framework enables proactive diagnostic tools:

Estimating Coupling ($\gamma$): Practitioners can compute the curvature coupling parameter before fine-tuning begins to gauge alignment risk.
Monitoring Projections During Training: Early warning signals can be extracted by tracking projection size into alignment-sensitive subspaces.
Curvature-Aware Fine-Tuning: Preservation of alignment demands tracking of dynamically evolving sensitive subspaces and constraints on second-order acceleration.

Figure 4: Top eigenvalues of FIM approximated for multiple modules and layers; eigenvalue spectra reinforce the universality of low-rank structure across the model.
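
The monitoring recommendation above could be prototyped along the following lines. This is a sketch under stated assumptions rather than a tool from the paper: the class `ProjectionMonitor`, its threshold, the fixed sensitive basis, and the toy loss `L(x, y) = -g*x - c*x*y` are all hypothetical. It records how far the weights have moved inside the assumed sensitive subspace after each step and notes the first step at which that distance crosses the threshold.

```python
class ProjectionMonitor:
    """Tracks weight displacement inside an assumed alignment-sensitive subspace."""

    def __init__(self, theta0, basis, threshold):
        self.theta0 = list(theta0)  # weights before fine-tuning
        self.basis = basis          # orthonormal vectors spanning the subspace
        self.threshold = threshold  # alert level for the projection norm
        self.history = []
        self.first_alert_step = None

    def update(self, step, theta):
        delta = [t - t0 for t, t0 in zip(theta, self.theta0)]
        coords = [sum(u_i * d_i for u_i, d_i in zip(u, delta)) for u in self.basis]
        projection = sum(c * c for c in coords) ** 0.5
        self.history.append(projection)
        if self.first_alert_step is None and projection > self.threshold:
            self.first_alert_step = step
        return projection

def toy_grad(theta, g=1.0, c=0.5):
    """Gradient of the hypothetical toy loss L(x, y) = -g*x - c*x*y."""
    x, y = theta
    return [-g - c * y, -c * x]

theta = [0.0, 0.0]
monitor = ProjectionMonitor(theta, basis=[[0.0, 1.0]], threshold=0.05)
lr = 0.01
for step in range(1, 101):
    grad = toy_grad(theta)
    theta = [t - lr * g_i for t, g_i in zip(theta, grad)]  # plain gradient descent
    monitor.update(step, theta)

print("projection after first step:", monitor.history[0])
print("first alert at step:", monitor.first_alert_step)
```

The projection starts at zero and then accelerates, which is the pattern an early-warning signal would need to catch. A fixed basis is a simplification: the recommendations above call for tracking dynamically evolving sensitive subspaces, so a practical monitor would have to re-estimate the basis during training.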

Conclusion

This paper establishes that alignment collapse is an inevitable consequence of the geometric interaction between task fine-tuning trajectories and high-curvature, low-dimensional alignment-sensitive regions of the parameter space. The AIC characterizes the mechanism and yields rapid ($t^4$) scaling of safety degradation even for benign tasks. Future strategies for robust alignment preservation must move beyond static first-order intuitions and adopt curvature-aware geometric controls. The implications are substantial for open-weight model deployment and foundational safety frameworks: they call for a shift from reactive red-teaming to predictive, curvature-aware diagnostics.