
Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

Published 7 Jan 2026 in cs.LG and cs.AI | (2601.04262v1)

Abstract: Safety alignment in LLMs inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of "high-conflict" heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.

Summary

  • The paper introduces CAST, a framework that leverages detailed head-level diagnostics to precisely identify high-conflict attention heads affecting LLM safety and utility.
  • It employs a unified conflict score combining gradient opposition and functional sensitivity, enabling budget-matched sparse fine-tuning for robust alignment.
  • Empirical results on multiple LLMs demonstrate enhanced Pareto efficiency, with improved MMLU accuracy and reduced alignment-induced capability loss.

Head-Level Structural Diagnosis for Safety-Utility Alignment in LLMs

Introduction

The systematic degradation of general capabilities during safety alignment procedures in LLMs—commonly termed the "alignment tax"—poses an intrinsic multi-objective optimization challenge. Prevailing mitigation strategies treat the transformer architecture as a homogeneous monolith, leveraging global geometric approaches such as gradient projection (e.g., PCGrad) or direction balancing (e.g., CAGrad). These methods fail to exploit the modular heterogeneity revealed by mechanistic interpretability research, specifically the functional specialization and idiosyncratic sensitivity of individual attention heads. This work introduces Conflict-Aware Sparse Tuning (CAST), a framework that leverages head-level conflict analysis for surgical, parameter-efficient safety alignment via sparse fine-tuning.

Modular Characterization of Alignment Conflict

Through pre-alignment head-level diagnostic probing, the study demonstrates that the safety-utility trade-off is fundamentally structural rather than global. Alignment-induced capability degradation is principally attributed to a localized subset of "high-conflict" heads—those with strong geometric gradient opposition and high functional sensitivity to utility objectives. By constructing a head-wise conflict map that combines gradient direction antagonism (optimization conflict) and causal load (functional sensitivity), CAST enables precise identification of risk-intensive regions within the model.

CAST Head-Level Conflict Metric

The unified conflict score for each head h is formulated as

C(h) = O(h) · S(h)

where O(h) quantifies the geometric opposition between safety and utility gradients (measured via normalized cosine distance) and S(h) captures functional sensitivity via zero-shot ablation impacts on utility and safety behaviors. The exponential scaling in the sensitivity definition emphasizes heads with high utility dependency but low safety impact.
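As a concrete illustration, the conflict score can be sketched in plain Python. The normalized cosine-distance form of O(h) follows the description above; the exponential form of S(h) is a hypothetical stand-in chosen to match the stated behavior (emphasizing heads whose ablation hurts utility much more than safety), since the paper's exact definition is not reproduced here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def opposition(g_safety, g_utility):
    # Normalized cosine distance in [0, 1]; 1.0 means fully opposed gradients.
    return (1.0 - cosine(g_safety, g_utility)) / 2.0

def sensitivity(delta_utility, delta_safety):
    # Hypothetical exponential scaling of zero-shot ablation impacts:
    # large when ablating the head costs utility but barely affects safety.
    return math.exp(delta_utility - delta_safety)

def conflict_score(g_safety, g_utility, delta_utility, delta_safety):
    # Unified score C(h) = O(h) * S(h).
    return opposition(g_safety, g_utility) * sensitivity(delta_utility, delta_safety)
```

A head whose safety and utility gradients point in opposite directions, and whose ablation mainly degrades utility, receives the highest score and lands in the risk zone.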

Sparse Selective Fine-Tuning Protocol

The practical regime for intervention is a "budget-matched" sparse fine-tuning, where the total number of updated parameters remains fixed, but the trainable heads are selected according to the conflict ranking. CAST partitions heads into buckets by conflict score, empirically validating that freezing high-conflict heads (risk zone) while updating low-conflict heads (safe zone) yields robust safety improvement with minimal utility loss. The protocol ensures that observed performance variances are causally associated with structural head selection rather than parameter count.
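The budget-matched selection described above reduces to a ranking over per-head conflict scores. The sketch below is an illustrative assumption about that step (the function names and the 25% budget are not taken from the paper): heads are sorted by conflict, the lowest-conflict fraction forms the trainable safe zone, and an equally sized high-conflict risk zone is identified for comparison, so both conditions update the same number of parameters.

```python
def select_trainable_heads(conflict_scores, budget_fraction=0.25):
    """Budget-matched selection: train only the lowest-conflict heads.

    conflict_scores: dict mapping head identifier -> conflict score C(h).
    Returns (safe_zone, risk_zone), each containing the same number of heads.
    """
    ranked = sorted(conflict_scores, key=conflict_scores.get)  # ascending C(h)
    budget = max(1, int(len(ranked) * budget_fraction))
    safe_zone = set(ranked[:budget])    # low-conflict heads: update
    risk_zone = set(ranked[-budget:])   # high-conflict heads: freeze
    return safe_zone, risk_zone
```

In an actual fine-tuning run, the risk-zone heads would simply have their gradients masked or their parameters marked non-trainable, while the optimizer updates only the safe zone.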

Empirical Validation and Numerical Results

Experiments on Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Mistral-7B-v0.2 confirm three principal findings:

  • Conflict Localization: High conflict is spatially sparse and predominantly situated in middle-to-deep layers, with only a minority of attention heads implicated.
  • Superior Pareto Efficiency: CAST-SFT (Safe Zone) consistently outperforms Full-SFT and random selection baselines, e.g., on Llama, yielding a +9.45% gain in MMLU accuracy while maintaining or improving safety metrics.
  • Diagnostic Predictiveness: The head-level conflict score is highly predictive of post-alignment capability losses (Pearson r ∈ [0.73, 1.00] for utility cost ratio), supporting its efficacy as a practical tool for risk anticipation and mitigation.
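The predictiveness finding rests on a standard Pearson correlation between per-head conflict scores and the capability loss observed after alignment. A minimal self-contained sketch of that check (the inputs here are placeholders, not the paper's data):

```python
import math

def pearson_r(xs, ys):
    # Standard sample Pearson correlation coefficient.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1.0 between pre-alignment conflict scores and post-alignment utility cost is what supports using the diagnosis for risk anticipation before any tuning is run.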

Combining head selection with geometric alignment methods (e.g., CAST + PCGrad) further extends the Pareto frontier, demonstrating the orthogonal benefits of modular diagnosis and geometric optimization.
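For reference, the PCGrad-style projection that head selection can be combined with resolves a pairwise gradient conflict by projecting one objective's gradient onto the normal plane of the other whenever the two oppose. A minimal sketch on plain Python lists (the real method operates on full parameter gradients per task):

```python
def pcgrad_project(g_task, g_other):
    # If the gradients conflict (negative dot product), remove the
    # component of g_task that points against g_other; otherwise
    # leave g_task unchanged.
    dot = sum(a * b for a, b in zip(g_task, g_other))
    if dot >= 0:
        return list(g_task)
    norm_sq = sum(b * b for b in g_other)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g_task, g_other)]
```

Because this projection acts on gradient directions while CAST acts on which parameters receive updates at all, the two mechanisms compose naturally, which is consistent with the reported extension of the Pareto frontier.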

Analysis of Failure Modes

Qualitative investigation exposes characteristic failures of global alignment methods: (1) Collapsed reasoning pathways (loss of CoT capability) and (2) Safety over-refusal (misclassification of benign queries as unsafe). CAST, by bypassing high-conflict heads, demonstrably prevents both forms of degradation while maintaining precise boundaries for safety refusal.

Robustness, Ablation, and Limitations

Robustness analyses confirm that sparse calibration (e.g., 100 samples per domain) suffices for accurate conflict mapping, and that the performance peak is sharply concentrated in the lowest-conflict (Bottom-25%) bucket—adding more heads dilutes efficacy and increases utility risk. The method's limitations include its current focus on query projections (extension to additional submodules such as MLPs is needed), static pre-alignment diagnosis that may not capture conflict drift, and reliance on calibration data that may require domain adaptation for specialized use cases.

Implications and Future Directions

CAST promotes a paradigm shift away from undifferentiated, global model interventions towards structure-aware, interpretable alignment strategies. This reframing has substantial practical implications for the design of scalable, safe model deployment pipelines, reducing both utility loss and operational cost (the one-off diagnostic overhead is minor compared to repeated full-model tuning). The method's diagnostic utility suggests future avenues in adaptive, online conflict monitoring, extension to MLPs and other modules, and automated structure-aware regularization in multi-objective LLM training. Open theoretical questions remain regarding the generalization of local conflict maps across domains and architectures.

Conclusion

Safety alignment-induced capability loss in LLMs originates from the update of a sparse, functionally sensitive subset of heads exhibiting high optimization conflict—not from a global, diffuse process. The Conflict-Aware Sparse Tuning framework advances alignment by integrating modular interpretability with optimization. CAST's head-level diagnosis consistently enables fine-grained, parameter-efficient safety tuning, outperforms global baselines, and introduces a powerful predictive metric for preemptive risk management. The modular perspective on safety-utility conflicts developed herein sets a new trajectory for research on robust, interpretable LLM alignment.
