Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models (2503.20807v1)
Abstract: Fine-tuning large language models (LLMs) on task-specific datasets has become a primary way of adapting them to downstream applications. However, it has been empirically observed that enhancing capability in this way inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, and these limits are validated by numerical experiments.
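To make the trade-off concrete, below is a minimal, hypothetical sketch, not the paper's framework, of one common safety-aware fine-tuning strategy: mixing a safety (alignment) loss with a task loss and sweeping the mixing weight to trace an empirical safety-capability frontier. The toy quadratic losses, the optima `theta_safe` and `theta_task`, the weight `lam`, and the `finetune` helper are all illustrative assumptions.

```python
import numpy as np

# Toy quadratic losses with distinct optima: theta_safe minimizes the
# safety (alignment) loss, theta_task minimizes the capability loss.
# Because the optima differ, no single parameter vector minimizes both,
# which is the essence of a safety-capability trade-off.
rng = np.random.default_rng(0)
dim = 8
theta_safe = rng.normal(size=dim)   # hypothetical aligned optimum
theta_task = rng.normal(size=dim)   # hypothetical task optimum

def safety_loss(theta):
    return 0.5 * np.sum((theta - theta_safe) ** 2)

def task_loss(theta):
    return 0.5 * np.sum((theta - theta_task) ** 2)

def finetune(lam, steps=500, lr=0.05):
    """Gradient descent on the mixed objective
    L(theta) = lam * safety_loss + (1 - lam) * task_loss,
    starting from the aligned model theta_safe."""
    theta = theta_safe.copy()
    for _ in range(steps):
        grad = lam * (theta - theta_safe) + (1 - lam) * (theta - theta_task)
        theta -= lr * grad
    return theta

# Sweep the mixing weight: larger lam preserves safety at the cost of
# task loss, tracing an empirical trade-off frontier.
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    theta = finetune(lam)
    print(f"lam={lam:4.2f}  safety_loss={safety_loss(theta):6.3f}  "
          f"task_loss={task_loss(theta):6.3f}")
```

In this toy setting the frontier is exactly linear in `lam` (the mixed quadratic is minimized at `lam * theta_safe + (1 - lam) * theta_task`); the paper's contribution is characterizing how such frontiers behave, and where their fundamental limits lie, for actual LLM fine-tuning.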