Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models (2406.10288v2)

Published 12 Jun 2024 in cs.CL and cs.LG

Abstract: Fine-tuning LLMs on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work addresses the gap in our understanding of these risks across diverse types of data in closed models - where providers control how user data is utilized in the fine-tuning process. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.
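The mitigation described in the abstract, mixing in safety data that mimics the task format and prompting style of the user data, can be illustrated with a minimal sketch. The abstract does not specify the actual procedure, so everything here is an assumption for illustration: the function names `mimic_format` and `mix_in_safety`, the `{input}` template placeholder, the mixing `ratio`, and the representation of examples as `(prompt, response)` pairs are all hypothetical.

```python
import random


def mimic_format(safety_prompt, template):
    """Wrap a safety prompt in the same template used by the user's task data.

    `template` is a hypothetical format string with a single `{input}` slot.
    """
    return template.format(input=safety_prompt)


def mix_in_safety(user_data, safety_pairs, template, ratio=0.1, seed=0):
    """Return the user data plus format-mimicking safety examples, shuffled.

    user_data:    list of (prompt, response) pairs supplied by the user
    safety_pairs: list of (harmful_prompt, refusal) pairs held by the provider
    ratio:        assumed fraction of safety examples relative to user data
    """
    rng = random.Random(seed)
    # Never request more safety examples than are available.
    n_safety = min(len(safety_pairs), round(ratio * len(user_data)))
    chosen = rng.sample(safety_pairs, n_safety)
    # Re-render each safety prompt so it looks like a task example.
    mimicked = [(mimic_format(prompt, template), refusal) for prompt, refusal in chosen]
    mixed = list(user_data) + mimicked
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    template = "Question: {input}\nAnswer:"
    user_data = [(template.format(input=f"What is {i} + {i}?"), str(2 * i)) for i in range(10)]
    safety_pairs = [
        ("<harmful request A>", "I can't help with that."),
        ("<harmful request B>", "I can't help with that."),
        ("<harmful request C>", "I can't help with that."),
    ]
    mixed = mix_in_safety(user_data, safety_pairs, template, ratio=0.2, seed=1)
    print(len(mixed), "examples in the mixed fine-tuning set")
```

In this sketch the provider, not the user, performs the mixing, which matches the closed-model setting the abstract describes; how the paper selects safety data and sets the mixing proportion is not stated here and may differ.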

Authors (5)
  1. Francisco Eiras (17 papers)
  2. Aleksandar Petrov (21 papers)
  3. Phillip H. S. Torr (3 papers)
  4. M. Pawan Kumar (48 papers)
  5. Adel Bibi (53 papers)
Citations (3)